Open xuewyang opened 4 years ago
Hi,
I just finished reimplementing your model, because I am not good at allennlp and I have to build my own methods. I found 3 interesting points:
Thanks.
The adaptive embedding method is used mainly to save memory so that we don't have to compute softmax on the entire 50K tokens at every step. Check out the paper that proposed it.
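To make the memory argument concrete, here is a minimal sketch that uses PyTorch's built-in nn.AdaptiveLogSoftmaxWithLoss as a stand-in. It is not the adaptive.py from this repository, and the hidden size and cutoffs are assumptions chosen only for illustration. The idea it shows is that frequent tokens are scored in a small head, while rarer tokens live in lower-dimensional tail clusters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 50_000
hidden_size = 64             # hypothetical decoder hidden size
cutoffs = [5_000, 20_000]    # hypothetical frequency bands (vocab assumed sorted by frequency)

# Adaptive softmax: a small head for frequent tokens plus two tail clusters
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    hidden_size, vocab_size, cutoffs=cutoffs, div_value=4.0
)
# Ordinary output projection over the full vocabulary, for comparison
full = nn.Linear(hidden_size, vocab_size)

hidden = torch.randn(8, hidden_size)            # 8 decoder states
targets = torch.randint(0, vocab_size, (8,))    # 8 gold token IDs

result = adaptive(hidden, targets)
loss = result.loss                              # mean negative log-likelihood

# The head only scores the shortlist plus one logit per tail cluster
head_outputs = adaptive.head.out_features
adaptive_params = sum(p.numel() for p in adaptive.parameters())
full_params = sum(p.numel() for p in full.parameters())

print("head size:", head_outputs)
print("adaptive params:", adaptive_params, "full softmax params:", full_params)
```

With these made-up sizes, the head scores far fewer than 50K entries per step, and the tail clusters are only evaluated for the targets that fall into them.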
When you say 1 epoch, do you mean going through the whole dataset once? In my implementation, an epoch is actually only 65536 captions (instead of the entire dataset). See the parameter instances_per_epoch in the config files. It's defined like this so that we can save a checkpoint every hour. My model would probably also take 7 hours to go through every training caption once.
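The 7-hour estimate can be checked with a quick calculation. This sketch assumes roughly 420K training captions (the "> 420K" figure quoted later in this thread) and about one hour per 65536-caption epoch; both numbers are approximations.

```python
import math

# Figures quoted in this thread; the dataset size is only approximate.
total_captions = 420_000       # "> 420K" training captions (rough figure)
instances_per_epoch = 65_536   # the instances_per_epoch value discussed above

# Number of 65536-caption "epochs" needed to see every caption at least once
epochs_per_full_pass = math.ceil(total_captions / instances_per_epoch)

# Assumption: one such epoch takes about an hour (the checkpoint interval)
hours_per_epoch = 1.0
hours_per_full_pass = epochs_per_full_pass * hours_per_epoch

print(epochs_per_full_pass, hours_per_full_pass)
```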
Difficult to tell the speed just by looking at the code. The best way to figure out the slowest part of the code is to use profiling. You can run python -m cProfile -o profile_results.prof your_training_script.py. And then you can visualise and see exactly how long each function in your code takes with snakeviz profile_results.prof (install snakeviz with pip install snakeviz).
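The same profiling can also be done from inside Python with the standard-library cProfile and pstats modules. The functions train and slow_step below are hypothetical stand-ins for a real training script; the sketch only shows how to collect the statistics and list the most expensive calls.

```python
import cProfile
import io
import pstats


def slow_step(n: int) -> int:
    """Hypothetical stand-in for one expensive part of a training loop."""
    return sum(i * i for i in range(n))


def train() -> int:
    """Hypothetical stand-in for the training script's main function."""
    total = 0
    for _ in range(50):
        total += slow_step(10_000)
    return total


# Collect timing data only while train() is running
profiler = cProfile.Profile()
profiler.enable()
train()
profiler.disable()

# Format the ten entries with the largest cumulative time into a string
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Calling profiler.dump_stats("profile_results.prof") afterwards would write a file in the same format as the command-line invocation, which snakeviz can then open.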
Ah good catch. Looks like I forgot to set both roberta and resnet to eval mode. It should've been done just after we initialize these objects. You might get a better text/image representation after the fix (I think all roberta.eval() does is turn off dropout). The weights of roberta and resnet are always frozen - this is controlled by the no_grad parameter in the config files.
Gotcha.
But adaptive embedding is not using the roberta embedding, right? I think adaptive embedding is based on nn.Embedding. I see this code: embed = nn.Embedding(vocab_size, embed_size, padding_idx)
1 epoch means the whole training dataset (> 420K). So it seems we have the same training speed, since your epoch contains only 65536 captions. About the dataset: I know you pad the caption tokens to the longest caption in a batch. Do you achieve this with the following code? How is desired_num_tokens[key] defined?
@overrides
def as_padded_tensor(self,
                     tokens: Dict[str, List[int]],
                     desired_num_tokens: Dict[str, int],
                     padding_lengths: Dict[str, int]) -> Dict[str, torch.Tensor]:  # pylint: disable=unused-argument
    padded_dict: Dict[str, torch.Tensor] = {}
    for key, val in tokens.items():
        if 'copy_masks' in key:
            def default_value(): return -1
        else:
            def default_value(): return self._padding_value
        padded_val = pad_sequence_to_length(sequence=val,
                                            desired_length=desired_num_tokens[key],
                                            default_value=default_value,
                                            padding_on_right=self._padding_on_right)
        padded_dict[key] = torch.LongTensor(padded_val)
    return padded_dict
I am trying to put roberta.eval() in the code, but we call TransformerFlatten.train(), and roberta lives inside the TransformerFlatten class, so I don't think roberta.eval() will actually take effect. Maybe you have some ideas?
In the config file, I am wondering how the following lines work: parameter_groups:
For example, where in allennlp or your code does no_grad take effect? In which .py file is no_grad used? And in which file is parameter_groups handled?
1) You're right. The only thing the decoder and the roberta encoder have in common is the vocabulary file. The embeddings are different. I remember trying the idea of initialising the adaptive embeddings with the pretrained embedding matrix from roberta, but the performance wasn't as good, so I just ended up randomly initializing the adaptive embeddings. You can try this idea again and see if you can get better results.
2) Yeah that's where it's padded. desired_num_tokens is defined here (also check out get_padding_lengths above it). Btw I think you accidentally mentioned the user OVERRIDES in your comment :-P Better to use a code block next time.
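As a rough illustration of the padding step discussed above, here is a plain-Python sketch. It is a simplification and not AllenNLP's actual implementation: the key names, the token IDs, and the padding index of 1 are assumptions, and desired_num_tokens is simply taken to be the longest sequence under each key in the batch.

```python
from typing import Dict, List


def pad_to_length(sequence: List[int], desired_length: int, pad_value: int) -> List[int]:
    """Truncate or right-pad `sequence` so it has exactly `desired_length` items."""
    trimmed = sequence[:desired_length]
    return trimmed + [pad_value] * (desired_length - len(trimmed))


def pad_batch(batch: List[Dict[str, List[int]]],
              padding_value: int = 1) -> Dict[str, List[List[int]]]:
    """Pad every field of every sample to the longest sample in this batch."""
    # Target length per key: the longest sequence under that key in the batch
    desired_num_tokens = {
        key: max(len(sample[key]) for sample in batch) for key in batch[0]
    }
    padded = {}
    for key, length in desired_num_tokens.items():
        # Copy masks are padded with -1, everything else with the padding index
        pad_value = -1 if "copy_masks" in key else padding_value
        padded[key] = [pad_to_length(sample[key], length, pad_value) for sample in batch]
    return padded


# Made-up token IDs and hypothetical key names
batch = [
    {"roberta": [0, 713, 16, 2], "roberta_copy_masks": [0, 1, 1, 0]},
    {"roberta": [0, 2], "roberta_copy_masks": [0, 0]},
]
padded = pad_batch(batch)
print(padded["roberta"])
print(padded["roberta_copy_masks"])
```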
3) I guess it should be safe to set roberta.eval() just before you call extract_features. It will just overwrite whatever mode was set by the parent. In PyTorch, when you run roberta.eval(), it will only affect the roberta module and its children. The parent class is unaffected. See how train() and eval() are implemented here.
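The interaction between train() and eval() can be seen in a small experiment. Encoder and Captioner below are hypothetical stand-ins for roberta and the parent model. The sketch shows that train() on the parent recurses into children, that eval() on a child leaves the parent alone, and that calling eval() right before the encoder is used keeps it in eval mode.

```python
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical stand-in for the frozen roberta module."""

    def __init__(self) -> None:
        super().__init__()
        self.dropout = nn.Dropout(p=0.1)


class Captioner(nn.Module):
    """Hypothetical stand-in for the parent model that owns the encoder."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = Encoder()
        self.encoder.eval()  # set once at construction time

    def encoder_mode_at_use(self) -> bool:
        # Re-assert eval mode right before the encoder would be called
        self.encoder.eval()
        return self.encoder.training


model = Captioner()

# train() on the parent recurses into every child, undoing the eval() from __init__
model.train()
mode_after_parent_train = model.encoder.training

# eval() on the child changes the child only
model.encoder.eval()
parent_mode = model.training
child_mode = model.encoder.training

# Even after another parent train(), eval() just before use wins
model.train()
mode_at_use = model.encoder_mode_at_use()

print(mode_after_parent_train, parent_mode, child_mode, mode_at_use)
```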
4) no_grad is used by AllenNLP here. It just sets the requires_grad attribute of parameters that match the regex to False. parameter_groups is not really used anywhere. If you want to set different learning rates on different layers, you can add custom parameters inside those {} brackets and it will get parsed here and then passed down to the optimizer (e.g. Adam).
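In plain PyTorch, the freezing behaviour described above might look like the following sketch. It does not reproduce AllenNLP's code: the module names, the regex, and the learning rate are hypothetical. It only illustrates matching parameter names against a pattern, turning off requires_grad, and handing the remaining parameters to the optimizer as a parameter group.

```python
import re

import torch
import torch.nn as nn

model = nn.ModuleDict({
    "roberta": nn.Linear(4, 4),   # hypothetical stand-in for the pretrained encoder
    "decoder": nn.Linear(4, 4),   # hypothetical stand-in for the trainable decoder
})

no_grad_patterns = ["^roberta"]   # hypothetical regex list

# Freeze every parameter whose name matches one of the patterns
for name, parameter in model.named_parameters():
    if any(re.search(pattern, name) for pattern in no_grad_patterns):
        parameter.requires_grad_(False)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]

# A parameter group is a dict of parameters plus per-group optimizer options
optimizer = torch.optim.Adam(
    [{"params": [p for p in model.parameters() if p.requires_grad], "lr": 1e-4}]
)

print("trainable:", trainable)
print("frozen:", frozen)
```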
Yes, thank you. I will use a code block next time.
Hi, can you explain the following questions?
Thanks.
For the eval() mode problem, the only thing we can do is call eval() every time before extract_features, like this:
self.roberta = self.roberta.eval()
X_sections_hiddens = self.roberta.extract_features(
    article_ids, return_all_hiddens=True)
After testing, I found that calling eval() only in the init function is not enough.
1) Yeah I don't think that code will ever run when we force all generated captions to be less than 100 tokens. register_buffer stores a tensor as part of the module's state without making it a parameter, so PyTorch doesn't compute a gradient for it and the optimizer never updates it. We want this for the Sinusoidal Positional Embeddings since they are just a bunch of sines and cosines, i.e. the embedding for each position is always fixed. For TokenEmbedder, we need the gradients to learn the embedding weights.
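The difference between a buffer and a parameter can be checked directly. TinyEmbedder is a hypothetical module written for this illustration, not the repository's embedder: it keeps a fixed sinusoidal table as a buffer and a learned token embedding as a parameter.

```python
import math

import torch
import torch.nn as nn


class TinyEmbedder(nn.Module):
    """Hypothetical module: fixed sinusoidal positions plus learned token embeddings."""

    def __init__(self, vocab_size: int, max_positions: int, dim: int) -> None:
        super().__init__()
        # dim is assumed to be even so sines and cosines fill alternating columns
        positions = torch.arange(max_positions, dtype=torch.float).unsqueeze(1)
        frequencies = torch.exp(
            torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim)
        )
        table = torch.zeros(max_positions, dim)
        table[:, 0::2] = torch.sin(positions * frequencies)
        table[:, 1::2] = torch.cos(positions * frequencies)
        # Buffer: saved in state_dict and moved by .to(), but not a parameter
        self.register_buffer("positional", table)
        # Parameter: updated by the optimizer during training
        self.tokens = nn.Embedding(vocab_size, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.tokens(ids) + self.positional[: ids.shape[-1]]


module = TinyEmbedder(vocab_size=10, max_positions=16, dim=8)
parameter_names = [name for name, _ in module.named_parameters()]
buffer_names = [name for name, _ in module.named_buffers()]
output = module(torch.tensor([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]))

print("parameters:", parameter_names)
print("buffers:", buffer_names)
print("buffer requires_grad:", module.positional.requires_grad)
```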
2) Perhaps you're looking at different batches? The shape [15, 72] should mean that there are 15 samples in that batch and the longest article in that batch has 72 tokens. Inside that loop, try printing out the context field of each entry in batch['metadata'] and that will tell you the actual text that the IDs in batch['context']['roberta'] encode. At a high level, all roberta_indexer does is convert the raw text into the IDs. The data iterator then shuffles the samples and yields batches of samples with similar sizes. Because of the shuffling, the easiest way to find out the raw text corresponding to the IDs is to look at batch['metadata'].
3) Yep it makes sense to put eval() directly before calling extract_features since the parent class probably overwrites the mode at the start of each epoch.
1) I just ran the evaluation script and added the following debug code:
print('tensor shape:', batch['context']['roberta'].shape)
print('raw text length:', end=' ')
for sample in batch['metadata']:
    print(len(sample['context'].split()), end=' ')
Example output:
tensor shape: torch.Size([15, 385])
raw text length: 194 239 203 258 243 215 275 199 275 257 258 224 267 264 280
This looks reasonable to me. If you're still having issues, send me some debug code that reproduces your problem.
2) Both sorting_keys and maximum_samples_per_batch are used by the Bucket Iterator and its parent class. You can start here.
2) I simply assumed the roberta vocab is already sorted by frequency, so I didn't do any further sorting. I couldn't find any reference for this, but scanning through the first few hundred tokens in the vocab, it seems like a reasonable assumption.
Maybe I understand now. If we take a look at the following output, the tensor shape and the raw text length get bigger with more iterations. So my guess is that the dataloader groups samples of similar sizes together and processes them together. The samples are sorted by the sorting keys.
tensor shape: torch.Size([15, 54])
raw text length: 16 19 22 30 29 31 32 32 38 34 29 26 39 39 33
tensor shape: torch.Size([15, 72])
raw text length: 33 33 30 40 36 41 38 48 47 38 49 43 55 55 50
tensor shape: torch.Size([15, 93])
raw text length: 50 63 52 55 54 60 65 61 71 57 60 57 63 59 56
.........................................
torch.Size([15, 266])
raw text length: 193 180 176 174 190 166 181 174 213 188 213 182 160 195 182
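The grouping behaviour guessed at above can be sketched in a few lines of plain Python. This is a simplified illustration and not the Bucket Iterator's actual code: the function name, the random article lengths, and the exact sorting are assumptions, and details of the real iterator (such as limits on samples per batch) are left out.

```python
import random
from typing import List


def bucket_batches(lengths: List[int], batch_size: int,
                   shuffle: bool, seed: int = 0) -> List[List[int]]:
    """Group sample indices into batches whose samples have similar lengths."""
    # Sort sample indices by the sorting key (here: number of tokens)
    order = sorted(range(len(lengths)), key=lambda index: lengths[index])
    batches = [order[start:start + batch_size]
               for start in range(0, len(order), batch_size)]
    if shuffle:
        # Shuffle whole batches, so each batch still holds similar lengths
        random.Random(seed).shuffle(batches)
    return batches


rng = random.Random(0)
lengths = [rng.randint(16, 280) for _ in range(60)]   # made-up article lengths

batches = bucket_batches(lengths, batch_size=15, shuffle=False)
for batch in batches:
    batch_lengths = [lengths[i] for i in batch]
    # The padded width of a batch is set by its longest sample
    print(len(batch), max(batch_lengths), batch_lengths)
```

Because samples of similar length end up together, each batch needs little padding, and without shuffling the batch width grows from one batch to the next, as in the output above.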
Do you have any experience with learning rate tuning? My batch size is now 10 times larger. Should I try a bigger lr? Also, you are using the bert adam optimizer, but the model is not bert. Why not use adam or adamw?
I think you just need to do some hyperparameter search on the validation set to find the best learning rate. I don't think we can say in advance what the best lr is.
The bert adam optimizer is just adamw that has a linear learning rate scheduler with warmup built in.
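A linear schedule with warmup can be written as a small function. This is a generic sketch of the idea rather than the exact formula used by the bert adam optimizer; the peak learning rate, warmup fraction, and step counts are made-up values.

```python
def warmup_linear_lr(step: int, total_steps: int,
                     peak_lr: float, warmup_fraction: float) -> float:
    """Learning rate that ramps up linearly, then decays linearly to zero."""
    progress = step / total_steps
    if progress < warmup_fraction:
        # Warmup phase: scale from 0 up to the peak learning rate
        return peak_lr * progress / warmup_fraction
    # Decay phase: scale from the peak back down to 0 at the final step
    return peak_lr * max(0.0, (1.0 - progress) / (1.0 - warmup_fraction))


# Sample the schedule every 100 steps over a hypothetical 1000-step run
schedule = [
    warmup_linear_lr(step, total_steps=1000, peak_lr=1e-4, warmup_fraction=0.1)
    for step in range(0, 1001, 100)
]
for step, lr in zip(range(0, 1001, 100), schedule):
    print(step, lr)
```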
Hi Alasdair,
It seems that you are using an adaptive embedding method defined in adaptive.py. I am wondering why not use roberta embedding method since you are using roberta as the encoder. Thanks.
Xuewen