alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

adaptive embedder #8

Open xuewyang opened 4 years ago

xuewyang commented 4 years ago

Hi Alasdair,

It seems that you are using an adaptive embedding method defined in adaptive.py. I am wondering why you don't use the roberta embeddings, since you are using roberta as the encoder. Thanks.

Xuewen

xuewyang commented 4 years ago

Hi,

I just finished reimplementing your model, because I am not good at allennlp and had to build my own methods. I found 3 interesting points:

  1. My batch size is now bigger: 25 vs. the 16 you used (the transformer_flattened one) on an 11 GB 1080 Ti.
  2. My speed is very slow: training takes about 7 hours per epoch vs. your 1 hour. I am thinking maybe it is because I am encoding the articles one by one while you are doing batch encoding? I have the following code in my dataset.py, where I define a dataset class (self.tokenizer_cap and self.tokenizer_con are the roberta tokenizers, as you defined them). Can you tell me how your batch data is processed in allennlp?

     def __getitem__(self, i):
         image_path = os.path.join(self.image_dir, self.img_ids[i] + '.jpg')
         image = Image.open(image_path)
         image = self.preprocess(image)
         caption = self.captions[i]
         caption_ids = self.tokenizer_cap.encode(caption)
         context = self.articles[i]
         context_ids = self.tokenizer_con.encode(context)
         return {
             'image': image,
             'caption_text': caption,
             'cap_input_ids': torch.tensor(caption_ids),
             'con_input_ids': torch.tensor(context_ids),
         }
  3. I know you are not fine-tuning roberta, but you don't call roberta.eval() before roberta.extract_features. I am wondering where you call eval() on roberta, or set requires_grad = False as in typical pytorch code.

Thanks.

alasdairtran commented 4 years ago

The adaptive embedding method is used mainly to save memory so that we don't have to compute softmax on the entire 50K tokens at every step. Check out the paper that proposed it.
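
Roughly, the idea looks like this, sketched with PyTorch's built-in adaptive softmax (the cutoff values below are illustrative, not the exact configuration in this repo):

import torch
import torch.nn as nn

# Frequent tokens live in the "head" with the full hidden size, rarer clusters
# get progressively smaller projections, so we never do a full 50K softmax.
hidden_dim, vocab_size = 1024, 50000
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    hidden_dim, vocab_size, cutoffs=[5000, 20000], div_value=4.0)

hidden = torch.randn(8, hidden_dim)           # fake decoder outputs for 8 positions
targets = torch.randint(0, vocab_size, (8,))  # fake gold token ids
out = adaptive_softmax(hidden, targets)
print(out.loss)                               # negative log-likelihood of the targets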

When you say 1 epoch, do you mean going through the whole dataset once? The way an epoch is defined in my implementation is actually only 65536 captions (instead of the entire dataset). See the parameter instances_per_epoch in the config files. It's defined like this so that we can save a checkpoint every hour. My model would probably also take 7 hours to go through every training caption once.

It's difficult to tell why it's slow just by looking at the code. The best way to find the slowest part is to profile it. You can run python -m cProfile -o profile_results.prof your_training_script.py, and then visualise exactly how long each function in your code takes with snakeviz profile_results.prof (install snakeviz with pip install snakeviz).

Ah, good catch. Looks like I forgot to set both roberta and resnet to eval mode. It should've been done just after we initialize these objects. You might get a better text/image representation after the fix (I think all roberta.eval() does is turn off dropout). The weights of roberta and resnet are always frozen though - this is controlled by the no_grad parameter in the config files.
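
For reference, the fix amounts to something like this (a sketch with a generic module standing in for roberta/resnet):

import torch.nn as nn

encoder = nn.TransformerEncoderLayer(d_model=768, nhead=12)  # stand-in for roberta/resnet
encoder.eval()                     # turn off dropout (and batch-norm updates, if any)
for param in encoder.parameters():
    param.requires_grad = False    # freeze the weights, i.e. what `no_grad` achieves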

xuewyang commented 4 years ago

Gotcha.

  1. But the adaptive embedding is not using the roberta embeddings, right? I think the adaptive embedding is based on nn.Embedding - I see this code: embed = nn.Embedding(vocab_size, embed_size, padding_idx)

  2. 1 epoch means the whole training dataset (> 420K captions). It seems we have the same training speed then, since your epoch only contains 65536 captions. About the dataset: I know you are padding the caption tokens to the longest caption in a batch. Do you achieve this with the following code? How is desired_num_tokens[key] defined?

     @overrides
     def as_padded_tensor(self,
                          tokens: Dict[str, List[int]],
                          desired_num_tokens: Dict[str, int],
                          padding_lengths: Dict[str, int]) -> Dict[str, torch.Tensor]:
         # pylint: disable=unused-argument
         padded_dict: Dict[str, torch.Tensor] = {}
         for key, val in tokens.items():
             if 'copy_masks' in key:
                 def default_value():
                     return -1
             else:
                 def default_value():
                     return self._padding_value
             padded_val = pad_sequence_to_length(sequence=val,
                                                 desired_length=desired_num_tokens[key],
                                                 default_value=default_value,
                                                 padding_on_right=self._padding_on_right)
             padded_dict[key] = torch.LongTensor(padded_val)
         return padded_dict

  3. I am trying to put roberta.eval() in the code, but it seems that TransformerFlatten.train() gets called and roberta lives inside the TransformerFlatten class, so I don't think roberta.eval() will actually stick. Maybe you have some ideas?

  4. In the config file, I am wondering how the following lines work:

     parameter_groups:
       - [[^decoder.embedder], {}]
       - [[^decoder.layers.0], {}]
       - [[^decoder.layers.1], {}]
       - [[^decoder.layers.2], {}]
       - [[^decoder.layers.3], {}]
       - [[^decoder.adaptive_softmax], {}]
     no_grad:
       - ^resnet
       - ^roberta
For example, where in allennlp or in your code does no_grad take effect? In which .py file is no_grad used? And in which file does parameter_groups do its work?

alasdairtran commented 4 years ago

1) You're right. The only thing the decoder and the roberta encoder have in common is the vocabulary file. The embeddings are different. I remember trying the idea of initialising the adaptive embeddings with the pretrained embedding matrix from roberta, but the performance wasn't as good, so I just ended up randomly initializing the adaptive embeddings. You can try this idea again and see if you can get better results.
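
If you want to retry that experiment, the initialisation is just a weight copy. A rough sketch, not the repo's actual code, where `pretrained` stands for roberta's embedding matrix however your roberta wrapper exposes it:

import torch
import torch.nn as nn

def embedding_from_pretrained(pretrained: torch.Tensor, padding_idx: int = 1) -> nn.Embedding:
    """Build an nn.Embedding whose weights start from a pretrained matrix."""
    vocab_size, embed_size = pretrained.shape
    embed = nn.Embedding(vocab_size, embed_size, padding_idx=padding_idx)
    with torch.no_grad():
        embed.weight.copy_(pretrained)  # the copy gets trained; the original stays frozen
    return embed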

2) Yeah, that's where it's padded. desired_num_tokens is defined here (also check out get_padding_lengths above it). Btw, I think you accidentally mentioned the user OVERRIDES in your comment :-P Better to use a code block next time.

3) I guess it should be safe to set roberta.eval() just before you call extract_features. It will just overwrite whatever mode was set by the parent. In PyTorch, when you run roberta.eval(), it only affects the roberta module and its children; the parent class is unaffected. See how train() and eval() are implemented here.
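
You can convince yourself of this with a toy example (generic modules, nothing from the repo):

import torch.nn as nn

parent = nn.Sequential(nn.Dropout(0.1), nn.Linear(4, 4))
parent.train()     # puts the parent and all of its children into training mode
parent[0].eval()   # switches only the Dropout child (and its children) to eval mode
print(parent.training, parent[0].training, parent[1].training)  # True False True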

4) no_grad is used by AllenNLP here. It just sets the requires_grad attribute of parameters that match the regex to False. parameter_groups is not really used anywhere. If you want to set different learning rates on different layers, you can add custom parameters inside those {} brackets and it will get parsed here and then passed down to the optimizer (e.g. Adam).
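
Conceptually, the no_grad handling boils down to something like this (a simplified sketch, not AllenNLP's exact code):

import re
from typing import List
import torch.nn as nn

def freeze_by_regex(model: nn.Module, patterns: List[str]) -> None:
    """Turn off gradients for every parameter whose name matches one of the regexes."""
    for name, param in model.named_parameters():
        if any(re.search(pattern, name) for pattern in patterns):
            param.requires_grad = False

# e.g. freeze_by_regex(model, ['^resnet', '^roberta'])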

xuewyang commented 4 years ago

Yes, thank you. I will use a code block next time.

xuewyang commented 4 years ago

Hi, can you explain the following questions?

  1. Will max_pos > self.weights.shape[0] ever happen? here. I think you set init_size = 512, as here. And the max length of a generated caption is 100, as here. So with a maximum of 100, max_pos > 512 won't happen, right? Also, how does register_buffer work? See here. I don't see TokenEmbedder using this function.
  2. In batch, here, it seems that batch['context']['roberta'] has a shape like torch.Size([15, 72]), while the raw context text has > 300 tokens here. Why is there such a big difference? As I am not familiar with allennlp, I don't know what role roberta_indexer.py plays in making a 'batch'. How does this happen? I am afraid the same problem exists in the training phase.

Thanks.

xuewyang commented 4 years ago

For the eval() mode problem, the only thing we can do is call eval() every time before extract_features, like this:

self.roberta = self.roberta.eval()
X_sections_hiddens = self.roberta.extract_features(
    article_ids, return_all_hiddens=True)

After testing, I found that calling eval() only in the init function does not work.

alasdairtran commented 4 years ago

1) Yeah, I don't think that code will ever run, since we force all generated captions to be less than 100 tokens. register_buffer tells pytorch not to compute the gradient for that tensor. We want this for the Sinusoidal Positional Embeddings since they are just a bunch of sines and cosines, i.e. the embedding for each position is always fixed. For TokenEmbedder, we need the gradients to learn the embedding weights.
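
Here's a rough, self-contained sketch of the idea (not the exact code in the repo, which follows fairseq's implementation; assumes an even embed_dim):

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed sine/cosine position encodings stored as a non-trainable buffer."""

    def __init__(self, embed_dim: int, max_positions: int = 512):
        super().__init__()
        position = torch.arange(max_positions).float().unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float()
                             * (-math.log(10000.0) / embed_dim))
        weights = torch.zeros(max_positions, embed_dim)
        weights[:, 0::2] = torch.sin(position * div_term)
        weights[:, 1::2] = torch.cos(position * div_term)
        # A buffer is saved with the module's state but is not a Parameter,
        # so no gradient is ever computed for it.
        self.register_buffer('weights', weights)

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: a LongTensor of position indices, e.g. shape (batch, seq_len)
        return self.weights[positions]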

2) Perhaps you're looking at different batches? The shape [15, 72] should mean that there are 15 samples in that batch and the longest article in that batch has 72 tokens. Inside that loop, try printing out batch['metadata']['context'] and that will tell you the actual text that the IDs in batch['context']['roberta'] encode. At a high level, all roberta_indexer does is convert the raw text into IDs. The data iterator then shuffles the samples and yields batches of similar sizes. Because of the shuffling, the easiest way to find the raw text corresponding to the IDs is to look at batch['metadata'].
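
If it helps, the grouping-by-length logic is roughly this (a toy sketch, not AllenNLP's BucketIterator; 'context_ids' is just a made-up key):

import random
from typing import Dict, List

def bucket_batches(samples: List[Dict], batch_size: int) -> List[List[Dict]]:
    """Sort by article length so each batch pads to a similar length, then shuffle the batches."""
    ordered = sorted(samples, key=lambda s: len(s['context_ids']))
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)  # batch order is random; samples within a batch have similar lengths
    return batches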

3) Yep it makes sense to put eval() directly before calling extract_features since the parent class probably overwrites the mode at the start of each epoch.

xuewyang commented 4 years ago

  1. Can you check the shape of batch['context']['roberta']? I really think there might be a problem. batch['metadata']['context'] is correct, I think. These two should have similar shapes, but they don't. I also copied the following from the config file. Here you use some sorting_keys; I think what they do is sort the samples according to those keys. Can you explain where this is used in your code? As you said, instances_per_epoch is the number of samples used for one epoch, but I don't know which function takes this as input. Also, for maximum_samples_per_batch, I think it may be the number of tokens per batch, but I still can't find this variable in the code.

     iterator:
       type: bucket
       sorting_keys:
         - [context, num_tokens]
         - [caption, num_tokens]
       batch_size: 16
       max_instances_in_memory: 8192
       biggest_batch_first: false
       instances_per_epoch: 65536
       maximum_samples_per_batch: ["num_tokens", 16384]
  2. For the adaptive softmax, how do you make sure the first 5000 tokens have the highest frequencies? Is the vocabulary ordered by frequency?

alasdairtran commented 4 years ago

1) I just ran the evaluation script and added the following debug code:

print('tensor shape:', batch['context']['roberta'].shape)
print('raw text length:', end=' ')
for sample in batch['metadata']: 
    print(len(sample['context'].split()), end=' ')

Example output:

tensor shape: torch.Size([15, 385])
raw text length: 194 239 203 258 243 215 275 199 275 257 258 224 267 264 280 

This looks reasonable to me. Send me some debug code to reproduce your problem if you're still having issues.

2) Both sorting_keys and maximum_samples_per_batch are used by the Bucket Iterator and its parent class. You can start here.
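
The maximum_samples_per_batch logic is roughly "longest sequence x batch size must stay under a token budget". A simplified sketch (not AllenNLP's actual code; 'context_ids' is a made-up key):

from typing import Dict, List

def respect_token_budget(batch: List[Dict], max_tokens: int = 16384) -> List[List[Dict]]:
    """Split a batch whenever its padded size (batch size x longest sequence) exceeds the budget."""
    longest = max(len(s['context_ids']) for s in batch)
    if len(batch) <= 1 or longest * len(batch) <= max_tokens:
        return [batch]
    mid = len(batch) // 2
    return (respect_token_budget(batch[:mid], max_tokens)
            + respect_token_budget(batch[mid:], max_tokens))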

3) I simply assumed the roberta vocab is already sorted by frequency, so I didn't do any further sorting. I couldn't find any reference for this, but scanning through the first few hundred tokens in the vocab, it seems like a reasonable assumption.

xuewyang commented 4 years ago

Maybe I understand now. If we look at the following output, the tensor shape and the raw text lengths get bigger with more iterations. So my guess is that the dataloader groups samples of similar sizes together and processes them together, with the samples sorted by the sorting keys.

tensor shape: torch.Size([15, 54])
raw text length: 16 19 22 30 29 31 32 32 38 34 29 26 39 39 33
tensor shape: torch.Size([15, 72])
raw text length: 33 33 30 40 36 41 38 48 47 38 49 43 55 55 50
tensor shape: torch.Size([15, 93])
raw text length: 50 63 52 55 54 60 65 61 71 57 60 57 63 59 56
...
tensor shape: torch.Size([15, 266])
raw text length: 193 180 176 174 190 166 181 174 213 188 213 182 160 195 182

xuewyang commented 4 years ago

Do you have any experience with learning rate tuning? My batch size is now 10 times larger. Should I try a bigger lr? Also, you are using the bert adam optimizer, but the model is not bert. Why not use adam or adamw?

alasdairtran commented 4 years ago

I think you just need to do some hyperparameter search on the validation set to find the best learning rate. I don't think we can say in advance what the best lr is.

The bert adam optimizer is just adamw that has a linear learning rate scheduler with warmup built in.
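
In plain pytorch terms, it's roughly equivalent to this (a sketch with made-up step counts and learning rate):

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_then_decay(warmup_steps: int, total_steps: int):
    """Return a multiplier that ramps up linearly, then decays linearly to zero."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return lr_lambda

model = torch.nn.Linear(10, 10)  # stand-in for the real model
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay(1000, 100000))

# in the training loop, call scheduler.step() after each optimizer.step()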