arxyzan / data2vec-pytorch

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI
MIT License

Some Questions #13

Closed bryanwong17 closed 1 year ago

bryanwong17 commented 1 year ago

Hi @arxyzan ,

  1. Can you tell me what parts I need to change if my input size is 256 instead of 224?

  2. Is it mandatory to load encoder_checkpoint? Can I train my model from scratch?

  3. Why is the config file for the vision task named beit-pretraining.yaml?

  4. Could you help me solve the problem below?

Epoch: 1/100 0%| | 0/18001 [00:02<?, ?batch/s]
Traceback (most recent call last):
  File "/mnt/c/data2vec-pytorch/", line 25, in <module>
    trainer.train()
  File "/mnt/c/data2vec-pytorch/vision/", line 131, in train
    train_loss = self.train_epoch(epoch)
  File "/mnt/c/data2vec-pytorch/vision/", line 97, in train_epoch
    loss = self.train_step(batch)
  File "/mnt/c/data2vec-pytorch/vision/", line 56, in train_step
    x, y = self.model(src, trg, mask)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/data2vec-pytorch/data2vec/", line 90, in forward
    y = self.ema.model(trg, ~mask, **kwargs)['encoder_states'] # fetch the last transformer layers outputs
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/data2vec-pytorch/vision/", line 38, in forward
    outputs = self.encoder(pixel_values=inputs, output_hidden_states=True, output_attentions=True, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/", line 681, in forward
    embedding_output = self.embeddings(pixel_values, bool_masked_pos)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/", line 154, in forward
    embeddings = self.patch_embeddings(pixel_values)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/", line 206, in forward
    embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

arxyzan commented 1 year ago

Hello @bryanwong17, and thanks for your feedback. Keep in mind that Data2Vec is itself a large pretrained model, so you should only use it if you actually want to pretrain a model for vision/text/audio. If you instead want to finetune a model for a downstream task such as image recognition, you have to start from the pretrained weights and finetune from there. With that in mind:

  1. You can change the input size only for pretraining. For finetuning you cannot change the model architecture, because finetuning requires the pretrained model, which was trained with a 224-pixel input. (This is not the case for text and audio, because those models accept different input sizes.)
  2. encoder_checkpoint is only there so the code knows which base model you are using, based on the HuggingFace Hub path you provide. No weights are assigned or loaded there. It just loads the config file from that path on the HF Hub and figures out which model you want to use. This let me provide a general encoder class for vision using transformers.AutoModel and transformers.AutoConfig. Otherwise, one would have to provide a separate encoder class for each base architecture they wanted to use.
  3. Because the default data2vec for vision uses BEiT as the base encoder model. If you want to use another model, you can provide a new config file for it.
  4. Thanks for reporting this issue. I just fixed it. You can try the code again now and it should work fine.
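
Points 1, 2 and 4 can be illustrated with a minimal sketch. This is not the code from this repo. It assumes a locally built BeitConfig as a stand-in for the config that the repo reads from the Hub path, and the tiny layer sizes are hypothetical values chosen only so the example runs quickly on CPU. It shows a BEiT encoder being created from a config alone (randomly initialised, no checkpoint loaded) with a 256-pixel input size. It also shows the model and the inputs being placed on the same device, which is the condition the RuntimeError in the question complains about.

```python
import torch
from transformers import AutoModel, BeitConfig

# Hypothetical, deliberately tiny settings so the sketch runs quickly on CPU;
# only image_size=256 comes from the question in this thread.
config = BeitConfig(
    image_size=256,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

# from_config builds the architecture with randomly initialised weights;
# no pretrained checkpoint is downloaded or loaded here.
model = AutoModel.from_config(config)

# Keep the model weights and the input batch on the same device. A model left
# on CPU with inputs on CUDA produces the "Input type ... and weight type ...
# should be the same" error shown in the traceback.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
pixel_values = torch.randn(2, 3, 256, 256, device=device)

model.eval()
with torch.no_grad():
    outputs = model(pixel_values=pixel_values, output_hidden_states=True)

# One token per 16x16 patch, plus the [CLS] token.
num_patches = (config.image_size // config.patch_size) ** 2
print(outputs.last_hidden_state.shape)
```

With pretraining from scratch, the 256-pixel setting only changes the number of patch tokens the encoder sees. A pretrained 224-pixel checkpoint would not match this configuration without further adaptation, which is why point 1 restricts the change to pretraining.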

bryanwong17 commented 1 year ago

Hi @arxyzan, thanks for the quick response. My goal in training Data2Vec is to replace the feature extractor of a Multiple Instance Learning (MIL) framework with Data2Vec, so that all extracted images (patches) can be passed through the trained embedder. Do you suggest training Data2Vec from scratch? So far, I have around 500k histopathology images (256 x 256).

arxyzan commented 1 year ago

Data2Vec and similar large image models are all trained on huge amounts of data from ImageNet. Since the domain of that data is fairly different from what you are working on, I don't think a large model designed for large-scale pretraining is the right choice unless you have a comparable amount of data from your own domain. I'm not familiar with histopathology data, but I think you'd be better off looking for a base model that is suited to that kind of data.