swarada96 closed this issue 1 year ago.
Hello @swarada96, thanks for your feedback.
The problem is that the config property `val_batch_size` is not present in your config file. You can add it there to fix this, but I've also pushed some changes that address it. Update the repo and it should work fine.
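For reference, this is roughly what the `train` section of `roberta-pretraining.yaml` would need to look like with the missing key added. The surrounding keys and the values shown here are illustrative assumptions; check them against your own config file:

```yaml
train:
  batch_size: 16        # training batch size (example value)
  val_batch_size: 16    # the key the trainer reads; adding it avoids the "Missing key" error
```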
I updated the repo as per your suggestion and it made a difference, thank you for that. But after using the updated file, my CUDA device runs out of memory. What changes would you suggest? Thanks in advance.
```
PS F:\data2vec-pytorch-main> python train.py --config text/configs/roberta-pretraining.yaml
Found cached dataset wikitext (C:/Users/User/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████| 3/3 [00:02<00:00,  1.05it/s]
Found cached dataset wikitext (C:/Users/User/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████| 3/3 [00:00<00:00, 50.40it/s]
Epoch: 1/20   0%| | 0/56293 [00:00<?, ?batch/s]
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Epoch: 1/20   0%| | 0/56293 [00:09<?, ?batch/s]
Traceback (most recent call last):
  File "F:\data2vec-pytorch-main\train.py", line 25, in
```
@swarada96 You've got 4 GB in total; it'd be better to set a lower `batch_size`.
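As a back-of-the-envelope way to pick a smaller batch size, you can halve it until the estimated per-batch memory fits within the GPU's capacity. The helper below is purely illustrative (it is not part of data2vec-pytorch), and the per-sample memory figure is an assumption you'd have to measure on your own setup:

```python
def suggest_batch_size(total_mem_gb, per_sample_gb, current=32, headroom=0.9):
    """Halve `current` until the rough memory estimate fits.

    `per_sample_gb` is a made-up per-sample activation cost; measure it
    for your model before trusting the result.
    """
    bs = current
    while bs > 1 and bs * per_sample_gb > total_mem_gb * headroom:
        bs //= 2
    return bs

# With a 4 GB card and an assumed 0.5 GB per sample, start from 32:
print(suggest_batch_size(4.0, 0.5, current=32))  # -> 4
```

This is only a starting point; if even `batch_size: 1` runs out of memory, you'd need a shorter sequence length or gradient accumulation instead.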
Hello Aryan,
Do we have to create a folder named `dummy_data` in order to save the split data? I am getting the following error for the vision config.
```
python train.py --config vision/configs/beit-pretraining.yaml
Traceback (most recent call last):
  File "/home1/08351/sak3951/Work/data2vec-pytorch/train.py", line 24, in
```
Hello again @swarada96, sorry for the delay.
`dummy_data` is just an arbitrary folder name for a directory containing all the image files. You can define or create your own directory of images and point the config at it.
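To sanity-check that your image directory is set up the way the loader expects, you can list the image files it contains. `collect_image_paths` is a hypothetical helper for illustration, not a function from data2vec-pytorch:

```python
import os
import tempfile

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}

def collect_image_paths(root):
    """Recursively gather image file paths under `root` (illustrative helper)."""
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Example: build a stand-in "dummy_data" directory with two images and a stray file
root = tempfile.mkdtemp()
for fname in ("a.png", "b.jpg", "notes.txt"):
    open(os.path.join(root, fname), "w").close()

print(len(collect_image_paths(root)))  # -> 2
```

If this prints 0 for your real directory, the dataset loader will have nothing to train on, which usually shows up as an error at startup.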
```
Traceback (most recent call last):
  File "F:\study\UTA_PhD\Papers\data2vec-pytorch-main\train.py", line 24, in
    trainer = trainers_dict[modality]
  File "F:\study\UTA_PhD\Papers\data2vec-pytorch-main\text\trainer.py", line 55, in __init__
    self.test_loader = DataLoader(self.test_dataset, batch_size=cfg.train.val_batch_size,
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\base.py", line 231, in _format_and_raise
    format_and_raise(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key val_batch_size
    full_key: train.val_batch_size
    object_type=dict
```
Can you please help me with omegaconf? Which package versions were used while training on these datasets?