a-rios / longmbart

Apache License 2.0

float_mask.repeat(1, 1, repeat_size, 1) causes RuntimeError: Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor #13

Closed: tschomacker closed this issue 2 years ago

tschomacker commented 2 years ago

I am trying to fine-tune my own longmbart on text simplification, but I am a little stuck. The conversion worked, but I got an error when starting to fine-tune. I would really appreciate any hints on how to fix the problem.

What I did previously:

  1. pip install -q -r requirements.txt
  2. converted the model:
    python ./scripts/convert_mbart_to_longformerencoderdecoder.py \
    --save_model_to ./output/converted-longmbart \
    --attention_window 512 \
    --cache_dir ./output/mbart-large-cc25 \
    --base_model facebook/mbart-large-cc25 \
    --tokenizer_name_or_path facebook/mbart-large-cc25\
    --add_language_tags de_OR de_SI \
    --initialize_tags de_DE de_DE \
    --max_pos 1024 \
    --verbose 1
  3. started the fine-tuning:
    python -m longformer.simplification \
    --from_pretrained ./output/converted-longmbart \
    --tokenizer ./output/converted-longmbart \
    --save_dir ./output/longmbart-fine-tuned \
    --save_prefix "w512" \
    --train_source ./data/train-source.txt \
    --train_target ./data/train-target.txt \
    --val_source ./data/val-source.txt \
    --val_target ./data/val-target.txt \
    --test_source ./data/test-source.txt \
    --test_target ./data/test-target.txt \
    --max_output_len 1024 \
    --max_input_len 1024 \
    --batch_size 1 \
    --grad_accum 60 \
    --num_workers 5 \
    --gpus 1 \
    --seed 222 \
    --attention_dropout 0.1 \
    --dropout 0.3 \
    --attention_mode sliding_chunks \
    --attention_window 512 \
    --label_smoothing 0.2 \
    --lr 0.00003 \
    --val_every 1.0 \
    --val_percent_check 1.0 \
    --test_percent_check 1.0 \
    --early_stopping_metric 'rougeL' \
    --patience 10 \
    --lr_reduce_patience 8 \
    --lr_reduce_factor 0.5 \
    --grad_ckpt \
    --progress_bar_refresh_rate 10 \
    --tags_included

    This threw the following RuntimeError:

    Current Behavior: RuntimeError

    Epoch 0:   0%|                                            | 0/2 [00:00<?, ?it/s]
    Traceback (most recent call last):
    File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    File "/home/jovyan/git/longmbart/longformer/simplification.py", line 527, in <module>
    main(args)
    File "/home/jovyan/git/longmbart/longformer/simplification.py", line 518, in main
    trainer.fit(model)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 307, in ddp_train
    results = self.train_or_test()
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 549, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 482, in optimizer_step
    model_ref.optimizer_step(
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1296, in optimizer_step
    optimizer.step(closure=optimizer_closure)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 140, in __optimizer_step
    trainer.precision_connector.backend.optimizer_step(trainer, optimizer, closure)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/plugins/native_amp.py", line 75, in optimizer_step
    closure()
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 694, in train_step_and_backward_closure
    result = self.training_step_and_backward(
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 792, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 316, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 164, in training_step
    return self._step(args)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 176, in _step
    output = self.trainer.model(*args)
    File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/overrides/data_parallel.py", line 179, in forward
    output = self.module.training_step(*inputs[0], **kwargs[0])
    File "/home/jovyan/git/longmbart/longformer/simplification.py", line 251, in training_step
    output = self.forward(*batch)
    File "/home/jovyan/git/longmbart/longformer/simplification.py", line 231, in forward
    outputs = self.model(
    File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.9/site-packages/transformers/models/mbart/modeling_mbart.py", line 1346, in forward
    outputs = self.model(
    File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.9/site-packages/transformers/models/mbart/modeling_mbart.py", line 1211, in forward
    encoder_outputs = self.encoder(
    File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.9/site-packages/transformers/models/mbart/modeling_mbart.py", line 840, in forward
    layer_outputs = encoder_layer(
    File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.9/site-packages/transformers/models/mbart/modeling_mbart.py", line 331, in forward
    hidden_states, attn_weights, _ = self.self_attn(
    File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/jovyan/git/longmbart/longformer/longformer_encoder_decoder.py", line 66, in forward
    outputs = self.longformer_self_attn(
    File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/jovyan/git/longmbart/longformer/longformer.py", line 184, in forward
    float_mask = float_mask.repeat(1, 1, repeat_size, 1)
    RuntimeError: Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor

I have checked float_mask and its size is torch.Size([1, 1, 1024, 1024, 1, 1]), which looks odd to me.
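For reference, the underlying PyTorch error can be reproduced in isolation: Tensor.repeat() requires at least as many repeat factors as the tensor has dimensions, so the 4-factor pattern applied to a 6-D mask fails exactly like this (a minimal sketch, independent of longmbart):

    import torch

    # Tensor.repeat() needs at least as many repeat factors as the tensor has dimensions.
    mask_4d = torch.ones(1, 1, 1024, 1)           # the 4-D shape the repeat(1, 1, repeat_size, 1) call expects
    print(mask_4d.repeat(1, 1, 2, 1).shape)       # works: torch.Size([1, 1, 2048, 1])

    mask_6d = torch.ones(1, 1, 1024, 1024, 1, 1)  # a 6-D mask like the one observed above
    try:
        mask_6d.repeat(1, 1, 2, 1)                # 4 repeat factors < 6 tensor dimensions
    except RuntimeError as err:
        print(err)  # "Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor"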

a-rios commented 2 years ago

Hi, the shape should be torch.Size([1, 1024, 1, 1]). I cannot reproduce this, so could you send me a minimal sample of your data/script that produces this error? And just to make sure: is the transformers version you have installed the one linked in the requirements file? It is not the default huggingface code; we had to make some changes to mbart.
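(A quick, unofficial way to check which transformers build is active, assuming a pip-based install; the check relies on the fact that the patched fork's modeling_mbart.py references the longformer package, which the stock release does not:)

    import inspect
    import transformers
    import transformers.models.mbart.modeling_mbart as modeling_mbart

    # The patched ZurichNLP fork references the longformer package inside modeling_mbart.py;
    # the stock PyPI release does not, so this prints False for 'normal' transformers.
    # (With the fork installed but the longmbart repo missing from the environment, the
    # import above itself fails with "No module named 'longformer'", as in the traceback below.)
    print(transformers.__version__)
    print("longformer" in inspect.getsource(modeling_mbart))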

tschomacker commented 2 years ago

Hi, thanks for the very quick response. I actually changed the requirements and installed the 'normal' transformers package. I changed it because running the conversion (same call as above) with transformers @ git+https://github.com/ZurichNLP/transformers.git@longmbart#egg=transformers installed resulted in:

Traceback (most recent call last):
  File "/home/jovyan/git/longmbart/./scripts/convert_mbart_to_longformerencoderdecoder.py", line 11, in <module>
    from transformers import MBartForConditionalGeneration
  File "/opt/conda/lib/python3.9/site-packages/transformers/__init__.py", line 2162, in __getattr__
    return super().__getattr__(name)
  File "/opt/conda/lib/python3.9/site-packages/transformers/file_utils.py", line 1479, in __getattr__
    value = getattr(module, name)
  File "/opt/conda/lib/python3.9/site-packages/transformers/file_utils.py", line 1478, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/mbart/__init__.py", line 89, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/opt/conda/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/mbart/modeling_mbart.py", line 47, in <module>
    from longformer.longformer_encoder_decoder import LongformerSelfAttentionForBart
ModuleNotFoundError: No module named 'longformer'

This issue was resolved after switching to the 'normal' transformers package.

a-rios commented 2 years ago

Ok, longmbart will not run with the standard transformers library, because longmbart uses attention masks with 3 values (0, 1, 2) instead of the standard (0, 1); this is to distinguish local and global attention. You need the transformers repo linked in the requirements file. The conversion error looks like your longmbart repo wasn't installed in your Python environment; you can install it (from within the longmbart directory) with: pip install -e .
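(Purely illustrative sketch of the 3-valued mask described above; the value convention used here, 0 = padding, 1 = local sliding-window attention, 2 = global attention, is an assumption, and the exact semantics are defined in the patched fork:)

    import torch

    # Hypothetical example of a 3-valued longmbart attention mask for a 7-token sequence.
    # Assumed convention: 0 = padding, 1 = local (sliding-window) attention, 2 = global attention.
    attention_mask = torch.tensor([[2, 1, 1, 1, 1, 0, 0]])

    # Stock transformers mbart only expects 0/1 masks, which is why the fork from the
    # requirements file is needed.
    print(attention_mask.unique())  # tensor([0, 1, 2])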