Closed: shi-kejian closed this issue 10 months ago
Hi @shi-kejian , Thank you for your interest in our work!
We based our implementation on SLED's code, so there might be some leftovers from SLED.
What is the command line that you were running when encountering this error?
Best, Uri
Oh. Thank you. I am actually following up on #49
I'm trying to use either facebook/bart-base or sled on my local dataset.
```bash
python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/my_own_data.json \
    --model_name_or_path tau/bart-large-sled \
    --use_auth_token false \
    --overwrite_cache \
    --output_dir output_train_bart_large_meeting_oracle/ \
    --overwrite_output_dir \
    --max_source_length 1024 \
    --eval_max_source_length 999999 \
    --generation_max_length 640 \
    --max_target_length 640 \
    --max_prefix_length 96 \
    --do_eval=True \
    --learning_rate 1e-5 \
    --per_device_eval_batch_size 1 \
    --per_device_train_batch_size 2 \
    --unlimiformer_training=True \
    --test_unlimiformer \
    --eval_steps 30 --save_steps 30 \
    --num_train_epochs 10 \
    --metric_names rouge \
    --extra_metrics bertscore \
    --metric_for_best_model bertscore \
    --fp16 \
```
My data.json:

```json
{
    "dataset_name": "
```
Thank you!
Do you manage to run with the existing datasets, e.g., GovReport? If you copy gov_report.json to my_data.json and just change the variables in my_data.json to point to your datasets, does it work?
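In case it helps, the only key this thread shows from such a config is `dataset_name`; a hypothetical my_data.json along those lines might be as minimal as the sketch below (for the exact set of keys, copy them from the real `src/configs/data/gov_report.json` rather than from here):

```json
{
    "dataset_name": "path/or/hub-id-of-your-dataset"
}
```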
Thank you.
It turns out that adding the `--fp16` flag breaks training:

```
File "/ext3/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
```

Sticking with fp32 circumvents the issue.
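For what it's worth, one common way fp16 breaks on long sequences is numeric overflow: float16 tops out around 65504, so large attention scores or losses silently become inf and downstream ops crash or produce NaNs. Whether that is what's happening here would need the full traceback; this is just a minimal NumPy sketch of the failure mode, not Unlimiformer code:

```python
import numpy as np

# The largest finite float16 value is 65504; anything bigger overflows.
print(np.finfo(np.float16).max)   # 65504.0

# Converting a larger number silently yields inf rather than raising,
# which is why fp16 failures often surface far from their cause.
big = np.float16(70000.0)
print(big, np.isinf(big))
```

Running in fp32 avoids this because float32's range (~3.4e38) comfortably covers typical attention logits and losses.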
> Do you manage to run with the existing datasets, e.g., GovReport?
Running the quickstart for GovReport actually broke for me:

```
/ext3/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
reproduce.bash: line 11: 2960178 Segmentation fault      (core dumped)
```
> If you copy the gov_report.json to my_data.json and just change the variables at my_data.json to point to your datasets, does it work?
For single GPU:

```
File "unlimiformer/src/run.py", line 802, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
........
.......
....
    ^^^^^^^^^^^^^^^^^^^^
File "/ext3/miniconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 244, in backward
    tensors = ctx.saved_tensors
              ^^^^^^^^^^^^^^^^^
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
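For context on that RuntimeError: autograd frees each node's saved activations the first time backward runs through it, so anything that re-enters the same graph region a second time (e.g. gradient checkpointing interacting badly with a custom forward) hits this error. A pure-Python toy of that free-on-first-use behavior, not torch or Unlimiformer code:

```python
class SavedTensors:
    """Toy model of autograd's saved-tensor lifecycle: intermediates
    are freed after the first backward, so a second backward through
    the same graph raises, mirroring the traceback above."""

    def __init__(self, values):
        self._values = values

    def backward(self):
        if self._values is None:
            raise RuntimeError(
                "Trying to backward through the graph a second time")
        grads = list(self._values)  # consume the saved values...
        self._values = None         # ...then free them, as autograd does
        return grads


ctx = SavedTensors([1.0, 2.0])
ctx.backward()        # first backward succeeds
try:
    ctx.backward()    # second backward raises
except RuntimeError as e:
    print(e)
```

In real torch, `retain_graph=True` keeps the saved values alive, but that usually papers over a double-backward bug rather than fixing it.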
For multi-GPU: It again hits #49
So on a single GPU, training fortunately at least starts and reaches the backward pass before failing.
So for 1. (reproducing standard finetuning) I got a segmentation fault. Is it just me, or is anyone else getting this issue? Do you happen to have encountered this before?
Thank you very much!
Hi,
Thank you again for this great effort.
As the title says: why are `from sled import SledConfig` and `import sled  # *** required so that SledModels will be registered for the AutoClasses ***` commented out in the current commit of run.py? Is the import no longer needed in the default setting? https://github.com/abertsch72/unlimiformer/blob/651c5b37d96d676e1da32e36b05dc388bcc440e4/src/run.py#L31C28-L31C28
```
File "/unlimiformer/src/unlimiformer.py", line 814, in convert_model
    type_to_class[type(model)](model, *args, **kwargs)
```
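On why an apparently unused `import sled` matters: importing a module executes its top-level code, and that is how model classes get registered with the AutoClasses (via `AutoConfig.register`/`AutoModel.register`-style calls). A minimal sketch of that import-time registration pattern; the names here are illustrative, not SLED's actual code:

```python
# Toy registry illustrating the side effect: top-level code in the
# imported module populates a registry, so the import is needed even
# though nothing from it is referenced explicitly afterwards.
MODEL_REGISTRY = {}


def register(name):
    """Decorator that records a model class under `name`."""
    def deco(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return deco


# In SLED this registration runs at module level, which is why
# `import sled` must execute before AutoModel-style lookups work.
@register("tau/bart-large-sled")
class SledModelStub:
    pass


def auto_model_from_name(name):
    """Stand-in for AutoModel.from_pretrained's class lookup."""
    return MODEL_REGISTRY[name]()


model = auto_model_from_name("tau/bart-large-sled")
print(type(model).__name__)  # SledModelStub
```

If the import stays commented out, a lookup by name would fail exactly because the registry was never populated.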