abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License

Why was "import sled" commented out in run.py? #50

Closed: shi-kejian closed this issue 10 months ago

shi-kejian commented 11 months ago

Hi,

Thank you again for this great effort.

As the title says: why are "from sled import SledConfig" and "import sled  # *** required so that SledModels will be registered for the AutoClasses ***" commented out in the current commit of run.py? Is the import no longer needed in the default setting?

https://github.com/abertsch72/unlimiformer/blob/651c5b37d96d676e1da32e36b05dc388bcc440e4/src/run.py#L31C28-L31C28

  File "/unlimiformer/src/unlimiformer.py", line 814, in convert_model
    type_to_class[type(model)](model, *args, **kwargs)
KeyError: <class 'sled.modeling_sled.SledForConditionalGeneration'>
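
For context on the commented-out import: per the inline comment, importing sled is needed only for its side effect of registering the SLED classes with the Hugging Face Auto classes, so that AutoConfig / AutoModel can resolve a checkpoint such as tau/bart-large-sled to SLED's own classes. A minimal sketch of that general registration pattern (the class names and model-type string below are illustrative, not taken from the SLED source):

    # Sketch of the Auto-class registration that a package can run at import time.
    # All names here are placeholders.
    from transformers import (AutoConfig, AutoModelForSeq2SeqLM,
                              PretrainedConfig, PreTrainedModel)

    class MyCustomConfig(PretrainedConfig):
        model_type = "my_custom"

    class MyCustomForConditionalGeneration(PreTrainedModel):
        config_class = MyCustomConfig
        # ... model implementation omitted ...

    # Without these two calls (i.e., without importing the package that runs them),
    # AutoConfig.from_pretrained / AutoModelForSeq2SeqLM.from_pretrained cannot map
    # a checkpoint whose config declares model_type "my_custom" to these classes.
    AutoConfig.register("my_custom", MyCustomConfig)
    AutoModelForSeq2SeqLM.register(MyCustomConfig, MyCustomForConditionalGeneration)

Separately, the KeyError above comes from Unlimiformer's convert_model dispatching on type(model) through a type_to_class table which, judging from the traceback, has no entry for sled.modeling_sled.SledForConditionalGeneration.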
urialon commented 11 months ago

Hi @shi-kejian,
Thank you for your interest in our work!

We based our implementation on SLED's code, so there might be some leftovers from SLED.

What is the command line that you were running when encountering this error?

Best,
Uri

shi-kejian commented 11 months ago

Oh. Thank you. I am actually following up on #49

I'm trying to use either facebook/bart-base or sled on my local dataset.

    python src/run.py \
        src/configs/training/base_training_args.json \
        src/configs/data/my_own_data.json \
        --model_name_or_path tau/bart-large-sled \
        --use_auth_token false \
        --overwrite_cache \
        --output_dir output_train_bart_large_meeting_oracle/ \
        --overwrite_output_dir \
        --max_source_length 1024 \
        --eval_max_source_length 999999 \
        --generation_max_length 640 \
        --max_target_length 640 \
        --max_prefix_length 96 \
        --do_eval=True \
        --learning_rate 1e-5 \
        --per_device_eval_batch_size 1 \
        --per_device_train_batch_size 2 \
        --unlimiformer_training=True \
        --test_unlimiformer \
        --eval_steps 30 \
        --save_steps 30 \
        --num_train_epochs 10 \
        --metric_names rouge \
        --extra_metrics bertscore \
        --metric_for_best_model bertscore \
        --fp16

My data.json:

    {
        "dataset_name": "",
        "dataset_config_name": "default",
        "max_source_length": 16384,
        "generation_max_length": 640,
        "max_prefix_length": 96,
        "pad_prefix": true,
        "num_train_epochs": 10,
        "input_column": "Article",
        "input_prefix_column": "Query",
        "output_column": "Summary",
        "metric_names": ["rouge"],
        "metric_for_best_model": "rouge/geometric_mean",
        "greater_is_better": true
    }

Thank you!

urialon commented 11 months ago
  1. Do you manage to run with the existing datasets, e.g., GovReport?

  2. If you copy gov_report.json to my_data.json and just change the variables in my_data.json to point to your dataset, does it work? (A rough sketch of such an adapted config is below.)
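
For reference, a rough sketch of what such an adapted my_data.json could look like, reusing the keys from the config posted above; the dataset identifier is a placeholder and gov_report.json's exact contents are not reproduced here, so treat this as a comparison aid rather than a verified setup:

    {
        "dataset_name": "<your-dataset-identifier-or-path>",
        "dataset_config_name": "default",
        "input_column": "Article",
        "input_prefix_column": "Query",
        "output_column": "Summary",
        "max_source_length": 16384,
        "generation_max_length": 640,
        "max_prefix_length": 96,
        "pad_prefix": true,
        "metric_names": ["rouge"],
        "metric_for_best_model": "rouge/geometric_mean",
        "greater_is_better": true
    }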

shi-kejian commented 11 months ago

Thank you.

It turns out that adding the --fp16 flag breaks the run.

  File "/ext3/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))

Sticking with fp32 can circumvent the issue.

> Do you manage to run with the existing datasets, e.g., GovReport?

Running the quickstart for GovReport actually broke for me.

    /ext3/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
      warnings.warn('Was asked to gather along dimension 0, but all '
    reproduce.bash: line 11: 2960178 Segmentation fault (core dumped)

> If you copy gov_report.json to my_data.json and just change the variables in my_data.json to point to your dataset, does it work?

For single GPU:

      File "unlimiformer/src/run.py", line 802, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
        return inner_training_loop(
      File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
        tr_loss_step = self.training_step(model, inputs)
      File "/ext3/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2665, in training_step
        self.accelerator.backward(loss)
      ...
      File "/ext3/miniconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 244, in backward
        tensors = ctx.saved_tensors
    RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

For multi-GPU: It again hits #49

So for a single GPU, training at least gets into the backward pass before failing with the RuntimeError above.
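
For what it's worth, that RuntimeError is PyTorch's generic complaint about reusing an autograd graph whose saved tensors were already freed; whatever triggers it inside the Unlimiformer training loop is not diagnosed here, but a minimal standalone repro of the same message (independent of this repo) is:

    import torch

    x = torch.randn(3, requires_grad=True)
    y = (x * 2).sum()
    y.backward()  # the first backward frees the graph's saved intermediate tensors
    y.backward()  # raises: "Trying to backward through the graph a second time ..."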

So for 1. (reproducing standard finetuning), I got a segmentation fault. Is it just me, or is anyone else seeing this? Have you run into it before?

Thank you very much!