IBM / graph_ensemble_learning

Graph Ensemble Learning
Apache License 2.0

Tensor Formatting Issue while Training T5 #2

Closed · Zoher15 closed this 1 year ago

Zoher15 commented 1 year ago

Hello @lamthuy,

Could someone help me with this? I was training T5 on the preprocessed AMR 3.0 data, but there seems to be a tensor formatting error:

Traceback (most recent call last):
  File "/N/u/envs/graphene/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 698, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 29 at dim 1 (got 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/geode2/home/u110/zkachwal/BigRed3/graph_ensemble_learning/amr_parsing/t5/cli/train.py", line 209, in <module>
    train(epoch, args.batch, args.train.split(), args.data_type.split(), args.task_type.split())
  File "/geode2/home/u110/zkachwal/BigRed3/graph_ensemble_learning/amr_parsing/t5/cli/train.py", line 60, in train
    loss = net(source, target)
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/geode2/home/u110/zkachwal/BigRed3/graph_ensemble_learning/amr_parsing/t5/models/lg.py", line 49, in forward
    target_ids, target_mask = prepare_data(batch_target, self.tokenizer, self.max_target_length, self.target_pad_id)
  File "/geode2/home/u110/zkachwal/BigRed3/graph_ensemble_learning/amr_parsing/t5/models/utils.py", line 23, in prepare_data
    return_tensors=TensorType.PYTORCH)
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2456, in batch_encode_plus
    **kwargs,
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 563, in _batch_encode_plus
    verbose=verbose,
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 629, in _batch_prepare_for_model
    batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 203, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/N/u/zkachwal/Carbonate/miniconda3/envs/graphene/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 711, in convert_to_tensors
    "Unable to create tensor returning overflowing tokens of different lengths. "
ValueError: Unable to create tensor returning overflowing tokens of different lengths. Please see if a fast version of this tokenizer is available to have this feature available.

Best, Zoher

lamthuy commented 1 year ago

I guess the sequences in your batch do not all have the same length; have they been padded to a common length?
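
(For illustration: the ValueError in the traceback above is what the transformers slow tokenizers raise when a batch of variable-length encodings is converted to a tensor without padding. A minimal sketch, independent of this repository's code and assuming the t5-base tokenizer, showing the failure mode and the padding that avoids it:)

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
batch = ["a short sentence", "a noticeably longer sentence in the same batch"]

# Without padding the two encodings have different lengths, so the slow
# tokenizer cannot stack them into a single tensor and raises the ValueError:
# tokenizer.batch_encode_plus(batch, return_tensors="pt")

# Padding the batch to a common length avoids the error.
enc = tokenizer.batch_encode_plus(batch, padding=True, return_tensors="pt")
print(enc["input_ids"].shape)  # torch.Size([2, <longest length in batch>])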

Zoher15 commented 1 year ago

How do I fix that? I was simply following all the instructions in the README step by step to reproduce the results...

lamthuy commented 1 year ago

Hi Zoher, give me a few days to rerun the script; I don't know what is happening.

Zoher15 commented 1 year ago

Hi @lamthuy,

Any updates?

Best, Zoher

lamthuy commented 1 year ago

Hi Zoher, I tried the script again and it works well for me. Below is an example of the training command that I run:

python -u -m amr_parsing.t5.cli.train --train ./ldc/amr_2/preprocessed_data/train.txt.features.nowiki  --validation ./ldc/amr_2/preprocessed_data/dev.txt.features.nowiki --report_test ./ldc/amr_2/preprocessed_data/test.txt.features.nowiki --max_source_length 512 --max_target_length 512 --batch 8 -e 30 -m t5-base --model_type t5 --output ./t5_amr/ --data_type "amrdata"  --task_type "text2amr" --val_from_epoch 10

And the training looks fine, as shown below:

Downloading: 100%|███████████████████████████| 773k/773k [00:00<00:00, 13.0MB/s]
Downloading: 100%|█████████████████████████| 1.32M/1.32M [00:00<00:00, 21.7MB/s]
Downloading: 100%|█████████████████████████| 1.18k/1.18k [00:00<00:00, 1.07MB/s]
Downloading: 100%|███████████████████████████| 850M/850M [00:15<00:00, 56.5MB/s]
./ldc/amr_2/preprocessed_data/train.txt.features.nowiki
Number of file: 1
Loading and converting ./ldc/amr_2/preprocessed_data/train.txt.features.nowiki
ignoring epigraph data for duplicate triple: ('w', ':polarity', '-')
ignoring epigraph data for duplicate triple: ('b', ':mod', 'h')
ignoring epigraph data for duplicate triple: ('e', ':polarity', '-')
ignoring epigraph data for duplicate triple: ('c', ':ARG1', 'g')
Data size: 36521
Combine datasets size: 36521
Epoch: 1; Loss: 2.05596; Moving Loss: 2.99366;:   1%| | 34/4566 [03:56<6:54:25

Did you run data preprocessing before training?

python -u -m amr_utils.preprocess.preprocess_amr -i ./ldc/amr_2/data/amrs/split/ -o ./ldc/amr_2/preprocessed_data/

After preprocessing the data, my preprocessed data folder looks like this:

total 134912
-rw-r--r-- 1 tlhoa users  1274997 May 27 01:00 dev.txt
-rw-r--r-- 1 tlhoa users  2361469 May 27 01:01 dev.txt.features
-rw-r--r-- 1 tlhoa users  2275794 May 27 01:06 dev.txt.features.nowiki
-rw-r--r-- 1 tlhoa users  1318069 May 27 01:00 test.txt
-rw-r--r-- 1 tlhoa users  2439497 May 27 01:01 test.txt.features
-rw-r--r-- 1 tlhoa users  2349069 May 27 01:06 test.txt.features.nowiki
-rw-r--r-- 1 tlhoa users 27893761 May 27 01:01 train.txt
-rw-r--r-- 1 tlhoa users 49780704 May 27 01:06 train.txt.features
-rw-r--r-- 1 tlhoa users 48391247 May 27 01:06 train.txt.features.nowiki

Do you have the same files with the same sizes?

If you follow the same steps above and things still do not work, I guess there might be a problem with the versions of some libraries.
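
(If a library-version mismatch is suspected, a quick way to record the relevant versions on both machines, purely for comparison:)

import sys
import torch
import transformers

# Print the versions that matter for this training script so the two
# environments can be compared side by side.
print("python      :", sys.version.split()[0])
print("torch       :", torch.__version__)
print("transformers:", transformers.__version__)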

Zoher15 commented 1 year ago

@lamthuy Thanks for your response. The only difference is that I am using AMR 3.0; apart from that, everything you just described matches what I have, including all the preprocessed files.

I think it might be a package version issue as well. Could you export your environment to a file that I could use?

Best, Zoher

lamthuy commented 1 year ago

Could you please first try AMR 2.0 on your machine? I have not tried AMR 3.0 recently. I will send you my environment file via email.

Zoher15 commented 1 year ago

Hi @lamthuy,

Thanks for the .yml file. Since it was exported from your conda base environment, it contained a lot of packages; most of them installed successfully, but some did not.

Despite setting up the new environment, I got the exact same error for AMR 3.0. I would happily try it on AMR 2.0, but I would have to purchase it, and for my project I don't strictly need either dataset, since AMR parsing itself is not my end goal.

My goal is to show the performance of different AMR parsers on a downstream application, and I want to cite and use your ensemble for comparison. I am reproducing your pipeline to understand how it works so that I can adapt it for my downstream task.

Best, Zoher

lamthuy commented 1 year ago

Hi Zoher, I ran the training with AMR 3.0 and I get the same issue you reported. I believe it is caused by a recent change in the transformers tokenizer's handling of padding and truncation: with batch_size = 1 the training works well, and the error only appears when the batch size is larger than 1. I have tried a few padding and truncation options, but so far none of them work; to change those options yourself, look into amr_parsing/t5/models/utils.py, where the tokenizer is called in the prepare_data function. So there are two options for now to proceed: train with batch_size = 1, or downgrade to the older library versions I originally used.

If you still see the same issue, please let me know; maybe there is a problem with batch_encode_plus and we need to find an alternative tokenization for this data.
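
(For reference, a sketch of the kind of padding/truncation arguments such a tokenizer call can take. The function name and body below are illustrative only, not the repository's actual prepare_data implementation; return_tensors="pt" is the string form of the TensorType.PYTORCH seen in the traceback:)

from transformers import T5Tokenizer

def prepare_data_sketch(batch, tokenizer, max_length):
    # Illustrative only. Padding every sequence to max_length (and truncating
    # longer ones) gives the batch a uniform shape, so convert_to_tensors can
    # build a single tensor without raising the ValueError above.
    enc = tokenizer.batch_encode_plus(
        batch,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    return enc["input_ids"], enc["attention_mask"]

tokenizer = T5Tokenizer.from_pretrained("t5-base")
ids, mask = prepare_data_sketch(["parse this sentence", "and this one"], tokenizer, 512)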

Zoher15 commented 1 year ago

Hi @lamthuy,

Thanks for getting back to me. Wouldn't training with batch_size = 1 affect the training performance of T5?

I tried torch 1.6 and transformers 4.5.1, but now it gives me a CUDA error because the NVIDIA A100 GPU is not compatible with torch 1.6.

Either way, I am moving on from training. Since I don't need to train it for my downstream task, I can use other, non-T5 parsers. I will let you decide whether you want to close this issue or not.

Best, Zoher
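
(For reference, the A100 has compute capability 8.0, which the CUDA 10.x builds of torch 1.6 cannot target; a CUDA 11 build, available from torch 1.7 onward, is needed. A quick way to check what an installed torch build supports:)

import torch

# CUDA toolkit version this torch build was compiled against (None for CPU-only builds).
print("torch:", torch.__version__, "built with CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    # The A100 reports compute capability (8, 0); a CUDA 11+ build of torch
    # is required to generate code for it.
    print("device:", torch.cuda.get_device_name(0),
          "capability:", torch.cuda.get_device_capability(0))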

lamthuy commented 1 year ago

I think you can try a newer torch version to run on the A100; the issue concerns the transformers version rather than torch.

Zoher15 commented 1 year ago

@lamthuy Yeah, I tried torch 1.9.0, which works on the A100, so I am back to the original issue.

lamthuy commented 1 year ago

I think I fixed the issue. Could you please pull the latest code and try the training again?

Zoher15 commented 1 year ago

@lamthuy YES! This fixed it