Closed Zoher15 closed 1 year ago
I guess the sequences in your batch do not all have the same length — are you padding them?
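To illustrate what "same length" means here (a pure-Python sketch of the general idea, not the repo's actual collation code — `can_stack` and `pad_batch` are hypothetical helper names):

```python
# Illustrative sketch only -- not the repo's actual collation code.
# A batch of token-id sequences can only be stacked into a single
# tensor when every sequence has the same length; padding guarantees that.

def can_stack(batch):
    """True if all sequences in the batch share one length."""
    return len({len(seq) for seq in batch}) == 1

def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batch = [[5, 9, 3], [7, 2]]   # unequal lengths -> stacking would fail
assert not can_stack(batch)
assert can_stack(pad_batch(batch))
print(pad_batch(batch))       # [[5, 9, 3], [7, 2, 0]]
```

This is also why a batch size of 1 never triggers the error: a single sequence trivially has a consistent length.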
How do I fix that? I simply was following all the instructions in the Readme step by step to reproduce the results....
Hi Zoher, give me a few days to rerun the script; I don't know what happened.
Hi @lamthuy,
Any updates?
Best, Zoher
Hi Zoher, I tried the script again and it works well for me, below is the example of training command that I run:
python -u -m amr_parsing.t5.cli.train --train ./ldc/amr_2/preprocessed_data/train.txt.features.nowiki --validation ./ldc/amr_2/preprocessed_data/dev.txt.features.nowiki --report_test ./ldc/amr_2/preprocessed_data/test.txt.features.nowiki --max_source_length 512 --max_target_length 512 --batch 8 -e 30 -m t5-base --model_type t5 --output ./t5_amr/ --data_type "amrdata" --task_type "text2amr" --val_from_epoch 10
And the training looks fine as below:
Downloading: 100%|███████████████████████████| 773k/773k [00:00<00:00, 13.0MB/s]
Downloading: 100%|█████████████████████████| 1.32M/1.32M [00:00<00:00, 21.7MB/s]
Downloading: 100%|█████████████████████████| 1.18k/1.18k [00:00<00:00, 1.07MB/s]
Downloading: 100%|███████████████████████████| 850M/850M [00:15<00:00, 56.5MB/s]
./ldc/amr_2/preprocessed_data/train.txt.features.nowiki
Number of file: 1
Loading and converting ./ldc/amr_2/preprocessed_data/train.txt.features.nowiki
ignoring epigraph data for duplicate triple: ('w', ':polarity', '-')
ignoring epigraph data for duplicate triple: ('b', ':mod', 'h')
ignoring epigraph data for duplicate triple: ('e', ':polarity', '-')
ignoring epigraph data for duplicate triple: ('c', ':ARG1', 'g')
Data size: 36521
Combine datasets size: 36521
Epoch: 1; Loss: 2.05596; Moving Loss: 2.99366;: 1%| | 34/4566 [03:56<6:54:25
Did you run data preprocessing before training?
python -u -m amr_utils.preprocess.preprocess_amr -i ./ldc/amr_2/data/amrs/split/ -o ./ldc/amr_2/preprocessed_data/
After preprocessing the data, my preprocessed data folder looks as below:
total 134912
-rw-r--r-- 1 tlhoa users 1274997 May 27 01:00 dev.txt
-rw-r--r-- 1 tlhoa users 2361469 May 27 01:01 dev.txt.features
-rw-r--r-- 1 tlhoa users 2275794 May 27 01:06 dev.txt.features.nowiki
-rw-r--r-- 1 tlhoa users 1318069 May 27 01:00 test.txt
-rw-r--r-- 1 tlhoa users 2439497 May 27 01:01 test.txt.features
-rw-r--r-- 1 tlhoa users 2349069 May 27 01:06 test.txt.features.nowiki
-rw-r--r-- 1 tlhoa users 27893761 May 27 01:01 train.txt
-rw-r--r-- 1 tlhoa users 49780704 May 27 01:06 train.txt.features
-rw-r--r-- 1 tlhoa users 48391247 May 27 01:06 train.txt.features.nowiki
Do you have the same files with the same size?
If you follow the same steps above and things still do not work, I guess there might be a problem with the versions of some libraries.
@lamthuy Thanks for your response. The only difference is I am using AMR3, but apart from that everything you just described is what I have. Including all the preprocessed files.
I think it might be some package version issue as well. Could you export and create an environment file that I could use?
Best, Zoher
Could you please first try AMR2 on your machine? I have not tried AMR3 recently. I will send you my environment file via email.
Hi @lamthuy,
Thanks for the .yml file. Since it was your conda base environment, it had a lot of packages; most installed successfully, but some did not.
Despite setting up the new environment, I got the exact same error for AMR3. I would happily try it on AMR2, but I would have to purchase it, and for my project I don't really need AMR3 or AMR2, because that is not my end goal.
My goal is to show the performance of different AMR parsers on a downstream application, and I want to cite and use your ensemble for comparison. Basically, I am reproducing your pipeline to understand how it works, so that I can tinker with it for my downstream task.
Best, Zoher
Hi Zoher, I ran the training with AMR3 and hit the same issue you got. I believe it concerns a recent change to the transformers tokenizer around padding and truncation: running with batch_size = 1 works well, and the issue only appears when the batch size is larger than 1. I have tried a few options for truncation and padding, but so far none of them work. To change those options, you can look into amr_parsing/t5/models/utils.py, where you will find the tokenizer in the prepare_data function. So there are two options for now to proceed:
If you still see the same issue, please let me know; maybe there is a problem with batch_encode_plus and we need to find an alternative tokenization for this data.
Hi @lamthuy,
Thanks for getting back to me. Wouldn't training with batch_size = 1 affect the training performance of T5?
I tried torch 1.6 and transformers 4.5.1, but it now gives me a CUDA error because the NVIDIA A100 GPU is not compatible with torch 1.6.
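For anyone hitting this later, here is a sketch of version pins that may work together on an A100; the A100 (sm_80) needs CUDA 11 support, which torch added in 1.7. The exact wheel tags below are assumptions, not something verified in this thread:

```shell
# Assumed-compatible pins for an A100: sm_80 needs CUDA 11, i.e. torch >= 1.7.
# The +cu111 wheel tag below is illustrative, not verified in this thread.
pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==4.5.1
```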
Either way, I am moving on from training. Since I don't need to train it for my downstream task, I can use other non-T5 parsers. I will let you decide whether you want to close this issue or not.
Best, Zoher
I think you can try a newer torch to run on the A100; the issue concerns the transformers version rather than torch.
@lamthuy Yeah, I tried torch 1.9.0, which works on the A100, but that brought me back to the original issue.
I think I fixed the issue, could you please pull the latest code and try the training again?
@lamthuy YES! This fixed it
Hello @lamthuy ,
Could someone help me with this? I was training T5 on preprocessed AMR 3.0, but there seems to be a tensor formatting error:
Best, Zoher