allenai / ms2


Error when running the input prep scripts #12

Closed apupneja closed 1 year ago

apupneja commented 1 year ago

I am trying to reproduce the baseline for the dataset.

When I run the input prep script, specifically scripts/modeling/summarizer_input_prep.py, it runs for close to 30 minutes and then the process is killed with the following output.

2023-02-28 06:06:32,014 Starting new HTTPS connection (1): s3.amazonaws.com:443
2023-02-28 06:06:32,196 https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/facebook/bart-base/config.json HTTP/1.1" 200 0
2023-02-28 06:06:32,198 Attempting to acquire lock 140314322770480 on /root/.cache/torch/transformers/09f4fcaeaf785dd3b97b085d6e3510c7081f586ec8e75981683c6299c0f81d9d.e8d516ad807436d395effad8c2326786872659b7dd1210827ac67c761198a0eb.lock
2023-02-28 06:06:32,199 Lock 140314322770480 acquired on /root/.cache/torch/transformers/09f4fcaeaf785dd3b97b085d6e3510c7081f586ec8e75981683c6299c0f81d9d.e8d516ad807436d395effad8c2326786872659b7dd1210827ac67c761198a0eb.lock
2023-02-28 06:06:32,202 Starting new HTTPS connection (1): s3.amazonaws.com:443
2023-02-28 06:06:32,414 https://s3.amazonaws.com:443 "GET /models.huggingface.co/bert/facebook/bart-base/config.json HTTP/1.1" 200 1553
Downloading: 100% 1.55k/1.55k [00:00<00:00, 774kB/s]
2023-02-28 06:06:32,425 Attempting to release lock 140314322770480 on /root/.cache/torch/transformers/09f4fcaeaf785dd3b97b085d6e3510c7081f586ec8e75981683c6299c0f81d9d.e8d516ad807436d395effad8c2326786872659b7dd1210827ac67c761198a0eb.lock
2023-02-28 06:06:32,426 Lock 140314322770480 released on /root/.cache/torch/transformers/09f4fcaeaf785dd3b97b085d6e3510c7081f586ec8e75981683c6299c0f81d9d.e8d516ad807436d395effad8c2326786872659b7dd1210827ac67c761198a0eb.lock
2023-02-28 06:06:32,435 Starting new HTTPS connection (1): s3.amazonaws.com:443
2023-02-28 06:06:32,604 https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/roberta-large-vocab.json HTTP/1.1" 200 0
2023-02-28 06:06:32,607 Attempting to acquire lock 140314322769472 on /root/.cache/torch/transformers/1ae1f5b6e2b22b25ccc04c000bb79ca847aa226d0761536b011cf7e5868f0655.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock
2023-02-28 06:06:32,607 Lock 140314322769472 acquired on /root/.cache/torch/transformers/1ae1f5b6e2b22b25ccc04c000bb79ca847aa226d0761536b011cf7e5868f0655.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock
2023-02-28 06:06:32,610 Starting new HTTPS connection (1): s3.amazonaws.com:443
2023-02-28 06:06:32,808 https://s3.amazonaws.com:443 "GET /models.huggingface.co/bert/roberta-large-vocab.json HTTP/1.1" 200 898823
Downloading: 100% 899k/899k [00:00<00:00, 8.80MB/s]
2023-02-28 06:06:32,912 Attempting to release lock 140314322769472 on /root/.cache/torch/transformers/1ae1f5b6e2b22b25ccc04c000bb79ca847aa226d0761536b011cf7e5868f0655.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock
2023-02-28 06:06:32,912 Lock 140314322769472 released on /root/.cache/torch/transformers/1ae1f5b6e2b22b25ccc04c000bb79ca847aa226d0761536b011cf7e5868f0655.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock
2023-02-28 06:06:32,915 Starting new HTTPS connection (1): s3.amazonaws.com:443
2023-02-28 06:06:33,080 https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/roberta-large-merges.txt HTTP/1.1" 200 0
2023-02-28 06:06:33,083 Attempting to acquire lock 140314322770480 on /root/.cache/torch/transformers/f8f83199a6270d582d6245dc100e99c4155de81c9745c6248077018fe01abcfb.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
2023-02-28 06:06:33,084 Lock 140314322770480 acquired on /root/.cache/torch/transformers/f8f83199a6270d582d6245dc100e99c4155de81c9745c6248077018fe01abcfb.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
2023-02-28 06:06:33,086 Starting new HTTPS connection (1): s3.amazonaws.com:443
2023-02-28 06:06:33,292 https://s3.amazonaws.com:443 "GET /models.huggingface.co/bert/roberta-large-merges.txt HTTP/1.1" 200 456318
Downloading: 100% 456k/456k [00:00<00:00, 5.67MB/s]
2023-02-28 06:06:33,374 Attempting to release lock 140314322770480 on /root/.cache/torch/transformers/f8f83199a6270d582d6245dc100e99c4155de81c9745c6248077018fe01abcfb.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
2023-02-28 06:06:33,374 Lock 140314322770480 released on /root/.cache/torch/transformers/f8f83199a6270d582d6245dc100e99c4155de81c9745c6248077018fe01abcfb.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
Special tokens have been added in the vocabulary, make sure the associated word emebedding are fine-tuned or trained.
/bin/bash: line 1:  9107 Killed                  python3 scripts/modeling/summarizer_input_prep.py --input /content/ms2/ms2_data/training_reviews.jsonl --output output/training.json --tokenizer facebook/bart-base --max_length 500

I get the same error both on Colab and when I run it locally. Here's a gist to reproduce the issue: https://colab.research.google.com/gist/apupneja/a63e3f261b3b5eecea087018f8486b32/ms2_baseline.ipynb

jayded commented 1 year ago

At a high level, what are you trying to accomplish? These particular models suffer from some major issues (documented in three papers), and require compute that is expensive for an individual to run.

In-depth comments:

A line like:

/bin/bash: line 1:  9107 Killed                  python3 scripts/modeling/summarizer_input_prep.py --input /content/ms2/ms2_data/training_reviews.jsonl --output output/training.json --tokenizer facebook/bart-base --max_length 500

means the script is being killed, likely for memory consumption.
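One hedged way to confirm that hypothesis (not something from this repository): log the process's resident set size from a background thread, so the last value printed before the kill shows roughly how far memory grew. This reads /proc/self/status, so it is Linux-only (which covers Colab).

```python
# Hypothetical helper, not part of scripts/modeling/summarizer_input_prep.py:
# periodically print the current resident set size (VmRSS) so an OOM kill
# leaves a trail of how much memory the process had reached.
import threading
import time

def log_rss(interval_s: float = 5.0) -> None:
    def worker() -> None:
        while True:
            with open("/proc/self/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        print(line.strip(), flush=True)
                        break
            time.sleep(interval_s)
    threading.Thread(target=worker, daemon=True).start()

# Calling log_rss() at the top of the prep script's main() would show memory
# growing while the reviews are loaded and tokenized.
```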

This is mostly because I wrote the script for a multiprocessing environment with lots of memory. I don't think it's possible to run training without at least a 40GB GPU (I trained on RTX8000s, 48GB units), and that might still fall a little short. Inference for a trained model might be possible on a smaller GPU (say a 32GB unit).

You could either rewrite the script from here, starting by making the review reading streaming, or try removing the parallelism; I think only the former will really work, and it will still take a long time and be subject to the other resource constraints if you use the same learning approaches.
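For concreteness, here is a minimal sketch of the streaming idea, assuming the input is a JSONL file with one review per line; the "target" field and the output format are assumptions for illustration, not what summarizer_input_prep.py actually writes.

```python
# Minimal streaming sketch (illustration only, not the repository's script):
# read and tokenize one review at a time so memory stays roughly constant,
# instead of materializing every review before tokenization.
import json
from transformers import AutoTokenizer

def stream_reviews(path):
    """Yield one parsed review per JSONL line instead of loading the whole file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

with open("output/training.json", "w") as out:
    for review in stream_reviews("ms2_data/training_reviews.jsonl"):
        # "target" is an assumed field name; adapt to the real review schema.
        encoded = tokenizer(review.get("target", ""), max_length=500, truncation=True)
        out.write(json.dumps({"input_ids": encoded["input_ids"]}) + "\n")
```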

apupneja commented 1 year ago

Thank you for your reply. The goal was to reproduce the baseline, just for a better understanding of the models' performance.

The process getting killed because of memory consumption was my guess too, but since the execution went on for some time, I was not very sure.

Just so I have a clearer understanding, what makes the training/preprocessing this heavy? Is it just the sheer size of the dataset? I assumed that fine-tuning BART/Longformer shouldn't be a problem on Colab.

jayded commented 1 year ago

better understanding of the models' performance

The papers I cite above also do a little of that.

Just so I have a clearer understanding, what makes the training/preprocessing this heavy? Is it just the sheer size of the dataset?

This is tokenizing ~470k documents. The average input for a review is something like 9.4k tokens (multiple documents!), and the median is somewhere around 6k or 7k tokens. These statistics are available in the paper.

I assumed that finetuning BART/Longformer shouldn't be a problem on Colab.

The Longformer/BART models are modified and take a substantially longer input than the pretrained versions (16k max length vs. 1k max length; memory usage is quadratic in document length). You might be able to run something reasonable over the reviews with fewer inputs; I haven't explored in that direction.
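As a hypothetical illustration of the "fewer inputs" direction (not something the repository provides): cap the number of studies per review before building the summarizer input, so the concatenated sequence fits a much smaller maximum length. The "studies" field name is an assumption; check it against the actual review JSON.

```python
# Hypothetical pre-filtering step, not part of the MS^2 codebase: keep only the
# first MAX_STUDIES studies per review and write a smaller JSONL file that the
# existing prep script could then consume.
import json

MAX_STUDIES = 5  # assumed cap; tune to the GPU memory available

def truncate_review(review):
    review = dict(review)
    # "studies" is an assumed field name for the per-review input documents.
    review["studies"] = review.get("studies", [])[:MAX_STUDIES]
    return review

with open("ms2_data/training_reviews.jsonl") as src, \
        open("ms2_data/training_reviews.small.jsonl", "w") as dst:
    for line in src:
        if line.strip():
            dst.write(json.dumps(truncate_review(json.loads(line))) + "\n")
```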

You can also download a model checkpoint (see the README).

apupneja commented 1 year ago

Right. Thank you for all the detailed replies!