At a high level, what are you trying to accomplish? These particular models suffer from some major issues (three papers), and require some expensive (to an individual) compute to run.
In-depth comments:
A line like:

```
/bin/bash: line 1: 9107 Killed python3 scripts/modeling/summarizer_input_prep.py --input /content/ms2/ms2_data/training_reviews.jsonl --output output/training.json --tokenizer facebook/bart-base --max_length 500
```
means the script is being killed, likely for memory consumption.
This is mostly because I wrote the script for a multiprocessing environment with lots of memory. I don't think it's possible to run training without at least a 40GB GPU (I trained on RTX8000s, 48GB units), and that might still fall a little short. Inference with a trained model might be possible on a smaller GPU (say a 32GB unit).
You could either rewrite the script from here, starting by making the reading of the reviews streaming, or try removing the parallelism; I think only the former will really work, and even then it will take a long time and be subject to the other resource constraints if you try the same learning approaches.
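For concreteness, a minimal sketch of the streaming approach, assuming each line of the JSONL file is one self-contained review record (the paths and the per-record step are placeholders, not the actual script internals):

```python
import json

def stream_reviews(path):
    """Yield one review record at a time so peak memory stays near a single record."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Illustrative driver: handle each review as it is read and append the result
# to the output, instead of materializing every review in memory first.
with open("output/training.json", "w") as out:
    for review in stream_reviews("ms2_data/training_reviews.jsonl"):
        # ... per-record tokenization would go here ...
        out.write(json.dumps(review) + "\n")
```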
Thank you for your reply. The goal was to reproduce the baseline, just for a better understanding of the model's performance.
The process getting killed because of memory consumption was my guess too, but since the execution went on for some time, I was not very sure.
Just so I have a clearer understanding, what makes the training/preprocessing this heavy? Is it just the sheer size of the dataset? I assumed that finetuning BART/Longformer shouldn't be a problem on Colab.
> better understanding of the model's performance
The papers I cite above also do a little of that.
> Just so I have a clearer understanding, what makes the training/preprocessing this heavy? Is it just the sheer size of the dataset?
This is tokenizing ~470k documents. The average input for a review is something like 9.4k tokens (multiple documents!); the median is somewhere around 6k or 7k tokens. These statistics are available in the paper.
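(For a quick sanity check of those numbers, something along these lines would work; the field handling is a guess, since the real records bundle several study abstracts per review:)

```python
import json
from statistics import mean, median
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Measure tokenized input length per review. Concatenating all fields is a
# crude stand-in for however the real script assembles the multi-document input.
lengths = []
with open("ms2_data/training_reviews.jsonl") as f:
    for line in f:
        record = json.loads(line)
        text = " ".join(str(v) for v in record.values())
        lengths.append(len(tokenizer(text)["input_ids"]))

print(f"mean={mean(lengths):.0f}  median={median(lengths):.0f}")
```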
> I assumed that finetuning BART/Longformer shouldn't be a problem on Colab.
The Longformer/BART models are modified and take a substantially longer input than the pretrained versions (16k max length vs. 1k max length; memory usage is quadratic in input length). You might be able to run something reasonable over the reviews with fewer inputs; I haven't explored in that direction.
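If you do go the smaller-GPU route, the cheapest experiment is probably to run a stock pretrained checkpoint with truncated inputs; a rough sketch (the model choice and lengths are illustrative, not the baseline configuration):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Placeholder for one review's concatenated study abstracts.
long_input = "Background ... Study 1 abstract ... Study 2 abstract ..."

# Truncate to BART's pretrained 1024-token limit; this fits on small GPUs
# but throws away most of the studies in a long review.
inputs = tokenizer(long_input, max_length=1024, truncation=True,
                   return_tensors="pt")
summary_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```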
You can also download a model checkpoint (see the README).
Right. Thank you for all the detailed replies!
I am trying to reproduce the baseline for the dataset.
When I run the input prep script, specifically `scripts/modeling/summarizer_input_prep.py`, it runs for close to 30 minutes and then the process is killed with the error quoted above. I get the same error both on Colab and when I run it locally. Here's a gist to reproduce the issue: https://colab.research.google.com/gist/apupneja/a63e3f261b3b5eecea087018f8486b32/ms2_baseline.ipynb