bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

Bicleaner consumes all available memory #51

Closed cgr71ii closed 3 years ago

cgr71ii commented 3 years ago

When doing some tests, I have detected that Bicleaner makes the computer freeze because it consumes all available memory without end. The issue does not seem to depend on the specific Bicleaner model, since I have used 2 different models (one created through training and https://github.com/bitextor/bicleaner-data/releases/download/v1.4/en-fr.tar.gz). The script that fails is bicleaner_classifier_lite.py, and it is invoked the same way Bitextor invokes it.

I have verified that it also fails when bicleaner_classifier_lite.py alone is invoked with the parameters --score_only -q - -, and the attached file 0.gz is piped in with zcat. The exact command that fails is: zcat /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/en_fr/hunalign.06_02.segalign/0.gz | python3 ~/bitextor/bicleaner/bicleaner/bicleaner_classifier_lite.py --score_only -q - - /home/cgarcia/tmp/workdir-min/bicleaner-model/new-en-fr.yaml (the model new-en-fr.yaml is also attached in model.tar.gz; it is the one I obtained through training, but, as I said, I have also checked with the model provided in Bicleaner Data).

The reason seems to be the use of the WARC greenpeaceaa.warc.gz, which I split from the original greenpeace.warc.gz (the original does not fail, and it is ~20 times bigger than the split version). According to warcio check greenpeaceaa.warc.gz it seems to be a valid WARC, and I split it using split-warc.py from Bitextor.

0.gz greenpeaceaa.warc.gz model.tar.gz

ZJaume commented 3 years ago

I tried 0.gz myself with the French model and it does not fail, so I'm assuming you could have less than ~2 GB of memory available on your machine? The Bicleaner 0.14 models are quite large due to the Extremely Randomized Trees classifier, but it shouldn't use more than 2.5 GB; only 10K sentences are loaded into memory at a time.
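For reference, the memory cap comes from scoring the input in fixed-size blocks instead of loading the whole corpus at once. A minimal sketch of that pattern (illustrative only, not Bicleaner's actual code; read_blocks and classify_block are made-up names):

import sys

BLOCK_SIZE = 10000  # sentence pairs kept in memory at a time

def read_blocks(stream, block_size=BLOCK_SIZE):
    # Yield lists of at most block_size tab-separated sentence pairs.
    block = []
    for line in stream:
        block.append(line.rstrip("\n"))
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block

def classify_block(block):
    # Placeholder for feature extraction + classifier.predict_proba(...)
    return [0.0] * len(block)

if __name__ == "__main__":
    for block in read_blocks(sys.stdin):
        for score in classify_block(block):
            print(score)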

cgr71ii commented 3 years ago

Ok, I have tried again and it worked with the French model. I thought it did not work before because I left Bitextor running and it froze, so I assumed that was caused by the problem I was debugging. Even so, I have verified again that it consumes all the available memory if model.tar.gz is used (if you are going to give it a try, you will need to modify the paths in the new-en-fr.yaml file). The machine I am using has enough memory, and before it hangs I can see with htop how the memory is being consumed until it runs out and fails.

I guess the problem is not only related to the WARC; maybe it is related to the model as well.

ZJaume commented 3 years ago

Well, I've run your model and loading the classifier eats the whole memory:

https://github.com/bitextor/bicleaner/blob/0dc06d3f805521abeda86602342af4fc0aaa288f/bicleaner/bicleaner_classifier_lite.py#L111

Your model has been trained with --classifier random_forest, but training and saving a RandomForest classifier inside the Python interpreter, as Bicleaner does, does not produce such a problem. Could you upload your training file and the full training command? Also, have you tried training with --classifier extra_trees? It is the default; it is heavier but better.
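For context, the save/load round trip being discussed looks roughly like this, using scikit-learn and joblib (a minimal sketch with toy data; the feature extraction, parameters, and file names here are illustrative, not Bicleaner's actual training code):

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix and labels; real training uses Bicleaner's sentence-pair features.
X = np.random.rand(1000, 15)
y = np.random.randint(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=200, n_jobs=1, random_state=0)
clf.fit(X, y)

# Serialize the fitted classifier (compressed) and load it back.
joblib.dump(clf, "en-fr.classifier", compress=3)
clf2 = joblib.load("en-fr.classifier")
print(clf2.predict_proba(X[:5]))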

cgr71ii commented 3 years ago

Yes, I have used --classifier random_forest, and I have not tried --classifier extra_trees since I would like to avoid spending more time on training. The full command I have used is:

python3 ./bicleaner/bicleaner/bicleaner_train.py /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/train.y8WkIIaR -S ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l en -T ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l fr --treat_oovs --normalize_by_length -s en -t fr -d /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.e2f.gz -D /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.f2e.gz -f /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.en.filtered.vcb.gz -F /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.fr.filtered.vcb.gz -c /home/cgarcia/tmp/workdir-min/bicleaner-model/en-fr.classifier -m /home/cgarcia/tmp/workdir-min/bicleaner-model/new-en-fr.yaml --classifier_type random_forest

The command is the one used in Bitextor. If you want to reproduce it, maybe the easiest way would be to run Bitextor directly with greenpeaceaa.warc on the snake_performance branch. Anyway, if you want to reproduce it directly, the bash script would be:

training=$(mktemp /tmp/train.XXXXXXXX)
paste <(xzcat -f /home/cgarcia/tmp/workdir-min/data/parallel-corpus/DGT/DGT.clipped.en-fr.en.xz) <(xzcat -f /home/cgarcia/tmp/workdir-min/data/parallel-corpus/DGT/DGT.clipped.en-fr.fr.xz) > $training
DIR=$(dirname /tmp/new-en-fr.yaml)
lines=$(cat $training | wc -l)
trainlines=$(echo "$lines*4/10" | bc)
python3 ./bicleaner/bicleaner/bicleaner_train.py /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/train.y8WkIIaR -S ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l en -T ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l fr --treat_oovs --normalize_by_length -s en -t fr -d /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.e2f.gz -D /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.f2e.gz -f /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.en.filtered.vcb.gz -F /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.fr.filtered.vcb.gz -c /home/cgarcia/tmp/workdir-min/bicleaner-model/en-fr.classifier -m /home/cgarcia/tmp/workdir-min/bicleaner-model/new-en-fr.yaml --classifier_type random_forest

All necessary files are attached.

files.tar.gz

ZJaume commented 3 years ago

I'm unable to reproduce the same error: all the classifiers that I've trained with your command and your data load correctly. It seems to be a corruption of the classifier file that you trained. Even so, I'm able to decompress it and then load it successfully:

zlib-flate -uncompress < en-fr.classifier > en-fr.classifier.dec
ipython
In [3]: import joblib
In [4]: joblib.load('en-fr.classifier.dec')
Out[4]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
                       oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

If it's not a corrupted file, it could be a joblib bug. But I'm not able to reproduce it, so I can't report it. Do you experience the same error if you train it again?
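The same check can also be done from Python alone; assuming the .classifier file is a plain zlib stream (which is what zlib-flate -uncompress handles), something like this should be equivalent:

import zlib
import joblib

# Decompress the classifier to a temporary copy, then load it with joblib.
with open("en-fr.classifier", "rb") as f:
    raw = zlib.decompress(f.read())

with open("en-fr.classifier.dec", "wb") as f:
    f.write(raw)

clf = joblib.load("en-fr.classifier.dec")
print(clf)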

lpla commented 3 years ago

Are you both using the same code branch and the same requirements.txt? Also, which Python version are you using? Just to be sure we are all on the same page.

ZJaume commented 3 years ago

I've been using Bicleaner 0.14, which is the same as the submodule in the snake_performance branch (well, 2 commits of difference, but neither of them changes any line of Python code), and Python 3.6.9.

cgr71ii commented 3 years ago

I've been running tests, and it seems the problem is related to the fact that I'm running 2 Bicleaner training processes in parallel. When I don't run these processes at the same time, it doesn't fail, but when I do, it fails (specifically, I'm launching the processes with & at the end of the script).

Is it possible that Bicleaner training fails when several trainings are executed in parallel? Is there any file used across multiple instances? I think this might be the reason: if the same file is used by multiple processes, and this hypothetical file is opened in write mode, one process might make the other fail.

This might be a race condition, since the run-tests.sh I had been using didn't throw any problem, and now that I'm running basically the same tests, but with a split WARC, it fails.

BTW, my dependencies are the ones defined in the requirements.txt file of Bitextor on the snake_performance branch. I am using Python 3.8.5, so our versions differ, but I don't think that is the problem.

ZJaume commented 3 years ago

There are no common files used between training instances unless the instances share file paths in their parameters, for example, if both save the classifier file to the same path. That is what could have happened.
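One way to avoid that collision when launching several trainings in parallel is to give each instance its own output directory, e.g. something like this (a sketch only; the paths are made up and most of the training flags from the full command above are omitted):

import subprocess
import tempfile

# Launch each training run with its own model directory so the .classifier
# and .yaml outputs cannot overwrite each other.
procs = []
for part in ["/path/to/train.part1", "/path/to/train.part2"]:
    outdir = tempfile.mkdtemp(prefix="bicleaner-model-")
    cmd = [
        "python3", "bicleaner/bicleaner/bicleaner_train.py", part,
        "-c", f"{outdir}/en-fr.classifier",
        "-m", f"{outdir}/new-en-fr.yaml",
        "--classifier_type", "random_forest",
        # ...plus the tokenizer/dictionary flags from the full command above
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()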

cgr71ii commented 3 years ago

Ok, problem found. The problem was exactly what you described: the classifiers were being stored in the same directory, and since they both have the same name (i.e. {lang1}-{lang2}.classifier), one instance was writing its classifier over the other instance's classifier.

In the end, it was not an issue but a misconception about how the training files were being generated. Sorry :(