I tried 0.gz myself with the French model and it does not fail, so I'm assuming you could have less than ~2GB of memory available on your machine? The Bicleaner 0.14 models are quite large due to the Extremely Randomized Trees classifier, but it shouldn't use more than 2.5GB; only 10K sentences are loaded into memory at a time.
Ok, I have tried again and it worked with the French model. I thought it had not worked before because I had left Bitextor running and it froze, and I guessed that had happened because of this problem I was debugging. Even so, I have checked again that it consumes all the available memory if model.tar.gz is used (if you are going to give it a try, you will need to modify the paths in the new-en-fr.yaml file). The machine I am using has enough memory, and with htop I can watch the memory being consumed until it runs out and the process fails (before that, it hangs my PC).
I guess the problem is not only related to the WARC; maybe it is related to the model.
Well, I've run your model and loading the classifier eats the whole memory. Your model has been trained with --classifier random_forest, but training and saving a RandomForest classifier inside the Python interpreter, as Bicleaner does, does not result in such a problem. Could you upload your training file and the full training command? Also, have you tried training with --classifier extra_trees (which is the default; it is heavier but better)?
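For reference, a minimal sketch of what the two classifier choices roughly correspond to in scikit-learn; the hyperparameters here are illustrative, not necessarily the ones bicleaner_train.py actually uses, and the model is persisted with joblib, as the loading step later in this thread shows:

# Rough sketch of the two --classifier_type options (hyperparameters illustrative only).
import joblib
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
y = [1, 0, 1, 0]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)  # random_forest
et = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)    # extra_trees (default, heavier)

joblib.dump(rf, "rf.classifier")  # Bicleaner saves the fitted classifier with joblib
joblib.dump(et, "et.classifier")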
Yes, I have used --classifier random_forest, and I have not tried --classifier extra_trees since I would like to avoid spending more time on training. The full command I have used is:
python3 ./bicleaner/bicleaner/bicleaner_train.py /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/train.y8WkIIaR -S ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l en -T ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l fr --treat_oovs --normalize_by_length -s en -t fr -d /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.e2f.gz -D /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.f2e.gz -f /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.en.filtered.vcb.gz -F /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.fr.filtered.vcb.gz -c /home/cgarcia/tmp/workdir-min/bicleaner-model/en-fr.classifier -m /home/cgarcia/tmp/workdir-min/bicleaner-model/new-en-fr.yaml --classifier_type random_forest
The command is the one used in Bitextor. If you want to reproduce it, maybe the easiest way would be to run Bitextor directly with greenpeaceaa.warc on the snake_performance branch. Anyway, if you want to reproduce it directly, the bash script would be:
training=$(mktemp /tmp/train.XXXXXXXX)
paste <(xzcat -f /home/cgarcia/tmp/workdir-min/data/parallel-corpus/DGT/DGT.clipped.en-fr.en.xz) <(xzcat -f /home/cgarcia/tmp/workdir-min/data/parallel-corpus/DGT/DGT.clipped.en-fr.fr.xz) > $training
DIR=$(dirname /tmp/new-en-fr.yaml)
lines=$(cat $training | wc -l)
trainlines=$(echo "$lines*4/10" | bc)
python3 ./bicleaner/bicleaner/bicleaner_train.py /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/train.y8WkIIaR -S ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l en -T ./preprocess/moses/tokenizer/tokenizer.perl -q -b -a -l fr --treat_oovs --normalize_by_length -s en -t fr -d /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.e2f.gz -D /home/cgarcia/tmp/workdir-min/permanent/en-fr.dic.generated.lex.f2e.gz -f /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.en.filtered.vcb.gz -F /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/tempgizamodel.en-fr/corpus.fr.filtered.vcb.gz -c /home/cgarcia/tmp/workdir-min/bicleaner-model/en-fr.classifier -m /home/cgarcia/tmp/workdir-min/bicleaner-model/new-en-fr.yaml --classifier_type random_forest
All necessary files are attached.
I'm unable to reproduce the same error; all the classifiers that I've trained with your command and your data load correctly. It seems to be a corruption of the classifier file that you trained. Even so, I'm able to decompress your file and then load it successfully:
zlib-flate -uncompress < en-fr.classifier > en-fr.classifier.dec
ipython
In [3]: import joblib
In [4]: joblib.load('en-fr.classifier.dec')
Out[4]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='entropy', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
oob_score=False, random_state=0, verbose=0,
warm_start=False)
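For reference, the same decompress-and-load check can be done from a single Python script; this is a sketch that assumes the .classifier file is a plain zlib stream, as the zlib-flate step above suggests:

import io
import zlib
import joblib

# Equivalent of `zlib-flate -uncompress` followed by joblib.load: inflate the
# raw zlib stream and hand the decompressed bytes to joblib.
with open("en-fr.classifier", "rb") as f:
    clf = joblib.load(io.BytesIO(zlib.decompress(f.read())))
print(clf)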
If it's not a corrupted file, it should be a joblib bug. But I'm not able to reproduce it, so I can't report it. Do you get the same error if you train it again?
Are you both using the same code branch and the same requirements.txt? Also, which Python version are you using? Just to be sure we are all on the same page.
I've been using Bicleaner 0.14, which is the same as the submodule in the snake_performance branch (well, 2 commits of difference, but none of them change any line of Python code), and Python 3.6.9.
I've been doing some tests, and it seems that the problem is related to the fact that I'm running 2 parallel processes that run the Bicleaner training. When I don't run these processes at the same time, it doesn't fail, but when I do, it fails (concretely, I'm running the processes with & at the end of the script).
Is it possible that Bicleaner training fails precisely because it is executed in parallel? Is there any file which is used across multiple instances? I think this might be the reason: if the same file is used by multiple processes, and this hypothetical file has been opened in write mode, one process might be making the other fail.
This problem might be a race condition, since the run-tests.sh I had been using didn't throw any problem, and now that I'm running basically the same tests, but with a WARC which is split, it is failing.
BTW, my dependencies are the ones defined in the requirements.txt file of Bitextor in the snake_performance branch. I am using Python 3.8.5, so our versions don't match, but I don't think that is the problem.
There are no common files used between training instances unless the instances share file paths in their parameters; for example, if both save the classifier file to the same path. This is what could have happened.
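To illustrate the hypothesis, here is a minimal Python sketch (the file names are hypothetical) of how two runs dumping to the same path clobber each other, and how a per-run path avoids it:

# Two training "instances" dumping to the same path overwrite each other; a
# unique per-run path (analogous to passing a distinct -c to bicleaner_train.py)
# avoids the clash.
import tempfile
import joblib
from sklearn.ensemble import RandomForestClassifier

clf_a = RandomForestClassifier(n_estimators=5, random_state=0).fit([[0], [1]], [0, 1])
clf_b = RandomForestClassifier(n_estimators=5, random_state=1).fit([[0], [1]], [0, 1])

shared = "/tmp/en-fr.classifier"
joblib.dump(clf_a, shared)
joblib.dump(clf_b, shared)   # silently replaces instance A's classifier

fd, private = tempfile.mkstemp(suffix=".classifier")  # one file per run instead
joblib.dump(clf_b, private)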
Ok, problem found. The problem was exactly what you described: the classifiers were being stored in the same directory, and since they both have the same name (i.e. {lang1}-{lang2}.classifier), one instance was storing its classifier over the other instance's classifier.
In the end, it was not an issue, but a misconception about how the training files were being generated. Sorry :(
When doing some tests, I have detected that Bicleaner makes the computer freeze because it consumes all available memory without end. The issue does not seem to depend on the concrete Bicleaner model, since I have used 2 different models (one created through training and https://github.com/bitextor/bicleaner-data/releases/download/v1.4/en-fr.tar.gz). The file which makes it fail is bicleaner_classifier_lite.py, and it is invoked like it is done in Bitextor. I have checked that it also fails when only bicleaner_classifier_lite.py is invoked with the parameters --score_only -q - - and the attached file 0.gz is piped in with zcat. The exact command which fails is:
zcat /home/cgarcia/tmp/workdir-min/transient-genbicleaner-en-fr/en_fr/hunalign.06_02.segalign/0.gz | python3 ~/bitextor/bicleaner/bicleaner/bicleaner_classifier_lite.py --score_only -q - - /home/cgarcia/tmp/workdir-min/bicleaner-model/new-en-fr.yaml
(the model new-en-fr.yaml is attached too in model.tar.gz, which is the one I got through training, but as I have said, I have checked with the model provided in Bicleaner Data as well). The reason seems to be the use of the WARC greenpeaceaa.warc.gz, which is a WARC that I have split from the original greenpeace.warc.gz (this WARC does not fail, and is ~20 times bigger than the split version). When I run warcio check greenpeaceaa.warc.gz it seems to be a valid WARC, and I have split this WARC using split-warc.py from Bitextor.
Attachments: 0.gz, greenpeaceaa.warc.gz, model.tar.gz