bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

en-zh model #23

Closed: Syrkovski closed this issue 4 years ago

Syrkovski commented 4 years ago

Hello, as far as I can see there is no trained en-zh model. Could you give me some advice on training one?

mbanon commented 4 years ago

Hi @Syrkovski ! Currently we only offer pre-trained models for EU languages. You can read our guide on Bicleaner training here: https://github.com/bitextor/bicleaner/wiki/How-to-train-your-Bicleaner

Syrkovski commented 4 years ago

I followed this article, but the dictionaries are not built correctly. They look like this:

afterwards NULL 0.0000124
pension NULL 0.0000372
truss NULL 0.0000124
birthday NULL 0.0000744
commemorate NULL 0.0000248

If I swap the languages, then the English side becomes NULL. I tried to build a dictionary for the en-fr pair, but the same error appears there too.

mbanon commented 4 years ago

Maybe you have empty lines in your corpora?
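One quick way to rule out empty lines is a small Python check. This is a minimal sketch, not part of Bicleaner or Moses: the helper name `find_empty_lines` is hypothetical, and the file names in the usage comment assume the Moses convention of a corpus stem plus a language suffix.

```python
from typing import Iterable, List


def find_empty_lines(lines: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    return [n for n, line in enumerate(lines, start=1) if not line.strip()]


# Hypothetical usage on both sides of the corpus (paths are assumptions):
# for path in ("bicleaner_inf/corpus.clean.en", "bicleaner_inf/corpus.clean.zh"):
#     with open(path, encoding="utf-8") as handle:
#         print(path, find_empty_lines(handle))
```

An empty list for both files means neither side contains blank segments.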

Syrkovski commented 4 years ago

No, there are no empty lines.

mbanon commented 4 years ago

That's weird. Do all lines contain "NULL", or only the ones at the beginning?

Syrkovski commented 4 years ago

The entire second column is "NULL".
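To quantify how much of a dictionary is affected, the share of entries whose second column is "NULL" can be counted. The sketch below assumes the three-column, whitespace-separated layout quoted earlier in the thread; `null_fraction` is a hypothetical helper, and the gzip path in the usage comment is an assumption.

```python
from typing import Iterable


def null_fraction(lines: Iterable[str]) -> float:
    """Fraction of well-formed 3-column entries whose second column is NULL."""
    total = 0
    nulls = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed or blank lines
        total += 1
        if parts[1] == "NULL":
            nulls += 1
    return nulls / total if total else 0.0


# Hypothetical usage on a gzipped dictionary (path is an assumption):
# import gzip
# with gzip.open("dict-en.gz", "rt", encoding="utf-8") as handle:
#     print(null_fraction(handle))
```

A value close to 1.0 would point to a failure in the alignment step rather than a few bad entries.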

mbanon commented 4 years ago

Then it seems to be something weird happening with Mgiza. Which command are you running?

Syrkovski commented 4 years ago

The command from article: "mosesdecoder/scripts/training/train-model.perl"

Syrkovski commented 4 years ago

The whole command: mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir bicleaner_inf/ --corpus bicleaner_inf/corpus.clean --e en --f zh --mgiza -mgiza-cpus 8 --parallel --first-step 1 --last-step 4 --external-bin-dir mgiza/mgizapp/bin/

First error to appear:

Merging A3.final.part tables
Executing: enchmodels/mgiza/mgizapp/bin/merge_alignment.py enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final.part> enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final
Traceback (most recent call last):
  File "enchmodels/mgiza/mgizapp/bin/merge_alignment.py", line 32, in
    st1 = files[i].readline();
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 84: ordinal not in range(128)
Exit code: 1

After this comes a whole series of errors like:

Use of uninitialized value $a in scalar chomp at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 105

Use of uninitialized value in substitution (s///) at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 40.

mbanon commented 4 years ago

Either your corpus is wrongly encoded, or that's a Moses/Mgiza error. Maybe they (https://github.com/moses-smt/) can help you better.

Syrkovski commented 4 years ago

What encoding should the corpus have?

mbanon commented 4 years ago

https://github.com/moses-smt/mgiza/blob/master/mgizapp/scripts/merge_alignment.py#L66 It seems to expect UTF-8.
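Whether a corpus file really is valid UTF-8 can be checked directly. The sketch below is an illustration with a hypothetical helper name, `first_invalid_utf8_offset`, and an assumed file path. Note that 0xe5 is a common first byte of Chinese characters in UTF-8, so the file may well be valid and the failure may instead come from Python 3.5 defaulting to the ASCII codec under a C/POSIX locale. That is a hypothesis, not something confirmed in this thread.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# 0xe5 is the lead byte of many CJK characters, for example U+5927:
assert "大".encode("utf-8")[0] == 0xE5

# Hypothetical usage (path is an assumption):
# with open("bicleaner_inf/corpus.clean.zh", "rb") as handle:
#     print(first_invalid_utf8_offset(handle.read()))
```

If the function returns None for both corpus files, the encoding of the data is not the problem and the locale of the shell running the training script is worth checking.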

Syrkovski commented 4 years ago

The command used in "How to train your Bicleaner" for training:

python3.7 bicleaner/bicleaner_train.py trainingcorpus.en-is --treat_oovs --normalize_by_length -s en -t is -d dict-en.gz -D dict-is.gz -b 1000 -c en-is.classifier -g 50000 -w 50000 -m en-is.yaml --classifier_type random_forest --noisy_examples_file_sl noisy.en-is.en --noisy_examples_file_tl noisy.en-is.is --lm_training_file_sl lmtrain.en-is.en --lm_training_file_tl lmtrain.en-is.is --lm_file_sl model.en-is.en --lm_file_tl model.en-is.is

But the lmtrain.en-is.en, lmtrain.en-is.is, model.en-is.en and model.en-is.is files were not generated anywhere. How can I get those files for my model?

mbanon commented 4 years ago

model.en-is.en and model.en-is.is are output parameters, they are trained and produced by Bicleaner.

lmtrain.en-is.en and lmtrain.en-is.is are also treated as output parameters, but you won't get them as a result of training (they are only used internally, and you won't need them afterwards anyway). I know these are confusing; we should change this in future versions (they should be optional parameters).

Syrkovski commented 4 years ago

I trained Bicleaner model and applied it to one of my datasets, so I got this probability distribution:

[Image: probability distribution]

Is it normal that I don’t have high probability values?

mbanon commented 4 years ago

Maybe. We have never tried Bicleaner on Chinese, so I don't know what distribution of probabilities to expect. In any case, you can lower your threshold (we currently set it to 0.7) if you consider data above a certain score (e.g. 0.5) clean enough. A good starting point is to inspect the generated .yaml file and look for the maximum of "accuracy_histogram". Each number in this histogram corresponds to a score range, starting at [0.0 - 0.1] and ending at [0.9 - 1.0]. For example:

accuracy_histogram: [0.5000000, 0.5001500, 0.7084000, 0.9285000, 0.9807500, 0.9810500, 0.9649000, 0.9332500, 0.8389000, 0.5000000]

The highest value is 0.9810500, which corresponds to the range of scores between 0.5 and 0.6, so in this case 0.6 is (in theory) the optimal threshold.
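The lookup described above can be sketched in a few lines of Python. This is only an illustration of the reasoning in this comment, not Bicleaner code: `best_score_range` is a hypothetical helper, and the histogram is pasted in by hand rather than parsed from the .yaml file.

```python
from typing import List, Tuple


def best_score_range(histogram: List[float]) -> Tuple[float, float, float]:
    """Return (lower bound, upper bound, accuracy) of the most accurate bin.

    Bins are assumed to be equal-width and to cover scores from 0.0 to 1.0.
    """
    best = max(range(len(histogram)), key=histogram.__getitem__)
    bins = len(histogram)
    return best / bins, (best + 1) / bins, histogram[best]


accuracy_histogram = [0.5000000, 0.5001500, 0.7084000, 0.9285000, 0.9807500,
                      0.9810500, 0.9649000, 0.9332500, 0.8389000, 0.5000000]

low, high, accuracy = best_score_range(accuracy_histogram)
print(low, high, accuracy)  # 0.5 0.6 0.98105
```

The function only reports the bin; choosing its upper bound (0.6 here) as the threshold follows the suggestion in this comment.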

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had recent activity. Thank you for your contributions.