bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

Training instructions lacking... #16

Closed phikoehn closed 5 years ago

phikoehn commented 5 years ago

I downloaded the en-de model, and am now trying to replicate training.

I had to make some guesses (e.g., how to specify the training data, what arguments the switches -m and -c expect), so I am running this in an attempt to replicate the provided model:

bicleaner-train -m en-de.yaml.my -c en-de.classifier.my -s en -t de -d dict-en.gz -D dict-de.gz --lm_file_sl en-de.model.en --lm_file_tl en-de.model.de train.en-de --normalize_by_length --treat_oovs

The log reports spawning a few processes (tokenization?) and then killing them, but it does produce a model (en-de.classifier.my) after about a minute.

The following instructions are missing:

(1) How to create the dictionary files. I am assuming these come from the lex files - is there a script that does the expected filtering of removing translations with < 0.1 of the best probability? The instructions say that these files should be tab-separated, but the provided dictionaries are space-separated.

(2) How to train the language model. The yaml file says "lm_type: CHARACTER". Does this mean that it is a character-based language model?

I assume it is also possible to let the script train a language model but the following did not do it:

bicleaner-train -m en-de.yaml.my -c en-de.classifier.my -s en -t de -d dict-en.gz -D dict-de.gz --lm_file_sl en-de.model.en.my --lm_file_tl en-de.model.de.my --normalize_by_length --treat_oovs --lm_training_file_sl train.en --lm_training_file_tl train.de train.en-de

Do I have to manually add the file names of the language model files to the meta file?

(3) The resulting model is different - how can I replicate it exactly?

(4) Is there any other preprocessing besides tokenization (e.g., truecasing, lowercasing)?

mbanon commented 5 years ago

Hi Philip, thanks for your comments, I'll take them into account to improve the documentation.

The processes being spawned and killed are due to training parallelization. This can be avoided by running "bicleaner-train-lite" instead of "bicleaner-train" (lite versions do not use parallelization).

(1) Yes, these are lex files. We have an "experimental" script that performs some operations on the lex files and removes entries with low probabilities. That script is too "alpha", so we haven't released it. And you are right, those files are space-separated.
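
As a rough sketch of that kind of filtering (not the unreleased script itself): assuming the lex file has space-separated lines of the form "source_word target_word probability" and using the < 0.1-of-best-probability threshold mentioned above (both are assumptions), and with lex.en-de.gz as a placeholder name, a two-pass awk could look like this:

zcat lex.en-de.gz > lex.tmp
# pass 1 records the best probability per source word;
# pass 2 keeps only entries within a factor of 0.1 of that best probability
awk 'NR==FNR { if ($3 > best[$1]) best[$1] = $3; next }
     $3 >= 0.1 * best[$1]' lex.tmp lex.tmp | gzip -c > dict-en.gz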

(2) Yes, it means that bicleaner-train is using a character-based language model. Please note that bicleaner-train does not train language models but a classifier, so the language models have to be provided if you plan to use them for training. Provided language models are automatically added to the yaml file.

(3) The model (classifier) is different on each training run because the algorithm used is a random forest, which is (as expected :) ) random, so it's non-deterministic (unless, perhaps, you fix a particular seed for the random number generator).

(4) Sentences are lowercased internally to calculate some (not all) of the features.

vitaka commented 5 years ago

Let me add a few clarifications:

(1) Dictionary fields are space separated: I corrected the README to make it clear.

(2) bicleaner-train DOES train the character language models too. That's why we stress in the documentation that "lmplz" must be in the PATH. You do not need to manually add the file names of the language model files.

In order to use the language models for filtering out noisy text, we need the trained language models themselves plus small "dev clean" and "dev noisy" corpora (their perplexities are used to set the filtering threshold).

There are two ways we can give Bicleaner this information:

a) We can use the very same corpus used to train the classifier, which we assume is clean enough, to train the LMs. We can subtract a small portion of that corpus to build the "dev clean" corpora. We need to provide the "dev noisy" corpora ourselves. Hence, we use the options --lm_file_sl and --lm_file_tl to tell Bicleaner where to store the language models that are trained (they are automatically added to the YAML configuration file), and --noisy_examples_file_sl and --noisy_examples_file_tl to specify the path to the "dev noisy" corpora. In order to train the models we released, we extracted the "dev noisy" corpora by applying the hard rules to the raw corpora, as explained in the README. Optionally, we can specify the number of sentences subtracted from the corpus used to train the classifier in order to build the "dev clean" corpora with the option --lm_dev_size (by default, it is 2000 sentences).

b) We can specify the exact train, "dev clean", and "dev noisy" corpora used for LM filtering. We need to add the options --lm_file_sl and --lm_file_tl as before, --lm_training_file_sl and --lm_training_file_tl for selecting the corpora for LM training, --lm_clean_examples_file_sl and --lm_clean_examples_file_tl for selecting the "dev clean" corpora, and --noisy_examples_file_sl and --noisy_examples_file_tl for selecting the "dev noisy" corpora.
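
For illustration, a variant b) invocation might look roughly like this (file names such as mymodel.yaml, lm.train.*, dev.clean.* and dev.noisy.* are placeholders, not files from the released packs; "lmplz" must be in the PATH):

bicleaner-train \
    -m mymodel.yaml -c mymodel.classifier \
    -s en -t de \
    -d dict-en.gz -D dict-de.gz \
    --normalize_by_length --treat_oovs \
    --lm_file_sl mymodel.lm.en --lm_file_tl mymodel.lm.de \
    --lm_training_file_sl lm.train.en --lm_training_file_tl lm.train.de \
    --lm_clean_examples_file_sl dev.clean.en --lm_clean_examples_file_tl dev.clean.de \
    --noisy_examples_file_sl dev.noisy.en --noisy_examples_file_tl dev.noisy.de \
    train.en-de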

phikoehn commented 5 years ago

Thank you - this is helpful.

Here is where I am:

(1) When I run bicleaner-hardrules -s en -t de --tmp_dir . train.en-de noisy I get an output file "noisy" that has exactly the same sentences as train.en-de (column 3 is either 0.00000000 or empty, column 4 is always "discard"). So this does not give me a noisy corpus.

(2) This command seems to be doing all that needs to be done, right?

bicleaner-train \
    -m en-de.yaml \
    -c en-de.classifier \
    -s en -t de \
    -d dict-en.gz -D dict-de.gz \
    --lm_file_sl en-de.model.en \
    --lm_file_tl en-de.model.de \
    --normalize_by_length \
    --treat_oovs \
    --lm_training_file_sl train.en \
    --lm_training_file_tl train.de \
    --lm_dev_size 2000 \
    --noisy_examples_file_sl train.en.bottom2k \
    --noisy_examples_file_tl train.de.bottom2k \
    train.en-de

I chose here the bottom 2000 sentence pairs from the training data as noisy examples, which is clearly the wrong thing to do.

(3) To be clear - the "killing" of processes is just fine, it is not an indication of a problem, right?

(4) I am not sure if the (optional?) language-model-related scoring is in addition to the main Bicleaner score, or if it is folded into that number.

(5) While it is nice for a random forest to be random, it is still probably a good idea to set a fixed seed, so that runs can be replicated exactly. When applying bicleaner, results also differ every single time, but possibly just by rounding errors.

(6) Are the provided models used for Paracrawl releases? Are there any changes that would give significantly better performance?

(7) Any suggestions on how to run this on languages such as Chinese, for which you have to do word segmentation? My default solution is to run sentencepiece first, but a better way would be to use dedicated word segmentation tools. How would this be integrated into bicleaner training and testing? Just by specifying a custom tokenizer?

mbanon commented 5 years ago

@phikoehn I just noticed that you edited your message while I was writing the reply, so I replied to an older version of your comment, and now my replies do not make much sense with your current message. I'm deleting my messages; please see the corrected version below, thanks!

mbanon commented 5 years ago

(1) You are right, the documentation is not clear about this (I've just changed it). The entries marked with "0.0000 discard" are the noisy ones; the ones with the 3rd and 4th columns empty are the non-noisy ones. You might want to use the "--annotated_output ANNOTATED_OUTPUT" option instead, which gives you a file with (only) the noisy sentences and the reason each was discarded (e.g. one side is too short, source and target are identical, etc.).

(2) Yes, the command is ok. But, instead of the bottom 2000, you can use the noisy sentences file produced by using "--annotated_output" with bicleaner-hardrules. Anyway, @vitaka might want to confirm this.
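
Assuming --annotated_output writes a tab-separated file whose first two columns are the source and target sentences (the column layout is an assumption here), the per-side noisy example files could be obtained along these lines:

bicleaner-hardrules -s en -t de --tmp_dir . --annotated_output noisy.annotated train.en-de train.en-de.hardrules
cut -f1 noisy.annotated > noisy.en   # assumed: column 1 = source sentence
cut -f2 noisy.annotated > noisy.de   # assumed: column 2 = target sentence

noisy.en and noisy.de would then be passed to --noisy_examples_file_sl and --noisy_examples_file_tl.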

(3) Yes, it's the expected behaviour.

(4) Yes, using language models is optional, and using them does not add an extra score (but it is taken into account when computing the Bicleaner score).

(5) We'll study using a fixed seed for a future version of Bicleaner.

(6) Yes, the models included in the downloadable language packs (https://github.com/bitextor/bitextor-data/releases/tag/bicleaner-v1.1) are those we used for the classification task of the release, but the files passed to bicleaner-train to build them are NOT included.

Not included: --noisy_examples_file_sl --noisy_examples_file_tl --lm_training_file_sl --lm_training_file_tl --lm_clean_examples_file_sl --lm_clean_examples_file_tl

But I can provide them to you if you want to train a Bicleaner model that is as similar as possible to the released one.

About the changes for better performance, do you mean adding extra features or changing current implementations? Currently, we are aware that using smaller dictionaries results in faster execution.

(7) You can run sentencepiece and build a probabilistic dictionary from the sentencepiece-segmented text. The possibility of using a custom tokenizer is already integrated: see options -S and -T in bicleaner-train and bicleaner-classify.

vitaka commented 5 years ago

Hi.

Some more clarifications follow:

(2): With that command you are obtaining the LM training corpora and "dev clean" from "train.en-de", and "dev noisy" from the files you are specifying. --lm_training_file_sl and --lm_training_file_tl only work if --lm_clean_examples_file_sl and --lm_clean_examples_file_tl are also specified. If any of the 4 flags is not set, they are all completely ignored. I modified Bicleaner to make it print detailed information about whether it is building the LM filter and which corpora it is actually using.

(4) The default behavior when Bicleaner is trained with the LM filter is to set the output score to 0 if the sum of the perplexities of the SL and TL sentences is above a certain threshold. If you add "--keep_lm_result" to bicleaner-classify, you will get two scores per sentence, the Random Forest classifier score and the LM filter score. The first score will not be set to 0 if the perplexity is too high, so you can build a more sophisticated filtering strategy if you want. The LM filter score is normalized so that it is in the [0, 1] range. Perplexities obtained from the dev clean and dev noisy sets are used to perform this normalization.
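
For example (the input/output/metadata positional argument order here is assumed; check bicleaner-classify --help for the exact usage):

bicleaner-classify --keep_lm_result test.en-de test.en-de.classified en-de.yaml

With --keep_lm_result each sentence pair then carries both the Random Forest score and the normalized LM filter score, which you can combine in whatever filtering strategy you prefer.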

(7) Yes, changing the tokenizer should work for the Random Forest classifier part. But the external tokenizers defined with -S and -T must produce an output sentence each time they read an input one (as the Moses tokenizer does with the -b option). The LM filter is trickier since a Chinese character-level language model would not make a lot of sense...
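
As a rough illustration (the Chinese segmenter script and the file names are hypothetical placeholders; only the -S/-T options and the Moses tokenizer's line-buffered -b behaviour come from the discussion above):

bicleaner-classify \
    -S "perl /path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en -b" \
    -T "/path/to/my-zh-segmenter.sh" \
    test.en-zh test.en-zh.classified en-zh.yaml

The same -S/-T options would be passed to bicleaner-train so that training and classification use identical segmentation.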

Regards,

Víctor

mbanon commented 5 years ago

Hi @phikoehn , please let us know if your problem was solved, so we can close the issue. Thanks!

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had recent activity. Thank you for your contributions.