ghost closed this issue 4 years ago
Hi @sugiyamath !
Thank you for letting us know about this issue.
After inspecting data, we found out that the en-ru language pack is completely wrong (probably from an older version of bicleaner), and shouldn't be there, so it has just been removed.
The correct format is the one in en-fr; so for example, in dict-fr.gz:
using utilisaient 0.0431034
the probability of "utilisaient" (fr) being translated as "using" (en) is 0.0431034.
Sorry for any inconvenience; please let us know if you have any other questions.
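A quick way to sanity-check such a dictionary file is to confirm that the entries behave like probabilities: grouped by the source word (second column), the values should each sum to at most 1. This is a hedged sketch with made-up sample data, assuming the "target source probability" column order shown above:

```shell
# Build a tiny sample dictionary (hypothetical entries, same assumed format
# as dict-fr.gz after decompression: "tgt_word src_word probability").
printf '%s\n' \
  'using utilisaient 0.6' \
  'used utilisaient 0.3' \
  'hello bonjour 0.9' > sample-dict.txt

# For each source word, sum its probabilities and warn if the sum exceeds 1
# (a small epsilon tolerates floating-point rounding).
awk '{sum[$2] += $3}
     END {for (w in sum) if (sum[w] > 1.0001)
            printf "WARNING: %s sums to %s\n", w, sum[w]}' sample-dict.txt
```

For a real file you would pipe `zcat dict-fr.gz` into the same awk script instead of using the sample file.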
Hi, @mbanon I have another question about bicleaner on bitextor. According to bitextor's Snakefile, it uses mgiza to create a probabilistic dictionary:
rule mgiza:
    input:
        vcb1="{prefix}.{l1}.vcb",
        vcb2="{prefix}.{l2}.vcb",
        snt="{prefix}.{l2}-{l1}-int-train.snt",
        cooc="{prefix}.{l2}-{l1}.cooc"
    output:
        "{prefix}.{l2}-{l1}.t3.final"
    shell:
        "{PROFILING} {BITEXTOR}/mgiza/mgizapp/bin/mgiza -ncpus 8 -CoocurrenceFile {input.cooc} -c {input.snt} -m1 5 -m2 0 -m3 3 -m4 3 -mh 5 -m5 0 -model1dumpfrequency 1 -o {wildcards.prefix}.{wildcards.l2}-{wildcards.l1} -s {input.vcb1} -t {input.vcb2} -emprobforempty 0.0 -probsmooth 1e-7 2> /dev/null > /dev/null"
...
rule bicleaner_train_model:
    input:
        corpusl1=expand("{dataset}.{lang}.xz", dataset=bicleanerTrainPrefixes, lang=LANG1),
        corpusl2=expand("{dataset}.{lang}.xz", dataset=bicleanerTrainPrefixes, lang=LANG2),
        t3_1="{dir}/corpus.{l1}-{l2}.t3.final".format(dir=mgizaModelDir, l1=LANG1, l2=LANG2),
        t3_2="{dir}/corpus.{l2}-{l1}.t3.final".format(dir=mgizaModelDir, l1=LANG1, l2=LANG2)
    output:
        "{model}".format(model=BICLEANER_CONFIG)
    priority: 40
    shell:
        "training=$(mktemp {TMPDIR}/train.XXXXXXXX); "
        "paste <(xzcat -f {input.corpusl1}) <(xzcat -f {input.corpusl2}) > $training; "
        "DIR=$(dirname {BICLEANER_CONFIG}); "
        "echo $DIR; "
        "cp {input.t3_1} $DIR/{LANG1}.dic; "
        "cp {input.t3_2} $DIR/{LANG2}.dic; "
        "gzip $DIR/{LANG1}.dic $DIR/{LANG2}.dic; "
        "lines=$(cat $training | wc -l); "
        "trainlines=$(echo \"$lines*4/10\" | bc); "
        "testlines=$(echo \"($lines-2*$trainlines)/2\" | bc); "
        '{PROFILING} python3 {BITEXTOR}/bicleaner/bicleaner/bicleaner_train.py $training -S "{WORDTOK1}" -T "{WORDTOK2}" --treat_oovs --normalize_by_length -s {LANG1} -t {LANG2} -d $DIR/{LANG1}.dic.gz -D $DIR/{LANG2}.dic.gz -c $DIR/{LANG1}-{LANG2}.classifier -g $trainlines -w $trainlines --good_test_examples $testlines --wrong_test_examples $testlines -m {BICLEANER_CONFIG} --classifier_type random_forest; '
        "rm $training"
https://github.com/bitextor/bitextor/blob/master/snakemake/Snakefile
corpus.{l2}-{l1}.t3.final is a dictionary, but I think it is in the same format as the en-ru data.
Does it work correctly?
Hi again, @sugiyamath ! To be honest, I have never trained Bicleaner using Snakemake (I always train Bicleaner beforehand), so I cannot guarantee that it works. Probably our Bitextor friends in charge of the Snakefile can shed some light on this matter (@lpla , @zuny26 , @mespla :wave: )
This is my recipe to build the probabilistic dictionaries from scratch:
You need a corpus. I usually download different corpora from Opus and concatenate them. From now on, I will call this corpus "bigcorpus.en-is" (also assuming I am building English-Icelandic).
Split source (en) and target (is) sides:
cat bigcorpus.en-is | cut -f1 > bigcorpus.en-is.en
cat bigcorpus.en-is | cut -f2 > bigcorpus.en-is.is
Tokenize with Moses:
/home/mbanon/bitextor/preprocess/moses/tokenizer/tokenizer.perl -l en < bigcorpus.en-is.en > bigcorpus.en-is.tok.en
/home/mbanon/bitextor/preprocess/moses/tokenizer/tokenizer.perl -l is < bigcorpus.en-is.is > bigcorpus.en-is.tok.is
Lowercase the whole tokenized thing:
tr '[:upper:]' '[:lower:]' < bigcorpus.en-is.tok.en > bigcorpus.en-is.tok.low.en
tr '[:upper:]' '[:lower:]' < bigcorpus.en-is.tok.is > bigcorpus.en-is.tok.low.is
Change the names of the tokenized-lowercased corpus (this is required by mgiza):
cp bigcorpus.en-is.tok.low.en bigcorpus.en-is.clean.en
cp bigcorpus.en-is.tok.low.is bigcorpus.en-is.clean.is
Run Mgiza in this obscure way:
/home/mbanon/mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir /home/mbanon/paracrawl-newlangs/en-is/ --corpus /home/mbanon/paracrawl-newlangs/en-is/bigcorpus.en-is.clean -f en -e is --mgiza -mgiza-cpus=16 --parallel --first-step 1 --last-step 4 --external-bin-dir /home/mbanon/mgiza/mgizapp/bin/
The important parts here, to obtain the prob dicts, are the "first step" and "last step" parameters.
Change columns order in model/lex.e2f and model/lex.f2e:
cd model; awk '{print $2" " $1" " $3}' lex.e2f > lex.e2f_2 && mv lex.e2f_2 lex.e2f && awk '{print $2" " $1" " $3}' lex.f2e > lex.f2e_2 && mv lex.f2e_2 lex.f2e
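To see what that awk one-liner does, here is a toy demonstration on made-up lex entries (the real lex.e2f/lex.f2e files come out of train-model.perl; the sample lines here are hypothetical):

```shell
# Fake lex file with the original "col1 col2 prob" order.
printf '%s\n' 'bonjour hello 0.8' 'monde world 0.7' > lex.sample

# Swap the first two columns, keeping the probability in place.
awk '{print $2" "$1" "$3}' lex.sample
# → "hello bonjour 0.8" and "world monde 0.7"
```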
lex.e2f (the en-is prob dict) and lex.f2e (is-en) are fully compatible with Bicleaner, but you probably want to make them "lighter" by removing very uncommon translations (let's say, those with a probability 10 times lower than the maximum one). And that's it, I hope it helps! I really want to write a full tutorial on training Bicleaner from scratch; hopefully it will happen soon :)
Is there a way to obtain lex.e2f and lex.f2e without having Moses as a dependency in Bitextor? It looks like we can use the "corpus.SL-TL.t3.final" and "corpus.??.vcb" files from mgiza to obtain the prob dicts. Also, this way we can customize the filtering of uncommon translations on the go.
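For reference, a minimal sketch of that idea, assuming the usual GIZA++/mgiza formats (*.vcb lines are "id word count" and *.t3.final lines are "src_id tgt_id prob"; the file names and sample data here are illustrative, not Bitextor's actual rule):

```shell
# Toy vocabularies and t-table with GIZA++-style IDs.
printf '%s\n' '2 hello 10' '3 world 5'   > corpus.en.vcb
printf '%s\n' '2 bonjour 10' '3 monde 5' > corpus.fr.vcb
printf '%s\n' '2 2 0.8' '3 3 0.7'        > corpus.en-fr.t3.final

# Load both vocabularies into id->word maps, then translate the IDs in the
# t-table into words to get a word-level probabilistic dictionary.
awk 'FILENAME == ARGV[1] {src[$1] = $2; next}
     FILENAME == ARGV[2] {tgt[$1] = $2; next}
     {print src[$1], tgt[$2], $3}' \
    corpus.en.vcb corpus.fr.vcb corpus.en-fr.t3.final
# → "hello bonjour 0.8" and "world monde 0.7"
```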
@lpla No idea, never tried before :woman_shrugging:
Okay, we found out that we can do that with an already implemented rule in Bitextor. But one more question: the probabilistic dictionary must be generated from different corpora than the one used for Bicleaner model training, right?
It's not mandatory. The more lines in the corpus used for the dictionaries, the better (I use corpora of around a few million lines). For the training corpus, the cleaner the better (I use around 100K lines). So you probably want a bigger corpus for the dictionaries, even if it's "not as clean" as the training one.
I think I fixed the dictionary creation rule in Bitextor, which should now output a proper Russian Bicleaner model instead of the one uploaded previously. So this fixes @sugiyamath's initial issue in general. I just implemented it in https://github.com/bitextor/bitextor/commit/9007c5e15f045b3e6fd520330022ad97b7a7ac2a
But, before releasing this in a new Bitextor version, I would like to test it properly. @mbanon, could you provide me with the original resources and/or the filters applied, so I can manually reproduce a small language pair from the latest released Bicleaner models? That way I can be totally sure that I am training the same way you do, and the same way Bitextor does.
It's still a WIP, but the important parts are already there: https://github.com/bitextor/bicleaner/wiki/How-to-train-your-Bicleaner
If you want to reproduce, for example, es-ca, these are the resources I used:
* Dictionaries from: DOGC, GNOME, QED, Tatoeba, OpenSubtitles (all from TMX, 5M lines in total)
* Training corpus: Global Voices + JW300 (93K lines)
* Test corpus: EUbookshop, KDE
The es-ca I trained is already in the Releases page (as a draft).
I reproduced the es-ca model with those corpora, creating a probabilistic dictionary and training a Bicleaner classifier following the Wiki 'How to'. I didn't manually test with EUbookshop and KDE, as the setup is the same as yours.
Also, I checked that Bitextor now produces correct and competitive Bicleaner models, but they are a bit different because the filter applied at the end of probabilistic dictionary creation is different and a bit greedier: instead of using dict_pruner.py to drop those entries whose probability is below one tenth of the maximum one, we filter out any entry below 0.1. This is inherited from the hunalign dictionary creation rule.
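The absolute-threshold variant described above is trivial to express in awk; a minimal illustration with made-up data (not Bitextor's actual rule):

```shell
# Fake dictionary entries: "tgt_word src_word probability".
printf '%s\n' 'hi bonjour 0.50' 'hullo bonjour 0.05' > dict.sample

# Keep only entries whose probability (column 3) is at least 0.1,
# regardless of the per-word maximum.
awk '$3 >= 0.1' dict.sample
# → only "hi bonjour 0.50" remains
```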
To keep this better documented, avoid confusion, and track down similar errors faster, if we upload a Bicleaner model to bitextor-data, we will add the "-bitextor" suffix to indicate that it was generated by Bitextor.
Great, @lpla !
I downloaded two files from https://github.com/bitextor/bitextor-data/releases/tag/bicleaner-v1.1
but they have different formats:
I assume that they have different denominators, so I created a script to check it:
and the output is this:
The max value should be at most 1.0 because it's a probability, so the correct ones are the second and the third. I think en-fr and en-ru use different denominators.
Could you please tell me why these files have different formats, and whether both are correct?
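The kind of check described could be sketched like this (a hedged example; the file names and sample contents are made up to stand in for the downloaded dictionaries, assuming the probability is in the third column):

```shell
# Fake stand-ins for the downloaded dictionaries: one with true
# probabilities, one with unnormalized counts/ratios.
printf '%s\n' 'using utilisaient 0.0431' 'hello bonjour 0.9' > dict-fr.sample
printf '%s\n' 'using foo 12.5' 'hello bar 3.0'               > dict-ru.sample

# Print the maximum value of column 3 for each file and flag values > 1.0.
for f in dict-fr.sample dict-ru.sample; do
  awk -v name="$f" '$3 > max {max = $3}
       END {printf "%s: max=%s %s\n", name, max,
            (max <= 1.0 ? "OK" : "NOT a probability")}' "$f"
done
```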