bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

Why do the provided data, en-fr and en-ru, have different formats? #22

Closed: ghost closed this issue 4 years ago

ghost commented 4 years ago

I downloaded two files from https://github.com/bitextor/bitextor-data/releases/tag/bicleaner-v1.1

but they have different formats:

  1. The en-fr dict is sorted by its second column.
  2. The en-fr dict contains raw words, not IDs.
  3. The en-ru dict is sorted by its first column.
  4. The en-ru dict contains IDs, not words.

I assumed that they use different denominators, so I created a script to check it:

# coding: utf-8
import numpy as np

def load_dic(datafile, direction="left"):
    # Group the 3-column dictionary by one column and collect the translation
    # probabilities from the other: "left" groups by column 1, "right" by column 0.
    dic = {}
    with open(datafile) as f:
        for line in f:
            line = line.strip().split()
            if direction == "left":
                col1 = 1
                col2 = 0
            elif direction == "right":
                col1 = 0
                col2 = 1
            if line[col1] not in dic:
                dic[line[col1]] = {}
            # Store P(word in the other column | grouping word) for this pair.
            dic[line[col1]][line[col2]] = float(line[2])
    return dic

def calc_total_prob(dic):
    # For each grouping word, sum the probabilities of all its translations.
    total = []
    for word1, words in dic.items():
        total.append(sum([prob for word2, prob in words.items()]))
    return total

def check(datafile, direction):
    dic = load_dic(datafile, direction)
    total = calc_total_prob(dic)
    print("max", max(total))
    print("min", min(total))
    print("mean", np.mean(total))
    print()

print("[check bicleaner provided data]")
print("en-fr (assuming that the denominator is count of col 0")
check("en-fr/dict-en", "right")

print("en-fr (assuming that the denominator is count of col 1")
check("en-fr/dict-en", "left")

print("en-ru (assuming that the denominator is count of col 0")
check("en-ru/en.dic", "right")

print("en-fr (assuming that the denominator is count of col 1")
check("en-ru/en.dic", "left")

and the output is this:

en-fr (assuming that the denominator is the count of col 0)
max 5348.389211700034
min 0
mean 0.27679697246372525

en-fr (assuming that the denominator is the count of col 1)
max 0.9928028000000019
min 0
mean 0.40413576427989834

en-ru (assuming that the denominator is the count of col 0)
max 1.000002
min 0
mean 0.8213153790770497

en-ru (assuming that the denominator is the count of col 1)
max 4569.333056612812
min 0
mean 2.3405763509305317

The max value should not exceed 1.0 because it is a probability, so the correct interpretations are the second and the third ones. In other words, the en-fr dict seems to be normalized by the count of column 1 while the en-ru dict is normalized by the count of column 0, so they use different denominators, I think.

Could you please tell me why these files have different formats, and whether both are correct?

mbanon commented 4 years ago

Hi @sugiyamath ! Thank you for letting us know about this issue. After inspecting the data, we found out that the en-ru language pack is completely wrong (probably from an older version of bicleaner) and shouldn't be there, so it has just been removed. The correct format is the one in en-fr; for example, in dict-fr.gz:

using utilisaient 0.0431034

the probability of "utilisaient" (fr) being translated as "using" (en) is roughly 0.04.
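A quick way to sanity-check this (just a minimal sketch, assuming the same whitespace-separated three-column layout) is to verify that the probabilities of all English candidates for a given French word sum to roughly 1:

import gzip
from collections import defaultdict

# Sum P(en | fr) over all English candidates, grouping by the French word
# in the second column.
totals = defaultdict(float)
with gzip.open("dict-fr.gz", "rt", encoding="utf-8") as f:
    for line in f:
        en, fr, prob = line.split()
        totals[fr] += float(prob)

# Flag any French word whose translation probabilities clearly exceed 1.
for fr, total in totals.items():
    if total > 1.01:
        print(fr, total)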

Sorry for any inconvenience, and please let us know if you have any other questions.

ghost commented 4 years ago

Hi, @mbanon I have another question about bicleaner on bitextor. According to bitextor's Snakefile, it uses mgiza to create a probabilistic dictionary:

rule mgiza:
    input:
        vcb1="{prefix}.{l1}.vcb",
        vcb2="{prefix}.{l2}.vcb",
        snt="{prefix}.{l2}-{l1}-int-train.snt",
        cooc="{prefix}.{l2}-{l1}.cooc"
    output:
        "{prefix}.{l2}-{l1}.t3.final"
    shell:
        "{PROFILING} {BITEXTOR}/mgiza/mgizapp/bin/mgiza -ncpus 8 -CoocurrenceFile {input.cooc} -c {input.snt} -m1 5 -m2 0 -m3 3 -m4 3 -mh 5 -m5 0 -model1dumpfrequency 1 -o {wildcards.prefix}.{wildcards.l2}-{wildcards.l1} -s {input.vcb1} -t {input.vcb2} -emprobforempty 0.0 -probsmooth 1e-7 2> /dev/null > /dev/null"

...

rule bicleaner_train_model:
    input:
        corpusl1=expand("{dataset}.{lang}.xz", dataset=bicleanerTrainPrefixes, lang=LANG1),
        corpusl2=expand("{dataset}.{lang}.xz", dataset=bicleanerTrainPrefixes, lang=LANG2),
        t3_1="{dir}/corpus.{l1}-{l2}.t3.final".format(dir=mgizaModelDir, l1=LANG1, l2=LANG2),
        t3_2="{dir}/corpus.{l2}-{l1}.t3.final".format(dir=mgizaModelDir, l1=LANG1, l2=LANG2)
    output:
        "{model}".format(model=BICLEANER_CONFIG)
    priority: 40

    shell:
        "training=$(mktemp {TMPDIR}/train.XXXXXXXX); "
        "paste <(xzcat -f {input.corpusl1}) <(xzcat -f {input.corpusl2}) > $training; "
        "DIR=$(dirname {BICLEANER_CONFIG}); "
        "echo $DIR; "
        "cp {input.t3_1} $DIR/{LANG1}.dic; "
        "cp {input.t3_2} $DIR/{LANG2}.dic; "
        "gzip $DIR/{LANG1}.dic $DIR/{LANG2}.dic; "
        "lines=$(cat $training | wc -l); "
        "trainlines=$(echo \"$lines*4/10\" | bc); "
        "testlines=$(echo \"($lines-2*$trainlines)/2\" | bc); "
        '{PROFILING} python3  {BITEXTOR}/bicleaner/bicleaner/bicleaner_train.py $training -S "{WORDTOK1}" -T "{WORDTOK2}" --treat_oovs --normalize_by_length -s {LANG1} -t {LANG2} -d $DIR/{LANG1}.dic.gz -D $DIR/{LANG2}.dic.gz -c $DIR/{LANG1}-{LANG2}.classifier -g $trainlines -w $trainlines --good_test_examples $testlines --wrong_test_examples $testlines -m {BICLEANER_CONFIG} --classifier_type random_forest; '
        "rm $training"

https://github.com/bitextor/bitextor/blob/master/snakemake/Snakefile

corpus.{l2}-{l1}.t3.final is a dictionary, but I think it is in the same format as the en-ru data (IDs instead of words).

Does it work correctly?

mbanon commented 4 years ago

Hi again, @sugiyamath ! To be honest, I have never trained Bicleaner using Snakemake (I always train Bicleaner beforehand), so I cannot guarantee that it works. Probably our Bitextor friends in charge of the Snakefile can shed some light on this matter (@lpla , @zuny26 , @mespla :wave: )

This is my recipe to build the probabilistic dictionaries from scratch:

  1. You need a corpus. I usually download different corpora from Opus and concatenate them. From now on, I will call this corpus "bigcorpus.en-is" (also assuming I am building English-Icelandic).

  2. Split source (en) and target (is) sides:

    cat bigcorpus.en-is | cut -f1 > bigcorpus.en-is.en
    cat bigcorpus.en-is | cut -f2 > bigcorpus.en-is.is
  3. Tokenize with Moses:

    /home/mbanon/bitextor/preprocess/moses/tokenizer/tokenizer.perl -l en < bigcorpus.en-is.en > bigcorpus.en-is.tok.en
    /home/mbanon/bitextor/preprocess/moses/tokenizer/tokenizer.perl -l is < bigcorpus.en-is.is > bigcorpus.en-is.tok.is
  4. Lowercase the whole tokenized thing:

    tr '[:upper:]' '[:lower:]' < bigcorpus.en-is.tok.en > bigcorpus.en-is.tok.low.en
    tr '[:upper:]' '[:lower:]' < bigcorpus.en-is.tok.is > bigcorpus.en-is.tok.low.is
  5. Change the names of the tokenized-lowercased corpus (this is required by mgiza):

    cp bigcorpus.en-is.tok.low.en bigcorpus.en-is.clean.en
    cp bigcorpus.en-is.tok.low.is bigcorpus.en-is.clean.is
  6. Run Mgiza in this obscure way:

    /home/mbanon/mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir /home/mbanon/paracrawl-newlangs/en-is/ --corpus /home/mbanon/paracrawl-newlangs/en-is/bigcorpus.en-is.clean  -f en -e is --mgiza -mgiza-cpus=16 --parallel --first-step 1 --last-step 4 --external-bin-dir /home/mbanon/mgiza/mgizapp/bin/

    The important parts here, to obtain the prob dicts, are the "first step" and "last step" parameters.

  7. Change the column order in model/lex.e2f and model/lex.f2e:

    cd model; awk '{print $2" " $1" " $3}' lex.e2f > lex.e2f_2 && mv lex.e2f_2 lex.e2f && awk '{print $2" " $1" " $3}' lex.f2e > lex.f2e_2 && mv lex.f2e_2 lex.f2e
  8. At this point, you have lex.e2f (prob dict en-is) and lex.f2e (is-en), fully compatible with Bicleaner, but you probably want to make them "lighter" by removing very uncommon translations (say, those with a probability more than 10 times lower than the maximum one); a rough sketch of such a filter follows below.
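This is not the actual dict_pruner.py, just a minimal sketch of that kind of filtering, assuming the final "word1 word2 prob" format with probabilities grouped by the word in the second column:

import sys
from collections import defaultdict

# Keep, for each word in the second column, only the translations whose
# probability is at least one tenth of that word's best translation probability.
entries = defaultdict(list)
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        w1, w2, prob = line.split()
        entries[w2].append((w1, float(prob)))

for w2, translations in entries.items():
    threshold = max(p for _, p in translations) / 10.0
    for w1, p in translations:
        if p >= threshold:
            print(w1, w2, p)

You would run it as, say, python3 prune_dict.py lex.f2e > lex.f2e.pruned (the script name is just an example).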

And that's it, I hope it helps you! I really want to write a full tutorial on training Bicleaner from scratch; hopefully it will happen soon :)

lpla commented 4 years ago

Is there a way to obtain lex.e2f and lex.f2e without having Moses as a dependency in Bitextor? It looks like we can use the "corpus.SL-TL.t3.final" and "corpus.??.vcb" files from mgiza to obtain the prob dicts. Also, this way, we can customize the filtering of uncommon translations on the fly.
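Something along these lines might work (a rough sketch only, assuming the usual GIZA/mgiza layouts: "id word count" lines in the .vcb files and "source_id target_id probability" lines in the t3 table; the file names below are placeholders):

# Map mgiza's id-based t3 table back to words using the .vcb vocabularies.
# Assumes the first id column corresponds to the -s vocabulary and the
# second to the -t vocabulary.
def load_vcb(path):
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, word, _count = line.split()
            vocab[idx] = word
    return vocab

src_vocab = load_vcb("corpus.l1.vcb")
trg_vocab = load_vcb("corpus.l2.vcb")

with open("corpus.l2-l1.t3.final", encoding="utf-8") as t3, \
        open("prob.dic", "w", encoding="utf-8") as out:
    for line in t3:
        src_id, trg_id, prob = line.split()
        # Id 0 is GIZA's NULL word and has no entry in the .vcb files.
        src = src_vocab.get(src_id, "NULL")
        trg = trg_vocab.get(trg_id, "NULL")
        out.write(f"{src} {trg} {prob}\n")

The resulting file would still need the column order and pruning that Bicleaner expects, as in the recipe above.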

mbanon commented 4 years ago

@lpla No idea, never tried before :woman_shrugging:

lpla commented 4 years ago

Okay, we found out that we can do that in an already implemented rule in Bitextor. But one more question: must the probabilistic dictionaries be generated from different corpora than the one used for Bicleaner model training?

mbanon commented 4 years ago

It's not mandatory. The more lines in the corpus used for dictionaries, the better (I use corpora of around a few million lines). And for the training corpus, the cleaner the better (I use around 100K lines). So you probably want a bigger corpus for the dictionaries, even if it's "not as clean" as the training one.

lpla commented 4 years ago

I think I fixed the dictionary creation rule in Bitextor, which should now output a proper Russian Bicleaner model instead of the one uploaded previously, so this fixes @sugiyamath's initial issue in general. I just implemented it in https://github.com/bitextor/bitextor/commit/9007c5e15f045b3e6fd520330022ad97b7a7ac2a

But, before releasing this in a new Bitextor version, I would like to test it properly. @mbanon, could you provide me with the original resources and/or the filters you applied, so I can manually reproduce a small language pair from the latest released Bicleaner models? That way I can be totally sure that I am training the same way you do, and that Bitextor does too.

mbanon commented 4 years ago

It's still a WIP, but the important parts are already there: https://github.com/bitextor/bicleaner/wiki/How-to-train-your-Bicleaner

If you want to reproduce, for example, es-ca, these are the resources I used:

* Dictionaries from: DOGC, GNOME, QED, Tatoeba, OpenSubtitles (all from TMX, 5M lines in total)
* Training corpus: Global Voices + JW300 (93K lines)
* Test corpus: EUbookshop, KDE

The es-ca model I trained is already on the Releases page (as a draft).

lpla commented 4 years ago

I reproduced the es-ca model with those corpora to create a probabilistic dictionary and train a Bicleaner classifier, following the wiki 'How to'. I didn't manually test with EUbookshop and KDE, as that part is the same as yours.

Also, I checked that Bitextor now produces correct and competitive Bicleaner models, but they are a bit different because the filter applied at the end of probabilistic dictionary creation is different and a bit more aggressive (instead of using dict_pruner.py to drop entries whose probability is more than 10 times lower than the maximum one, we drop any entry below 0.1). This is inherited from the hunalign dictionary creation rule.

To keep this better documented, avoid confusion, and track down similar errors faster, whenever we upload a Bicleaner model to bitextor-data we will add the "-bitextor" suffix to indicate that it was generated by Bitextor.

mbanon commented 4 years ago

Great, @lpla !