bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

Probabilistic dictionaries #58

Closed hlhlpfp closed 3 years ago

hlhlpfp commented 3 years ago

After I run this command to create the probabilistic dictionaries:

mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir /your/working/directory --corpus bigcorpus.en-fr.clean -e en -f fr --mgiza -mgiza-cpus=16 --parallel --first-step 1 --last-step 4 --external-bin-dir /your/path/here/mgiza/mgizapp/bin/

where bigcorpus.en-fr.clean is the input file (it contains English and French sentences separated by a tab), the output looks very strange to me. It looks like this:

article NULL 0.0000005
copier NULL 0.0000000
trochlee NULL 0.0000001
schreber NULL 0.0000000

It is really strange because I was expecting this kind of result:

rediscover enduruppgötva 0.3333333
rediscover verðskuldið 0.1250000

(as in your example)

ZJaume commented 3 years ago

What do you mean? Are all of your dictionary entries NULL? NULL entries are expected behaviour unless something like ~60% or more of the entries are NULL. The example shows actual word alignments because that is the relevant information, not because the dictionary isn't supposed to have NULLs.

Your probabilities look fine to me. Those entries say that those words have a very low probability of being translated as NULL, which is expected for nouns.
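To put a number on the "~60% or more" rule of thumb above, here is a minimal Python sketch that measures what share of dictionary entries involve NULL. It assumes each line has three whitespace-separated fields (two words and a probability), as in the samples quoted in this thread; the function name `null_fraction` is made up for illustration and is not part of bicleaner or Moses.

```python
from typing import Iterable


def null_fraction(lines: Iterable[str]) -> float:
    """Return the share of entries in which either word is NULL.

    Assumes 'word word probability' lines; anything else is skipped.
    """
    total = 0
    with_null = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        total += 1
        if "NULL" in fields[:2]:
            with_null += 1
    return with_null / total if total else 0.0


# Entries copied from this thread, used only as sample input.
sample = [
    "rediscover enduruppgötva 0.3333333",
    "rediscover verðskuldið 0.1250000",
    "article NULL 0.0000005",
    "copier NULL 0.0000000",
]
print(f"{null_fraction(sample):.0%} of entries involve NULL")
```

For a real dictionary, pass an open file handle instead of the list, for example `null_fraction(open("lex.e2f", encoding="utf-8"))`, so the file is streamed line by line.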

hlhlpfp commented 3 years ago

@ZJaume every entry in the probabilistic dictionaries includes NULL, and they all look like this:

lex.e2f:

NULL 23.152 1.0000000
NULL pp.31 1.0000000
NULL jandu 1.0000000
NULL journalist-informant 1.0000000
NULL corn-based 1.0000000
NULL foundedness 1.0000000
NULL seney 1.0000000
NULL 114295 1.0000000
NULL 438.25 1.0000000
NULL 17,333.51 1.0000000
NULL scrupulousness 1.0000000
NULL argentinas 1.0000000
NULL p.72 1.0000000
NULL ratio-integrated 1.0000000
NULL 1983-12-16 1.0000000
NULL 1990-09-27 1.0000000
NULL 5subsequent 1.0000000
NULL scoreboard-sized 1.0000000
NULL 2,152,161 1.0000000
NULL irpr.5 1.0000000
NULL ft 1.0000000
NULL 0718-332 1.0000000
NULL property-holders 1.0000000
NULL 184,586 1.0000000
NULL inter-occupational 1.0000000
NULL imos 1.0000000
NULL singers 1.0000000
NULL kareemullah 1.0000000
NULL rabanillo 1.0000000
NULL influencial 1.0000000
NULL 3904.30.10 1.0000000
NULL parrales 1.0000000
NULL 105231 1.0000000
NULL 2712.90.90 1.0000000
NULL lemky 1.0000000
NULL 2.3this 1.0000000
NULL ice-tech 1.0000000
NULL 45,050 1.0000000
NULL hoeschen 1.0000000
NULL ezcurra 1.0000000
NULL slaughtered 1.0000000
NULL 147copy 1.0000000
NULL r56 1.0000000
NULL pillowslips 1.0000000
NULL 91702 1.0000000
NULL kidal 1.0000000
NULL 0607on0087 1.0000000

and lex.f2e:

NULL q.le 1.0000000
NULL gd7-29-39 1.0000000
NULL article2667 1.0000000
NULL schréber 1.0000000
NULL copier 1.0000000
NULL triange 1.0000000
NULL coton-tiges 1.0000000
NULL ac3 1.0000000
NULL 0,5321 1.0000000
NULL térébenthines 1.0000000
NULL 2-452 1.0000000
NULL 38.1réduction 1.0000000
NULL trochlée 1.0000000
NULL féculent 1.0000000
NULL 9024.10 1.0000000
NULL 11.1inapplication 1.0000000
NULL 386694 1.0000000
NULL 193demandes 1.0000000
NULL 813,22 1.0000000
NULL 5pratiques 1.0000000
NULL sukhvir 1.0000000
NULL 149,968.30 1.0000000
NULL 670.8 1.0000000

All the rows look like this. Do you think this problem is related to the input file requirements and, if so, what are the requirements for bigcorpus.en-fr.clean? Your suggestions would help me a lot. Thank you.

ZJaume commented 3 years ago

This is probably a problem with your file. Are you sure that it is correctly formatted as a tab-separated file (TSV)?

hlhlpfp commented 3 years ago

@ZJaume Sorry to ask these questions, but which command do you suggest for creating tab-separated files (Linux/Python)? I have used this:

paste -d"\t" one.txt two.txt > newfile.txt

but it does not seem to work correctly :/ It is a huge file with 10 million sentences.

ZJaume commented 3 years ago

paste is nice, but tab is already its default delimiter, so you don't need to specify it. Did you check that both files have the same number of lines, and that there are no tabs in the middle of sentences and no empty sentences?
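The three checks above (equal line counts, no embedded tabs, no empty sentences) could be run on the two input files before pasting them. The following is only a sketch in Python: `check_sides` is a hypothetical helper name, and one.txt / two.txt are the file names from the earlier comment. It reads both files in step, so it should cope with a 10-million-line corpus without loading it into memory.

```python
from itertools import zip_longest
from typing import Iterable, List


def check_sides(src: Iterable[str], tgt: Iterable[str]) -> List[str]:
    """Compare two line-aligned files and return a list of problems found."""
    problems = []
    for n, (s, t) in enumerate(zip_longest(src, tgt), start=1):
        if s is None or t is None:
            # One file ran out of lines before the other.
            problems.append(f"line {n}: files have different numbers of lines")
            break
        for name, text in (("source", s), ("target", t)):
            text = text.rstrip("\r\n")
            if "\t" in text:
                problems.append(f"line {n}: tab inside {name} sentence")
            if not text.strip():
                problems.append(f"line {n}: empty {name} sentence")
    return problems


if __name__ == "__main__":
    import os

    # File names taken from the paste command above; adjust as needed.
    if os.path.exists("one.txt") and os.path.exists("two.txt"):
        with open("one.txt", encoding="utf-8") as a, open("two.txt", encoding="utf-8") as b:
            for problem in check_sides(a, b):
                print(problem)
```

An empty result means the two files can be joined with plain `paste one.txt two.txt` and every output line will contain exactly one tab.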

ZJaume commented 3 years ago

Sorry, but I can't see anything relevant in that example. You have to ensure that the two sides of the translation are separated by a tab and that there are no other tabs in the text.
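One way to verify that requirement on the already merged file is to confirm that every line splits into exactly two non-empty fields on the tab character. This is a rough Python sketch, not something taken from bicleaner; `bad_lines` is an illustrative name and newfile.txt is the output file from the paste command earlier in the thread.

```python
from typing import Iterable, Iterator, Tuple


def bad_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, reason) for each line that is not 'source<TAB>target'."""
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2:
            # Zero tabs gives 1 field; extra tabs give 3 or more.
            yield n, f"expected 2 tab-separated fields, found {len(fields)}"
        elif not fields[0].strip() or not fields[1].strip():
            yield n, "empty side"


if __name__ == "__main__":
    import os

    # newfile.txt is the name used earlier in this thread; adjust as needed.
    if os.path.exists("newfile.txt"):
        with open("newfile.txt", encoding="utf-8") as fh:
            for number, reason in bad_lines(fh):
                print(f"{number}: {reason}")
```

If this prints nothing, the file is at least structurally a two-column TSV, so any remaining problem lies elsewhere.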