bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

Improve debug mode #27

Closed ZJaume closed 4 years ago

ZJaume commented 4 years ago

Here is an example of the output :)

2019-12-13 12:36:29,022 - DEBUG - Feature importances: {
    "lang1": 0.005841166330305325,
    "lang2": 0.002177733140099974,
    "length1": 0.0018393377109550614,
    "length2": 0.006895309112765546,
    "npunct1": 1.6160376715444919e-06,
    "narabic1": 0.0,
    "ngreek1": 0.0,
    "nhan1": 0.0,
    "nhangul1": 0.0,
    "nhebrew1": 0.0,
    "nlatin1": 0.0001280386126467906,
    "nlatin_e_a1": 0.0,
    "nlatin_e_b1": 0.0,
    "nlatin_sup1": 0.0,
    "nbasic_latin1": 0.00018371815221868677,
    "nhiragana1": 0.0,
    "nkatakana1": 0.0,
    "ncyrillic1": 0.0,
    "ndevanagari1": 0.0,
    "nmalayalam1": 0.0,
    "nother1": 0.0,
    "npunct2": 0.0,
    "narabic2": 0.0,
    "ngreek2": 0.0,
    "nhan2": 0.0,
    "nhangul2": 0.0,
    "nhebrew2": 0.0,
    "nlatin2": 0.00012613284449739263,
    "nlatin_e_a2": 0.0,
    "nlatin_e_b2": 0.0,
    "nlatin_sup2": 0.0,
    "nbasic_latin2": 0.0008568029659436135,
    "nhiragana2": 0.0,
    "nkatakana2": 0.0,
    "ncyrillic2": 0.0,
    "ndevanagari2": 0.0,
    "nmalayalam2": 0.0,
    "nother2": 0.0,
    "ndif1": 0.001523494204428415,
    "freq11": 0.001406614602985517,
    "freq21": 0.005332464895349358,
    "freq31": 0.0005020597955617962,
    "ent1": 6.155643282831637e-05,
    "maxrep1": 6.32788337038493e-05,
    "maxword1": 3.0424868144495794e-05,
    "ndif2": 0.000159781805126115,
    "freq12": 0.0023929492728492004,
    "freq22": 5.5103078751242024e-06,
    "freq32": 0.0007594188450948744,
    "ent2": 0.00024451987143109347,
    "maxrep2": 0.0,
    "maxword2": 0.0002850152431698259,
    "ntok1": 0.0043459612087610976,
    "ntok2": 0.0024595708037800862,
    "poisson1": 0.10217937463355964,
    "poisson2": 0.0805713587249015,
    "qmax1": 0.12493773189954763,
    "cov11": 0.0006918506559960689,
    "cov21": 0.16960520328783918,
    "qmax2": 0.16127769443900902,
    "cov12": 0.001726363470258008,
    "cov22": 0.1430992527845473,
    "avg_tok_l1": 6.54267639348768e-05,
    "avg_tok_l2": 1.3524667740810183e-05,
    "npunct_tok1": 0.0019308474253064679,
    "npunct_tok2": 1.6583716874251042e-05,
    "comma1": 0.0012449941009143456,
    "period1": 0.003476725194894204,
    "semicolon1": 0.0,
    "colon1": 0.0,
    "doubleapos1": 0.0,
    "quot1": 0.0017037327582153077,
    "slash1": 0.0,
    "comma2": 0.001183466761656478,
    "period2": 0.002634447897066741,
    "semicolon2": 0.0,
    "colon2": 8.156491048558853e-06,
    "doubleapos2": 0.0,
    "quot2": 0.0016575005052045452,
    "slash2": 0.0,
    "numeric_expr1": 0.003239605864387509,
    "numeric_proportion_preserved1": 0.03808513632043931,
    "numeric_expr2": 8.994963011227935e-05,
    "numeric_proportion_preserved2": 0.02492141792729444,
    "uppercase1": 0.0003386449654141953,
    "capital_proportion_preserved1": 0.05510153750111379,
    "uppercase2": 0.00022487097765143586,
    "capital_proportion_preserved2": 0.04235212473687902
}
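A minimal sketch of how a log entry shaped like the one above could be produced. The names `feature_names`, `importances`, and `format_importances` are hypothetical and not taken from the Bicleaner code; the values are rounded stand-ins for what a fitted classifier would report.

```python
import json
import logging

# Hypothetical stand-ins: in practice the names would come from the feature
# extractor and the values from the trained classifier.
feature_names = ["lang1", "lang2", "poisson1", "cov21"]
importances = [0.0058, 0.0022, 0.1022, 0.1696]


def format_importances(names, values):
    """Pair each feature name with its importance and render as indented JSON."""
    if len(names) != len(values):
        raise ValueError("names and values must have the same length")
    return json.dumps(dict(zip(names, values)), indent=4)


# Same "timestamp - LEVEL - message" layout as the log line shown above.
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.DEBUG,
)
logging.debug("Feature importances: %s", format_importances(feature_names, importances))
```

Pairing names with values before logging is what makes the output readable: a bare array of floats would force the reader to count positions to find a given feature.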

mbanon commented 4 years ago

Great, I will add it for next version :+1:

ZJaume commented 4 years ago

The lite version of the trainer was missing.

ZJaume commented 4 years ago

Sorting features requires Python 3.6+. If that requirement is a problem, I could sort the features manually.
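The thread does not say what exactly depends on Python 3.6+. One plausible reading, labeled here as an assumption, is that the code relies on `dict` keeping insertion order (an implementation detail in CPython 3.6, guaranteed from 3.7). A rough sketch of a "manual" alternative that does not depend on that behaviour, using a hypothetical helper `sort_by_importance`:

```python
from collections import OrderedDict


def sort_by_importance(importances):
    """Return the features ordered from most to least important.

    OrderedDict keeps the order explicitly, so the result does not rely on
    the ordering behaviour of plain dicts in any particular Python version.
    """
    pairs = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return OrderedDict(pairs)


# Hypothetical, rounded values for illustration only.
ranked = sort_by_importance({"lang1": 0.0058, "cov21": 0.1696, "qmax2": 0.1613})
for name, value in ranked.items():
    print(name, value)
```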

ZJaume commented 4 years ago

I noticed that the code in this PR does nothing in 0.13, because debug is explicitly disabled during training (see commit d49babd9012c5c603c533cb1dc6e5ba8c0e9b61d).

Edit: We could log it at INFO level instead, or report the feature importances after training.
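To illustrate the difference, here is a small sketch using only the standard `logging` module. The logger name `trainer_sketch` and the function `report_importances` are hypothetical and not part of Bicleaner; the logger level is set to INFO to mimic a run where debug output is disabled.

```python
import logging

logging.basicConfig(format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger("trainer_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)  # mimics training with debug disabled


def report_importances(importances, level=logging.INFO):
    """Log the importances at `level`; return whether the record was emitted."""
    if not logger.isEnabledFor(level):
        return False
    logger.log(level, "Feature importances: %s", importances)
    return True


# A DEBUG report is dropped while debug is disabled; an INFO report is kept.
report_importances({"poisson1": 0.1022}, level=logging.DEBUG)
report_importances({"poisson1": 0.1022}, level=logging.INFO)
```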

ZJaume commented 4 years ago

It can be merged :smile: