bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal
GNU General Public License v3.0

Bifixer doesn't work with new ftfy >=6.0 #5

Closed: lpla closed this issue 3 years ago

lpla commented 3 years ago

Running Bifixer through the Bitextor automated tests showed that it does not work with last month's releases of ftfy (>=6.0). This is the error:

(log test 101) rule bifixer:
(log test 101)     input: /home/runner/work/bitextor/bitextor/transient-mto2-en-fr/en_fr/06_02.segalign/0.gz
(log test 101)     output: /home/runner/work/bitextor/bitextor/transient-mto2-en-fr/en_fr/07_01.bifixer/0
(log test 101)     jobid: 26
(log test 101)     wildcards: batch=0
(log test 101) 
(log test 101) 2021-04-13 11:05:37,021 - ERROR - Traceback (most recent call last):
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 242, in <module>
(log test 101)     main(args)  # Running main program
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 234, in main
(log test 101)     perform_fixing(args)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 218, in perform_fixing
(log test 101)     fix_sentences(args)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/bifixer.py", line 144, in fix_sentences
(log test 101)     fixed_source = restorative_cleaning.fix(source_sentence, args.srclang, chars_slang, charsRe_slang, punctChars_slang, punctRe_slang)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/restorative_cleaning.py", line 640, in fix
(log test 101)     ftfy_fixed_text = " ".join([ftfy.fix_text_segment(word, fix_entities=True, uncurl_quotes=False, fix_latin_ligatures=False) for word in text.split()])
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/bitextor/bifixer/bifixer/restorative_cleaning.py", line 640, in <listcomp>
(log test 101)     ftfy_fixed_text = " ".join([ftfy.fix_text_segment(word, fix_entities=True, uncurl_quotes=False, fix_latin_ligatures=False) for word in text.split()])
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/lib/python3.8/site-packages/ftfy/__init__.py", line 537, in fix_text_segment
(log test 101)     config = config._replace(**kwargs)
(log test 101)   File "/home/runner/miniconda3/envs/bitextor-installation/lib/python3.8/collections/__init__.py", line 413, in _replace
(log test 101)     raise ValueError(f'Got unexpected field names: {list(kwds)!r}')
(log test 101) ValueError: Got unexpected field names: ['fix_entities']

It seems that ftfy changed its heuristics, and with them the arguments accepted by the fix_text_segment call. As a result, fix_entities does not work unless ftfy is pinned to version 5.9 (as forced in https://github.com/bitextor/bifixer/commit/931ba2bc1509921a2a36fb5b8691b0aa5f3577c7).
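The last two frames of the traceback can be reproduced without ftfy. This is a minimal sketch, assuming the options live in a named tuple that is updated through `_replace`, as the frames in `ftfy/__init__.py` and `collections/__init__.py` suggest. `Config`, `DEFAULTS` and `update_config` are hypothetical stand-ins, not ftfy's actual code, and the field names are only illustrative.

```python
from collections import namedtuple

# Hypothetical stand-in for an options tuple; this is not ftfy's real config class.
Config = namedtuple("Config", ["unescape_html", "uncurl_quotes", "fix_latin_ligatures"])
DEFAULTS = Config(unescape_html=True, uncurl_quotes=True, fix_latin_ligatures=True)


def update_config(**kwargs):
    """Return DEFAULTS with the given fields overridden via namedtuple._replace."""
    return DEFAULTS._replace(**kwargs)


# Field names that exist in the tuple are accepted.
print(update_config(uncurl_quotes=False))

# A field name that does not exist is rejected by _replace itself.
try:
    update_config(fix_entities=True)
except (ValueError, TypeError) as err:
    # ValueError on Python 3.8 (as in the log); newer versions may raise TypeError.
    print(err)
```

If this reading is right, the error only means that `fix_entities` is not among the field names known to the installed version, not that entity fixing itself was removed.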

The proper solution would be to use the new ftfy calls in Bifixer, so that the fix is future-proof, for example if an urgent version bump is ever needed for security reasons.

ZJaume commented 3 years ago

It's strange: the changelog says that keyword arguments still work, and fix_entities appears in the documentation, so this error should not happen?

ZJaume commented 3 years ago

Anyway, even though the argument has disappeared, ftfy still fixes the entities because all the options are True by default. So omitting the parameter gives the same behaviour as before. The latest commit should fix the issue.
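A minimal sketch of the call once the argument is dropped, based on the list comprehension shown in the traceback. `fix_words` and `strict_fixer` are hypothetical names used only for illustration. `strict_fixer` stands in for a fixer that rejects unknown keyword arguments, so the example runs without ftfy installed. It is not the code from the actual commit.

```python
from typing import Callable


def fix_words(text: str, fix_segment: Callable[..., str]) -> str:
    """Fix a sentence word by word, without passing fix_entities."""
    return " ".join(
        fix_segment(word, uncurl_quotes=False, fix_latin_ligatures=False)
        for word in text.split()
    )


def strict_fixer(word: str, **kwargs) -> str:
    """Hypothetical stand-in: accepts two options and rejects any other."""
    allowed = {"uncurl_quotes", "fix_latin_ligatures"}
    unexpected = sorted(set(kwargs) - allowed)
    if unexpected:
        raise ValueError(f"Got unexpected field names: {unexpected!r}")
    return word  # a real fixer would return the repaired text here


# split() without arguments collapses runs of whitespace between words.
print(fix_words("some  example   sentence", strict_fixer))
```

With ftfy installed, the equivalent call would presumably be `fix_words(sentence, ftfy.fix_text_segment)`, which relies on the library defaults for entity handling instead of naming the option.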