bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal
GNU General Public License v3.0

&#9; introduces tabs in tsv output #4

Closed jelmervdl closed 3 years ago

jelmervdl commented 3 years ago

When running bifixer with the default options (as we do in Paracrawl), it decodes encoded HTML entities, including encoded tab characters. This breaks the output, because the output is tab-separated.

Test case (with tr appended to make the tab characters visible):

echo -e "a\tb\tc\tFlavor:&#9;salty & light" | bifixer - - en en | tr '\t' '^'

yields (note the ^ in "Flavor:^salty")

a^b^c^Flavor:^salty & light^d39f54ca9c3055f8^1

I suspect html.unescape(..) is to blame (but I didn't test that!). One way to fix it might be to move re.sub(' +', ' ', ..) down a bit so that it runs after the unescaping, and to have it cover tabs as well, i.e. re.sub(r'\s+', ' ', ..). Then you no longer need to pass \n to strip() afterwards either.
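A minimal sketch of that idea, using only the standard library. The function name normalize_segment is made up for illustration and is not taken from the bifixer code base; it only shows the ordering (unescape first, collapse whitespace second), not how bifixer actually structures its fixing steps.

```python
import html
import re


def normalize_segment(text: str) -> str:
    """Decode HTML entities, then collapse every whitespace run to one space.

    Hypothetical helper for illustration; not part of bifixer.
    """
    # html.unescape turns "&#9;" into a literal tab and "&amp;" into "&".
    decoded = html.unescape(text)
    # r"\s+" matches tabs and newlines as well as spaces, so a freshly
    # decoded tab cannot survive as an extra TSV separator.
    # strip() with no arguments then removes any leading/trailing space.
    return re.sub(r"\s+", " ", decoded).strip()


fixed = normalize_segment("Flavor:&#9;salty &amp; light\n")
print(fixed)  # Flavor: salty & light
```

Because the whitespace collapse runs last, it does not matter which earlier step produced the tab.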

(Thanks @mksifakis for figuring this out!)

ZJaume commented 3 years ago

The problem comes from ftfy: while fixing the Unicode text, it decodes the entity and replaces it with a tab. We are still thinking about how to deal with this.

https://www.codetable.net/decimal/9
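One possible way to guard against this regardless of which library does the decoding is to sanitize each column after it has been fixed, so the number of columns cannot change. The sketch below is an assumption about how such a guard could look, not the fix that was adopted; fix_columns and fake_fixer are hypothetical names, and fake_fixer only stands in for whatever text-fixing step (for example ftfy) is actually used.

```python
from typing import Callable


def fix_columns(line: str, fixer: Callable[[str], str]) -> str:
    """Apply `fixer` to each TSV field and keep the column count stable.

    Hypothetical helper for illustration; not part of bifixer.
    """
    fields = line.rstrip("\n").split("\t")
    cleaned = []
    for field in fields:
        fixed = fixer(field)
        # Any tab, carriage return or newline the fixer introduced would
        # corrupt the TSV layout, so replace each one with a plain space.
        for separator in ("\t", "\r", "\n"):
            fixed = fixed.replace(separator, " ")
        cleaned.append(fixed)
    return "\t".join(cleaned)


def fake_fixer(text: str) -> str:
    # Stand-in for the real fixing step: decodes only the tab entity.
    return text.replace("&#9;", "\t")


result = fix_columns("a\tb\tc\tFlavor:&#9;salty\n", fake_fixer)
print(result.replace("\t", "^"))  # a^b^c^Flavor: salty
```

The input has four columns and the output still has four, because the decoded tab was turned into a space before the fields were joined again.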