When running bifixer with the default options (as we do in ParaCrawl), it decodes encoded HTML entities, including encoded tab characters. This breaks the output, since the output is tab-separated.
Test case (with tr appended to make the tab characters visible):
echo -e "a\tb\tc\tFlavor:&#9;salty & light" | bifixer - - en en | tr '\t' '^'
yields (note the ^ in "Flavor:^salty"):
a^b^c^Flavor:^salty & light^d39f54ca9c3055f8^1
I suspect html.unescape(..) is to blame (but I didn't test that!). One way to fix it might be to move re.sub(' +', ' ', ..) down a bit, so that it runs after the unescaping, and to include tabs in it as well, i.e. re.sub(r'\s+', ' ', ..) (and then you don't need to add the \n to strip() afterwards anymore either).
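To illustrate the suspected cause and the proposed fix, here is a minimal sketch in plain Python. It is not bifixer's actual code: `clean_field` is a hypothetical helper, and the sketch assumes the encoded tab is written as `&#9;`. It only shows that `html.unescape` turns that entity into a literal tab, and that collapsing whitespace after unescaping removes it again.

```python
import html
import re


def clean_field(text: str) -> str:
    """Hypothetical helper (not part of bifixer): unescape, then normalize whitespace."""
    unescaped = html.unescape(text)  # "&#9;" becomes a literal tab here
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    # Doing this AFTER unescaping is what keeps stray tabs out of a TSV field.
    return re.sub(r"\s+", " ", unescaped).strip()


field = "Flavor:&#9;salty &amp; light"

# The suspected culprit: unescaping alone introduces a tab into the field.
print("\t" in html.unescape(field))

# The proposed ordering: the tab is replaced by a single space.
print(clean_field(field))
```

Because `\s` also matches newlines, a plain `strip()` afterwards is enough, which matches the remark about no longer needing `\n` in `strip()`.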
(Thanks @mksifakis for figuring this out!)