&#9; introduces tabs in tsv output

When running bifixer with the default options (as we do in ParaCrawl), it decodes encoded HTML entities, including encoded tab characters. This breaks the output, since the output is tab-separated.

Test case (with tr appended to make the tab characters visible):

echo -e "a\tb\tc\tFlavor:&#9;salty & light" | bifixer - - en en | tr '\t' '^'

yields (note the ^ in "Flavor:^salty")

a^b^c^Flavor:^salty & light^d39f54ca9c3055f8^1

I suspect html.unescape(..) is to blame (but I didn't test that!). One way to fix it might be to move re.sub(' +', ' ', ..) down a bit, so it runs after the unescaping, and have it cover tabs as well, i.e. re.sub('\s+', ' ', ..). Then you wouldn't need to add the \n to strip() afterwards either.
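
To illustrate the suspicion, here is a minimal sketch using only the Python standard library. It is not bifixer's actual code: the helper name unescape_and_normalize is made up for this example, and the assumption is simply that the whitespace normalization runs after the entity decoding.

```python
import html
import re


def unescape_and_normalize(segment: str) -> str:
    """Hypothetical helper (not part of bifixer): decode HTML entities,
    then collapse every run of whitespace, tabs and newlines included,
    into a single space."""
    decoded = html.unescape(segment)
    # \s+ also matches the tab that "&#9;" decodes to, so the field
    # can no longer contain a TSV column separator.
    return re.sub(r"\s+", " ", decoded).strip()


field = "Flavor:&#9;salty & light"

# Decoding alone turns the entity into a literal tab character.
print(repr(html.unescape(field)))

# Normalizing after decoding replaces that tab with a plain space.
print(repr(unescape_and_normalize(field)))
```

Because \s also matches \n, the final strip() needs no extra arguments in this sketch.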

(Thanks @mksifakis for figuring this out!)

bitextor / bifixer

&#9; introduces tabs in tsv output #4
