bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal
GNU General Public License v3.0

Long sentences are not being removed apparently #17

Closed by cgr71ii 1 year ago

cgr71ii commented 1 year ago

Hi!

Both monofixer and bifixer should remove long sentences, i.e. those with more than 5000 words:

https://github.com/bitextor/bifixer/blob/1a91e3eb47b2c4e7de9d6812fe25ed1bd5f4e9d4/bifixer/monofixer.py#L195
https://github.com/bitextor/bifixer/blob/1a91e3eb47b2c4e7de9d6812fe25ed1bd5f4e9d4/bifixer/bifixer.py#L215

However, this does not seem to be working:

pip3 install bifixer==0.8.3

# monofixer

python -c "print('asd'); print(' '.join(['a']*6000)); print('asd')" \
  | monofixer --scol 1 --ignore_duplicates  -q - - es \
  | wc -w
# 6002

python -c "print('asd'); print(' '.join(['a']*6000)); print('asd')" \
  | monofixer --scol 1 --ignore_duplicates --ignore_long -q - - es \
  | wc -w
# 6002

# bifixer

python -c "print('asd\tasd'); print('asd\t' + ' '.join(['a']*6000)); print('asd\tasd')" \
  | bifixer --scol 1 --tcol 2 --ignore_duplicates  -q - - en es \
  | wc -w
# 6005

python -c "print('asd\tasd'); print('asd\t' + ' '.join(['a']*6000)); print('asd\tasd')" \
  | bifixer --scol 1 --tcol 2 --ignore_long --ignore_duplicates  -q - - en es \
  | wc -w
# 6005

Am I doing something wrong?

Thank you!

mbanon commented 1 year ago

Long sentences are not removed; they are just ignored (not processed, but still written to the output).

The documentation is incorrect on this point. I'm fixing it.
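
To make the distinction concrete, here is a minimal Python sketch of "ignored" versus "removed". It is not bifixer's actual code: `is_long`, `fix`, `process_ignoring_long`, and `process_removing_long` are hypothetical names, and the 5000-word threshold is simply the figure quoted earlier in this thread, assumed here rather than read from the tool.

```python
MAX_WORDS = 5000  # assumption: threshold as quoted in this thread


def is_long(line: str, max_words: int = MAX_WORDS) -> bool:
    """Return True when the line has more than max_words whitespace-separated words."""
    return len(line.split()) > max_words


def fix(line: str) -> str:
    """Hypothetical stand-in for the real fixing steps; here it only trims whitespace."""
    return line.strip()


def process_ignoring_long(lines):
    """Long lines skip the fixing step but are still emitted unchanged."""
    return [line if is_long(line) else fix(line) for line in lines]


def process_removing_long(lines):
    """Long lines are dropped from the output entirely."""
    return [fix(line) for line in lines if not is_long(line)]


if __name__ == "__main__":
    sample = ["asd", " ".join(["a"] * 6000), "asd"]
    for process in (process_ignoring_long, process_removing_long):
        words = sum(len(line.split()) for line in process(sample))
        print(process.__name__, words)
```

With the same three-line input used in the report, the "ignoring" variant keeps all 6002 words, which matches the `wc -w` counts above, while the "removing" variant keeps only 2. If long lines really must be dropped, a separate filter along the lines of `process_removing_long` could be applied before or after the tool.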

cgr71ii commented 1 year ago

Oh! Ok, thank you!