bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal
GNU General Public License v3.0

Output file encoding should be set to UTF-8 #19

Closed by rxzhangGH 1 year ago

rxzhangGH commented 1 year ago

Hi,

When using the bifixer command line tool, I noticed an issue with line 53 of bifixer.py:

parser.add_argument('output', type=argparse.FileType('w'), default=sys.stdout, help="Fixed corpus")

Because no encoding is passed to argparse.FileType, the output file is opened with the platform's default encoding rather than UTF-8, and that caused problems for me. I suggest changing the line above to:

parser.add_argument('output', type=argparse.FileType('w', encoding='UTF-8'), default=sys.stdout, help="Fixed corpus")
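To illustrate the suggested change, here is a minimal, self-contained sketch. It is not bifixer's actual code: the helper names (build_parser, write_sample), the temporary file path, and the sample text are hypothetical. It only shows that when encoding='UTF-8' is passed to argparse.FileType, the bytes written to disk are UTF-8 whatever the platform default is.

```python
import argparse
import os
import tempfile


def build_parser():
    # Hypothetical stand-in for the parser in bifixer.py; only the
    # 'output' argument from the issue is reproduced here.
    parser = argparse.ArgumentParser(description="Explicit output encoding sketch")
    parser.add_argument(
        'output',
        type=argparse.FileType('w', encoding='UTF-8'),
        help="Fixed corpus",
    )
    return parser


def write_sample(path, text):
    # Parse the path as if it had been given on the command line,
    # write the text through the file object argparse opened,
    # then read the raw bytes back to inspect the encoding on disk.
    args = build_parser().parse_args([path])
    with args.output as out:
        out.write(text)
    with open(path, 'rb') as raw:
        return raw.read()


# Sample text is an assumption chosen to contain non-ASCII characters.
sample_text = "café ñandú 日本語"
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "fixed.txt")
raw_bytes = write_sample(sample_path, sample_text)
```

With the original FileType('w') call, the same write could raise UnicodeEncodeError or produce differently encoded bytes on systems whose default encoding is not UTF-8.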

ZJaume commented 1 year ago

Could you please post the error? The OS and Python version would also be helpful.

ZJaume commented 1 year ago

Anyway, it's fixed now.