Helsinki-NLP / Tatoeba-Challenge

Encoding issue #31

Closed 106AbdulBasit closed 1 year ago

106AbdulBasit commented 1 year ago

I downloaded the test text file from the link below; it is the test set for eng-urdu (English-Urdu):

link

The file contains pairs like the following:

Franco has blue jeans. فرانکو کے پاس نیلی جینز ہے۔ لیون کے پاس نیلا جینز ہے.

Which encoding should be used? I have tried "utf-8", "cp1256", "iso-8859-6", and many more, but I am not able to process these lines.

jorgtied commented 1 year ago

It should be in UTF-8.

106AbdulBasit commented 1 year ago

Surprisingly, when the file was opened on a Mac, it displayed correctly.
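
For readers who hit the same problem, here is a minimal sketch of reading such a file in Python with the encoding stated explicitly rather than left to the platform default. It is an illustration, not something posted in the thread: the helper name `read_pairs`, the file name `sample.txt`, and the assumption that the fields on each line are tab-separated are all hypothetical. The sample line is written to a temporary file first so that the snippet runs on its own.

```python
import os
import tempfile


def read_pairs(path, encoding="utf-8"):
    """Read a text file with an explicit encoding and split each line on tabs.

    Hypothetical helper: assumes one example per line and tab-separated fields.
    """
    pairs = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:  # skip empty lines
                pairs.append(line.split("\t"))
    return pairs


# Write one sample line (taken from the question, tab separator assumed)
# as UTF-8, then read it back with the same explicit encoding.
sample = "Franco has blue jeans.\tفرانکو کے پاس نیلی جینز ہے۔\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    pairs = read_pairs(sample_path)

print(len(pairs), "line(s) read;", len(pairs[0]), "field(s) in the first line")
```

Passing `encoding="utf-8"` on every `open` call keeps the result independent of the operating system's default encoding, which can differ between machines.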