amansrivastava17 / embedding-as-service

One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
MIT License

UTF-8 Encoding For Glove Embedding #44

Closed prasys closed 4 years ago

prasys commented 4 years ago

Problem statement: It looks like UTF-8 isn't being handled on Windows. By default, Windows uses the Windows-1252 encoding (https://en.wikipedia.org/wiki/Windows-1252), so a file opened without an explicit encoding is decoded as Windows-1252 rather than UTF-8.

Why does it happen: This causes `UnicodeDecodeError: 'charmap' codec can't decode byte 0x90` on Windows when you run the GloVe embedding and the GloVe file contains UTF-8 encoded words whose bytes have no mapping in Windows-1252. Hence, the GloVe file should be read with the encoding made explicit as UTF-8.

What's the fix: Open the file with an explicit UTF-8 encoding so that it is decoded correctly on Windows. There are no side effects on OSX/Linux (I've tested on both).
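As a rough illustration of the fix described above (not the project's actual loader), the sketch below reads a GloVe-style text file with `encoding="utf-8"` passed to `open()`. The names `load_glove_vectors`, `demo`, and `sample_glove.txt` are hypothetical, and the sample file is a two-line stand-in for a real GloVe file. The demo also shows that the same bytes fail to decode as cp1252 (Windows-1252), which is the error reported here.

```python
import tempfile
from pathlib import Path


def load_glove_vectors(path):
    """Parse a GloVe-style text file: a token followed by its float components on each line.

    Hypothetical helper for illustration only.
    """
    vectors = {}
    # Explicit encoding: without it, open() falls back to the locale
    # encoding, which is commonly cp1252 on Windows.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def demo():
    """Write a tiny UTF-8 sample, show cp1252 decoding fails, then load it as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample_glove.txt"
        # "\u0410" (Cyrillic capital A) is encoded in UTF-8 as bytes 0xD0 0x90;
        # byte 0x90 has no mapping in cp1252.
        sample.write_bytes("\u0410 0.1 0.2\nhello 0.3 0.4\n".encode("utf-8"))
        try:
            sample.read_text(encoding="cp1252")
            cp1252_failed = False
        except UnicodeDecodeError:
            cp1252_failed = True
        return cp1252_failed, load_glove_vectors(sample)


if __name__ == "__main__":
    failed, vectors = demo()
    print("cp1252 decode failed:", failed)
    print("tokens loaded as UTF-8:", len(vectors))
```

Passing the encoding explicitly makes the behaviour independent of the platform default, which is why the change should be harmless on OSX/Linux, where the default is usually already UTF-8.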

prasys commented 4 years ago

Thanks a lot @amansrivastava17. I appreciate the work that you and your team have put into making a service that is easy for people to use.

Cheers 👍

amansrivastava17 commented 4 years ago

@all-contributors please add prasys for bug fix related UTF-8 Encoding For Glove Embedding

allcontributors[bot] commented 4 years ago

@amansrivastava17

I've put up a pull request to add @prasys! :tada: