apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

unicode error when using `text_analytics.count_words` #1378

Open judyboon opened 5 years ago

judyboon commented 5 years ago

I get an error when using `text_analytics.count_words` to process an SArray. The SArray looks like this:

dtype: str
Rows: 5
['ニュースレター会員の皆様、ホテル最大 50% OFF のチャンスをお見逃しなく ! セールは今夜終了 !', 'Save up to 50%', 'Bespaar 50% - De helft van de prijs op vrijdag!', 'Mes notes du 26 déc., à 2 h 26', '[48 hours only] Save up to 50%!']

When running

turicreate.text_analytics.count_words(my_sarray, to_lower=True, delimiters=None)

it gives me the following error:

*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte
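For context, here is a minimal pure-Python sketch of how this exact message can arise. It does not call Turi Create, and it is only an assumption that the library hits the same condition internally: the byte 0xbb is a UTF-8 continuation byte (it is, for example, the last byte of the character セ in the first row), so decoding a buffer that starts in the middle of a character fails with "invalid start byte". The helper `decode_error` is a hypothetical name used only for illustration.

```python
# A fragment of the first row of the SArray above.
text = "セールは今夜終了"
raw = text.encode("utf-8")

# Offsets at which the byte 0xbb appears in the encoded fragment.
offsets = [i for i, b in enumerate(raw) if b == 0xBB]


def decode_error(data):
    """Return the UnicodeDecodeError message for data, or None if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Decoding the whole fragment works (hypothetical check, not Turi Create code).
whole = decode_error(raw)

# Decoding from the middle of a multi-byte character does not.
split = decode_error(raw[offsets[0]:])

print(offsets)
print(whole)
print(split)
```

If the tokenizer splits the text at byte offsets rather than at character boundaries, it would produce this kind of error; that would need to be confirmed against the Turi Create source.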

TC version: 5.1, Python version: 3.6.5

davidswaven commented 5 years ago

Similar issue with some Serbian characters:

tc.text_analytics.count_ngrams(tc.SArray(['Tuširanje']), method='character')

fails with

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 1: unexpected end of data

apparently because the character š is not handled correctly.
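The message above is consistent with n-grams being cut at the byte level rather than the character level. The sketch below is a guess at the mechanism, not Turi Create's actual code: it builds 2-byte windows over the UTF-8 encoding of the string (the window size of 2 and the helper names `byte_ngrams` and `first_decode_error` are assumptions for illustration) and reports the first window that fails to decode.

```python
text = "Tuširanje"
raw = text.encode("utf-8")  # š is encoded as the two bytes 0xc5 0xa1


def byte_ngrams(data, n=2):
    """Hypothetical helper: all n-byte sliding windows over data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def first_decode_error(chunks):
    """Return the message of the first UnicodeDecodeError, or None."""
    for chunk in chunks:
        try:
            chunk.decode("utf-8")
        except UnicodeDecodeError as exc:
            return str(exc)
    return None


# The second window is b'u\xc5', which ends in the middle of š.
message = first_decode_error(byte_ngrams(raw))
print(message)
```

The first failing window is `b'u\xc5'`, which yields the same byte, position, and reason as the error reported here. Windows built over `text` itself (that is, over code points) would decode cleanly.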

Interestingly,

tc.text_analytics.count_ngrams(tc.SArray([u'Tuširanje']), method='word')

does not fail but returns an empty dict:

dtype: dict
Rows: 1
[{}]

TobyRoseman commented 4 years ago

This issue still reproduces with Turi Create 6.4 (on macOS 10.15 and Python 3.7).
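Until this is fixed, one possible workaround is to count words in plain Python, which operates on Unicode strings and never splits a character. This is only a sketch: `count_words_py` is a hypothetical helper, it splits on whitespace only, and it does not reproduce whatever tokenization `count_words` applies when `delimiters=None`.

```python
from collections import Counter


def count_words_py(text, to_lower=True):
    """Hypothetical stand-in for count_words: whitespace-split word counts."""
    if to_lower:
        text = text.lower()
    return dict(Counter(text.split()))


# Rows taken from the examples in this thread.
rows = [
    "Save up to 50%",
    "Mes notes du 26 déc., à 2 h 26",
    "Tuširanje",
]
counts = [count_words_py(row) for row in rows]

for row, result in zip(rows, counts):
    print(row, "->", result)
```

It should be possible to apply such a function to an SArray with `my_sarray.apply(count_words_py, dtype=dict)`, though I have not verified that against the affected versions.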