Closed georgid closed 6 years ago
def isUTF8(data):
    try:
        data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        return True
This code might be useful, too:
# s = list(words_ortho)
#@@@ combine two-char diacritics:
# TODO: not optimal, has to loop over the word once for each diacritic type
# # turkish diaeresis
# s = combineDiacriticsChars(s, u'\u0308')
#
# # telugu macron
# s = combineDiacriticsChars(s, u'\u0304')
#
# # telugu acute
# s = combineDiacriticsChars(s, u'\u0301')
#
# # telugu dot below
# s = combineDiacriticsChars(s, u'\u0323')
#
# # telugu dot above
# s = combineDiacriticsChars(s, u'\u0307')
This is true for any issues with acute and similar accents, as in Spanish and French. Convert letters with such accents to the same letter without the accent.
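A minimal sketch of that conversion, assuming the text has already been decoded to unicode. The helper name strip_accents is hypothetical and not part of the repository. It relies only on the standard unicodedata module, so no per-accent replacement table is needed:

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits each accented letter into its base letter followed by
    # one or more combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() is non-zero for combining marks, so this
    # keeps only the base characters.
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(u'no m\u00e1s'))
```

Letters that have no canonical decomposition (for example the Turkish dotless ı) pass through unchanged, so those would still need separate handling.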
The error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128) is solved by removing /usr/local/lib/python2.7/site-packages/mir_eval-0.4-py2.7.egg-info and leaving only /usr/local/lib/python2.7/site-packages/mir_eval-0.3-py2.7.egg-info. Version 0.3 is installed correctly (it has an installed-files.txt file), unlike version 0.4.
/Users/joro/Documents/VOICE_magix/smule/dataset/692653830_3071180/timed_lyrics.txt has á in the phrase ‘no más’
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 5: ordinal not in range(128)
Decode error solved in commit by assuming latin-1 encoding and manually replacing every accent, macron, etc. with the respective character without it, e.g. á is replaced by a. Since there is no Spanish dictionary, this results in e.g. más becoming mas, which is then replaced by the closest English word, mask.
Read this for a full understanding of Unicode in Python 3.
TODO: represent all diacritics by their sign, so that we do not need to handle every case manually, as in the code given on 5th January above.
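One possible way to do this in a single pass, sketched under the assumption that combineDiacriticsChars merges a combining mark into the character before it (its actual implementation is not shown in this thread). The function name combine_diacritics is hypothetical. It uses unicodedata.combining to detect any mark, so diaeresis, macron, acute, dot below and dot above need no separate calls:

```python
# -*- coding: utf-8 -*-
import unicodedata


def combine_diacritics(chars):
    """Merge each combining mark into the preceding list entry.

    chars is a list of one-character unicode strings, e.g. list(words_ortho).
    Hypothetical stand-in for the repeated combineDiacriticsChars calls.
    """
    combined = []
    for ch in chars:
        if combined and unicodedata.combining(ch):
            # Attach the mark to the previous entry; NFC yields the
            # precomposed letter when Unicode defines one.
            combined[-1] = unicodedata.normalize('NFC', combined[-1] + ch)
        else:
            combined.append(ch)
    return combined


# u + combining diaeresis (U+0308) ends up as one list entry
s = combine_diacritics(list(u'mu\u0308zik'))
```

When no precomposed letter exists, the entry stays as base letter plus mark in one list element, so it still counts as a single character position in the word.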
1) When decoding characters, make sure their meaning is not lost (for now they are ignored as a workaround here, but then they are not shown in the .lab output), e.g. try different encodings or try to guess the encoding: http://unicodebook.readthedocs.io/guess_encoding.html
2) Make sure they are encoded properly here
maybe use
.encode('utf-8').strip()
instead of str()
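A rough sketch of both points, assuming the lyrics are read as raw bytes. The names decode_lyrics and to_utf8_bytes and the candidate encoding list are illustrative assumptions, not code from the repository. UTF-8 is tried first because it rejects invalid byte sequences, while latin-1 accepts any byte and therefore only works as a last resort:

```python
# -*- coding: utf-8 -*-


def decode_lyrics(raw, encodings=('utf-8', 'latin-1')):
    """Decode raw bytes by trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


def to_utf8_bytes(text):
    """Encode text explicitly instead of calling str() on it."""
    # In Python 2, str() on unicode text uses the ascii codec and raises
    # UnicodeEncodeError on characters such as u'\xe9'.
    return text.encode('utf-8').strip()


phrase = decode_lyrics(b'no m\xe1s')   # latin-1 bytes, UTF-8 attempt fails
encoded = to_utf8_bytes(phrase)
```

Trying encodings in a fixed order is only a heuristic; it cannot tell latin-1 apart from other single-byte encodings.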