handle non-unicode characters - Githubissues

georgid / AlignmentDuration

Lyrics-to-audio alignment system based on machine learning algorithms: Hidden Markov Models with Viterbi forced alignment. The alignment is explicitly aware of the durations of musical notes. The phonetic models are classified with an MLP deep neural network.

http://mtg.upf.edu/node/3751

GNU Affero General Public License v3.0

56 stars 6 forks

handle non-unicode characters #61

Closed georgid closed 6 years ago

georgid commented 6 years ago

1) When decoding characters, make sure their meaning is not lost. Right now they are ignored as a workaround here, so they are not shown in the .lab output. For example, try different encodings or try to guess the encoding: http://unicodebook.readthedocs.io/guess_encoding.html

2) make sure they are encoded properly here

maybe use .encode('utf-8').strip() instead of str()
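A minimal sketch of the "try different encodings" idea from point 1, written for Python 3 rather than the Python 2.7 this project runs on. The function name `decode_with_fallback` and the encoding order are assumptions for illustration, not code from the repository.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes `raw`.

    Hypothetical helper: the candidate list and its order are assumptions.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Reached only if every candidate failed (latin-1 itself never fails).
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# "no m\u00e1s" is the phrase "no más", written with an escape for portability.
utf8_bytes = "no m\u00e1s".encode("utf-8")
latin1_bytes = "no m\u00e1s".encode("latin-1")

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(latin1_bytes))
```

UTF-8 is tried first because it rejects most byte sequences that are not valid UTF-8, whereas latin-1 accepts any byte and so only makes sense as the last candidate.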

georgid commented 6 years ago

def isUTF8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        return True

georgid commented 6 years ago

This code might be useful, too:

#     s = list(words_ortho)
#
#     # combine two-char diacritics
#     # TODO: not optimal, has to loop over the word for each diacritic type
#
#     # turkish diaeresis
#     s = combineDiacriticsChars(s, u'\u0308')
#
#     # telugu macron
#     s = combineDiacriticsChars(s, u'\u0304')
#
#     # telugu acute
#     s = combineDiacriticsChars(s, u'\u0301')
#
#     # telugu dot below
#     s = combineDiacriticsChars(s, u'\u0323')
#
#     # telugu dot above
#     s = combineDiacriticsChars(s, u'\u0307')
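The implementation of combineDiacriticsChars is not shown in this thread, so the following is only a sketch of one standard-library way to get a similar effect in Python 3: NFC normalization composes a base letter and a following combining mark into a single code point in one pass, for all diacritic types at once. The name `combine_diacritics` is hypothetical.

```python
import unicodedata


def combine_diacritics(text):
    """Compose base letter + combining mark pairs into single code points.

    Hypothetical stand-in, not the repository's combineDiacriticsChars.
    Pairs with no precomposed form in Unicode stay as two code points.
    """
    return unicodedata.normalize("NFC", text)


decomposed = "u\u0308"  # 'u' followed by a combining diaeresis
composed = combine_diacritics(decomposed)

print(len(decomposed), len(composed))
```

This avoids looping over the word once per diacritic type, which is the concern raised in the TODO above.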

georgid commented 6 years ago

This also holds for any issues with acute and other accents, as in Spanish and French: convert letters with such accents to the same letter without the accent.
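One possible way to do this conversion without listing every accented letter, sketched for Python 3: decompose each character with NFD and drop the combining marks. The function name `strip_accents` is an assumption, and letters that have no Unicode decomposition are left unchanged.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters by their base letter (hypothetical helper)."""
    # NFD splits e.g. U+00E1 into 'a' plus the combining acute U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("no m\u00e1s"))  # Spanish example from this thread
print(strip_accents("caf\u00e9"))    # French example
```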

georgid commented 6 years ago

The UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128) is solved by removing /usr/local/lib/python2.7/site-packages/mir_eval-0.4-py2.7.egg-info and leaving only /usr/local/lib/python2.7/site-packages/mir_eval-0.3-py2.7.egg-info. Version 0.3 is installed correctly (it has an installed-files.txt file), unlike version 0.4.

georgid commented 6 years ago

/Users/joro/Documents/VOICE_magix/smule/dataset/692653830_3071180/timed_lyrics.txt has á in the phrase ‘no más’

georgid commented 6 years ago

https://community.esri.com/thread/149400

georgid commented 6 years ago

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 5: ordinal not in range(128)
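In Python 2, calling str() on a unicode object encodes it implicitly with the ASCII codec, which is presumably where this error comes from; that diagnosis is an assumption, since the failing line is not shown here. The sketch below reproduces the same kind of failure in Python 3 with an explicit ASCII encode and contrasts it with the .encode('utf-8').strip() suggestion from the first comment. The sample word is made up so that é lands at position 5.

```python
word = "resum\u00e9 \n"  # made-up sample; U+00E9 sits at index 5

try:
    word.encode("ascii")
    failed_at = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that cannot be encoded.
    failed_at = err.start

# Encoding explicitly as UTF-8 succeeds; strip() then drops the
# trailing whitespace from the resulting bytes.
encoded = word.encode("utf-8").strip()

print(failed_at, encoded)
```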

georgid commented 6 years ago

Decode error solved in commit by assuming latin-1 encoding and manually replacing all accents, macrons, etc. with their respective character without them, e.g. á is replaced by a. Since there is no Spanish dictionary, this results in e.g. más becoming mas, which is then replaced by the closest English word, mask.
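The commit itself is not reproduced in this thread, so the following Python 3 sketch only illustrates the approach described: decode as latin-1, then replace accented letters through a hand-written table. The table contents and the names `ACCENT_MAP` and `decode_and_replace` are assumptions, and the table is deliberately incomplete.

```python
# Hand-maintained replacement table (hypothetical and incomplete).
ACCENT_MAP = str.maketrans({
    "\u00e1": "a",  # a with acute
    "\u00e9": "e",  # e with acute
    "\u00ed": "i",  # i with acute
    "\u00f3": "o",  # o with acute
    "\u00fa": "u",  # u with acute
    "\u00f1": "n",  # n with tilde
})


def decode_and_replace(raw):
    """Decode bytes as latin-1, then map accented letters to plain ones."""
    return raw.decode("latin-1").translate(ACCENT_MAP)


print(decode_and_replace(b"no m\xe1s"))
```

One caveat with this design: latin-1 assigns a character to every byte, so the decode step never raises, and input that is actually UTF-8 would be mis-decoded silently instead of failing.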

Read this for a full understanding of Unicode in Python 3.

TODO: represent all diacritics by their sign, so that we do not need to handle every case manually, as in the code given on 5th January above.