KentoW / melody-lyrics

All source URLs of the 1,000 songs for creating melody-lyric alignment data.

ValueError: invalid literal for int() with base 10: '\xe3\x82\xb3\xe3\x83\xac' #1

Open pengbo-learn opened 4 years ago

pengbo-learn commented 4 years ago

An error was raised when I ran python align_data_json.py > data.jsonl.

Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
INFO:CoreNLP_JavaServer: CoreNLP pipeline initialized.
INFO:CoreNLP_JavaServer: Waiting for commands on stdin
iconv_open is not supported
Traceback (most recent call last):
  File "align_data_json.py", line 669, in <module>
    main(args)
  File "align_data_json.py", line 661, in main
    data = make_data(dir_name)
  File "align_data_json.py", line 483, in make_data
    song_text = open_text(lyrics_file)
  File "align_data_json.py", line 439, in open_text
    for morph in parse(line.encode('utf-8')):
  File "align_data_json.py", line 65, in parse
    phrase_lyrics = get_phrase(phrase, phrase_info)
  File "align_data_json.py", line 80, in get_phrase
    accent = get_accent("".join([w.split("\t")[0] for w in phrase]))
  File "align_data_json.py", line 201, in get_accent
    acc_position = int(morph.split("\t")[7].split(",")[0])
ValueError: invalid literal for int() with base 10: '\xe3\x82\xb3\xe3\x83\xac'

I wrote morph.split("\t") to a file; its contents are ['\xe3\x81\x93\xe3\x82\x8c', '\xe3\x82\xb3\xe3\x83\xac', '\xe3\x82\xb3\xe3\x83\xac', '\xe6\xad\xa4\xe3\x82\x8c', '\xe4\xbb\xa3\xe5\x90\x8d\xe8\xa9\x9e', '', '', '\xe3\x82\xb3\xe3\x83\xac', '0', '']. I have no idea what to do next to fix this error.
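
For reference, get_accent() expects field 7 of each tab-separated MeCab line to hold a numeric accent position, but in the dump above that field is '\xe3\x82\xb3\xe3\x83\xac', which decodes to the katakana reading コレ, so int() fails. A minimal sketch of a guarded version of that conversion (a hypothetical helper, not the repository's code) could look like this:

def safe_accent_position(morph):
    # morph is one tab-separated MeCab output line, as iterated over
    # by get_accent() in align_data_json.py.
    fields = morph.split("\t")
    if len(fields) > 7:
        head = fields[7].split(",")[0]
        if head.isdigit():
            return int(head)
    # Field 7 is missing or non-numeric (here it holds a reading),
    # which means the active dicrc does not emit the accent position
    # where the script expects it.
    return None

Such a guard only makes the mismatch explicit; the underlying cause is the dicrc output format, as noted in the next comment.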

pengbo-learn commented 4 years ago

If I use UniDic's original dicrc, the error disappears, i.e., undo the mv dic/dicrc dic/unidic/ command from the README.
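
For anyone hitting the same problem, a quick way to check which tab-separated column the accent position actually lands in (and whether it matches the index 7 that align_data_json.py reads) is to feed one word to MeCab and print the fields with their indices. This is only a sketch; the dictionary path dic/unidic is assumed from the mv command in the README:

# -*- coding: utf-8 -*-
import subprocess

# Parse one word with the UniDic dictionary set up by the README
# (path assumed) and dump each tab-separated field with its index.
proc = subprocess.Popen(["mecab", "-d", "dic/unidic"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(u"これ\n".encode("utf-8"))
for line in out.decode("utf-8").splitlines():
    if line == "EOS":
        continue
    for i, field in enumerate(line.split("\t")):
        print("field %d: %s" % (i, field))

If the accent position does not appear at index 7, the active dicrc (the repository's dic/dicrc versus UniDic's original one) formats the output differently from what get_accent() assumes.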