facebookresearch / LASER

Language-Agnostic SEntence Representations

Trying to get Japanese tokenization to work #54

Open ajkl opened 5 years ago

ajkl commented 5 years ago

I have mecab set up in the right location as mentioned in the docs, but I am not able to get Japanese tokenization working. Has anyone seen this before?

!echo "雪の風景" | python3 ./LASER/source/embed.py \
    --encoder ./LASER/models/bilstm.93langs.2018-12-26.pt \
    --token-lang ja \
    --bpe-codes ./LASER/models/93langs.fcodes \
    --output /data/LASER/LASER-embeddings/jp-titles.vec \
    --verbose
 - Encoder: loading ./LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language ja  
WARNING: No known abbreviations for language 'ja', attempting fall-back to English version...
Traceback (most recent call last):
  File "/home/ubuntu/projects/LASER/source/lib/romanize_lc.py", line 46, in <module>
    for line in args.input:
  File "/home/ubuntu/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
 - fast BPE: processing tok
 - Encoder: bpe to jp-titles.vec
 - Encoder: 0 sentences in 0s
CPU times: user 68 ms, sys: 116 ms, total: 184 ms
Wall time: 3.44 s
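A plausible reading of the traceback: mecab is emitting EUC-JP (its default charset) while romanize_lc.py reads its input as UTF-8. A minimal sketch of that failure mode in Python, assuming an EUC-JP mecab build:

# Minimal repro of the suspected failure mode (assumption: mecab was built
# with its default EUC-JP charset, so its output bytes are not valid UTF-8).
eucjp_bytes = "雪の風景".encode("euc_jp")
eucjp_bytes.decode("utf-8")  # raises UnicodeDecodeError, as in the traceback above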
ajkl commented 5 years ago

Still facing the same issue

ghost commented 5 years ago

I installed mecab like this:

git clone https://github.com/taku910/mecab && \
    cd mecab/mecab && \
    ./configure --enable-utf8-only && \
    make && \
    make check && \
    make install && \
    ldconfig && \
    cd ../mecab-ipadic && \
    ./configure --with-charset=utf8 && \
    make && \
    make install
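
With that build, a quick way to confirm mecab now emits UTF-8 (a minimal standalone check, not part of LASER itself, assuming mecab is on PATH as installed above):

import subprocess

# Sanity check that the freshly built mecab emits UTF-8 wakati output.
out = subprocess.run(
    ["mecab", "-O", "wakati"],
    input="雪の風景".encode("utf-8"),
    stdout=subprocess.PIPE,
    check=True,
)
print(out.stdout.decode("utf-8"))  # expect "雪 の 風景" if the charset is utf8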

After that, I rewrote the TokenLine and Token functions in source/lib/text_processing.py:

#+ ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '')
+ ('| mecab -O wakati -b 50000 ' if lang == 'ja' else '')  # pipe Japanese input through mecab's wakati segmenter
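
In effect, that one-liner pipes each line through mecab's wakati (space-separated) segmenter. A rough standalone sketch of what the patched stage does; the function name and surrounding pipeline are illustrative, not LASER's actual code:

import subprocess

# Illustrative stand-in for the patched tokenizer stage: segment a Japanese
# line into space-separated tokens via mecab's wakati output.
def tokenize_ja(line: str) -> str:
    proc = subprocess.run(
        "mecab -O wakati -b 50000",
        shell=True,
        input=line.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout.decode("utf-8").strip()

print(tokenize_ja("雪の風景"))  # "雪 の 風景"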

With that patch applied, I got this:

echo "雪の風景" | python laser/source/embed.py --encoder laser/models/bilstm.93langs.2018-12-26.pt --token-lang ja --bpe-codes laser/models/93langs.fcodes --output jptest.vec --verbose
 - Encoder: loading laser/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language ja
WARNING: No known abbreviations for language 'ja', attempting fall-back to English version...
 - fast BPE: processing tok
 - Encoder: bpe to jptest.vec
 - Encoder: 1 sentences in 0s

I think the --enable-utf8-only and --with-charset=utf8 options are important when compiling mecab.
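
To verify which charset an existing install was built with, mecab can print its dictionary metadata via its -D/--dictionary-info flag; a minimal check:

import subprocess

# Print the installed system dictionary's metadata; a "charset:" line other
# than utf8 would explain the UnicodeDecodeError earlier in this thread.
info = subprocess.run(["mecab", "-D"], stdout=subprocess.PIPE, check=True)
print(info.stdout.decode("utf-8"))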

MastafaF commented 5 years ago

Hey, I am using the Docker image for Japanese representations, but I have no idea why it is not working properly. It outputs no embeddings for Japanese sentences. Example:

params_ja = {"q": str(sent_ja), "lang": "ja"}  # sent_ja is a sentence in Japanese
resp_ja = requests.get(url=url, params=params_ja).json()
print(resp_ja)

Output: {'content': sent_ja, 'embedding': []}

Any idea on how to solve that?
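
For reference, a self-contained version of the call above; the endpoint path and port here are assumptions about the LASER docker setup, so adjust them to your container mapping:

import requests

# Hypothetical end-to-end call; the URL and port are assumptions, not confirmed values.
url = "http://127.0.0.1:8050/vectorize"
sent_ja = "雪の風景"  # any Japanese sentence
resp_ja = requests.get(url, params={"q": sent_ja, "lang": "ja"}).json()
print(resp_ja["embedding"])  # expected: a non-empty list of vectors, not []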