Open ajkl opened 5 years ago
Still facing the same issue
I installed mecab like this:
git clone https://github.com/taku910/mecab && \
cd mecab/mecab && \
./configure --enable-utf8-only && \
make && \
make check && \
make install && \
ldconfig && \
cd ../mecab-ipadic && \
./configure --with-charset=utf8 && \
make && \
make install
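After building like this, a quick sanity check helps before touching LASER itself. This is just a sketch (the helper name is mine, not part of LASER): it verifies a `mecab` binary is on PATH and can segment UTF-8 input via the `-O wakati` output mode.

```python
import shutil
import subprocess

def mecab_available():
    """Return True if a `mecab` binary is on PATH and segments UTF-8 input.

    Hypothetical helper for sanity-checking the install above; not part of LASER.
    """
    if shutil.which("mecab") is None:
        return False
    try:
        # -O wakati prints space-separated tokens, the same mode LASER pipes through
        out = subprocess.run(
            ["mecab", "-O", "wakati"],
            input="雪の風景",
            capture_output=True,
            text=True,
            timeout=10,
        )
        return out.returncode == 0 and out.stdout.strip() != ""
    except (OSError, subprocess.TimeoutExpired):
        return False
```

If this returns False, the empty-embedding symptom below is expected, since the tokenizer pipeline silently produces no tokens.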
After that, I rewrote the TokenLine and Token functions in source/lib/text_processing.py:
# original: + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '')
+ ('| mecab -O wakati -b 50000 ' if lang == 'ja' else '')  # run the mecab found on PATH
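The one-line change above just swaps the hard-coded MECAB prefix for whatever `mecab` is on PATH. As a standalone sketch (function name and `mecab_prefix` parameter are mine, for illustration only), the shell fragment appended to the tokenizer pipeline could be built like this:

```python
def mecab_segment_cmd(lang, mecab_prefix=None):
    """Build the shell fragment appended to the tokenizer pipeline for Japanese.

    Hypothetical helper mirroring the patched line in text_processing.py:
    mecab_prefix=None means rely on a `mecab` found on PATH; otherwise use
    the binary under the given install prefix (the original LASER behavior).
    """
    if lang != "ja":
        return ""  # non-Japanese input is not piped through mecab
    binary = "mecab" if mecab_prefix is None else mecab_prefix + "/bin/mecab"
    return "| " + binary + " -O wakati -b 50000 "
```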
and got this:
echo "雪の風景" | python laser/source/embed.py --encoder laser/models/bilstm.93langs.2018-12-26.pt --token-lang ja --bpe-codes laser/models/93langs.fcodes --output jptest.vec --verbose
- Encoder: loading laser/models/bilstm.93langs.2018-12-26.pt
- Tokenizer: in language ja
WARNING: No known abbreviations for language 'ja', attempting fall-back to English version...
- fast BPE: processing tok
- Encoder: bpe to jptest.vec
- Encoder: 1 sentences in 0s
I think the --enable-utf8-only and --with-charset=utf8 options are important when compiling mecab.
Hey, I am using the Docker image for Japanese sentence representations, but I have no idea why it is not working properly: it outputs no embeddings for Japanese sentences. Example:
params_ja = {"q": str(sent_ja), "lang": "ja"} # sent_ja is a **sentence in Japanese**
resp_ja = requests.get(url=url, params=params_ja).json()
print(resp_ja)
Output:
{'content': sent_ja, 'embedding': []}
Any idea on how to solve that?
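One way to sanity-check the server response (helper name and the dim=1024 assumption are mine): an empty 'embedding' list usually means the tokenizer produced no tokens for the input, e.g. because mecab is missing inside the container, rather than a transport error.

```python
def check_embedding_response(resp, expected_dim=1024):
    """Return True if a LASER server response carries a plausible embedding.

    Hypothetical checker: `resp` is the parsed JSON shown above, e.g.
    {'content': ..., 'embedding': [...]}. An empty list fails the check.
    """
    emb = resp.get("embedding", [])
    # a valid payload holds one or more full vectors of expected_dim floats
    return len(emb) > 0 and len(emb) % expected_dim == 0
```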
I got the mecab setup in the right location as mentioned in the docs, but I am not able to get the Japanese tokenization working. Has anyone seen this before?