carpedm20 / lstm-char-cnn-tensorflow

in progress
MIT License

Problem testing on East Asian languages #16

Open kaihuchen opened 7 years ago

kaihuchen commented 7 years ago

I am experimenting with this code on East Asian languages such as Chinese and Japanese, which do not mark word boundaries explicitly the way whitespace does in Latin-script languages. Looking into the code, I found that batch_loader.py splits each line on whitespace, which I think essentially makes it impossible to use this code on those languages as-is.
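To make the problem concrete, here is a minimal sketch of what whitespace splitting does to an unsegmented line. It does not reproduce the actual code in batch_loader.py; it only assumes the loader calls something equivalent to `str.split()` on each line, and the sample sentences are made up for illustration.

```python
# Illustration only: whitespace splitting applied to an English line
# and to a Chinese line that contains no spaces.
# Assumption: the loader does something equivalent to line.split().
english = "the cat sat on the mat"
chinese = "我喜欢自然语言处理"  # hypothetical sample sentence

english_tokens = english.split()
chinese_tokens = chinese.split()

print(len(english_tokens))  # 6 separate words
print(len(chinese_tokens))  # 1: the whole sentence is treated as one "word"
```

With input like this, every line becomes a single vocabulary entry, so the word-level part of the model has nothing meaningful to learn from.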

Victoriayhk commented 7 years ago

Normally, Chinese text processing needs one more step before modeling: word segmentation. It is also part of the reason why Chinese NLP does not always reach the same precision/recall as Western languages with the same model.
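As a rough sketch of what such a preprocessing step could look like, the snippet below segments a sentence with forward maximum matching against a tiny dictionary and then joins the words with spaces, so that a whitespace-based loader can consume the result. The function name `forward_max_match`, the `toy_vocab` dictionary, and the `max_len` value are all hypothetical and chosen for illustration; in practice a dedicated segmenter (for example jieba for Chinese or MeCab for Japanese) would likely do a much better job.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop terminates.
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one would hold many thousands of words.
toy_vocab = {"我", "喜欢", "自然", "语言", "处理", "自然语言"}

segmented = forward_max_match("我喜欢自然语言处理", toy_vocab)

# Re-join with spaces so that whitespace splitting recovers the words.
line_for_loader = " ".join(segmented)
print(line_for_loader)
```

Segmentation errors made at this stage propagate into the model, which is one way the extra step can cost accuracy compared with languages that already mark word boundaries.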