Open kaihuchen opened 7 years ago
normally, for chinese text processing, there is one more step expected before the modeling: word segment. and it is part of the season why chinese NLP can't always get the same precision/recall with the same model as west languages do.
I am experimenting this code on East Asian languages such as Chinese or Japanese, which do not have apparent word boundary like the white space in Latin languages. Looking into the code I found that batch_loader.py is splitting each line using white space, which I thought essentially makes it impossible to use this code on those languages.