castorini / castor

PyTorch deep learning models for text processing
http://castor.ai/
Apache License 2.0
178 stars, 58 forks

add separator #167

Closed: Victor0118 closed this 5 years ago

Victor0118 commented 5 years ago

@daemon Could you take a look at this PR?

When we generate the raw text, we split it by " ". To count the number of words, we should keep the separator consistent: I get different word counts from .split() and .split(" ").

Victor0118 commented 5 years ago

Some samples

>>> "test a ".split()
['test', 'a']
>>> "test a ".split(" ")
['test', 'a', '']
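
To make the effect on word counts concrete, here is a minimal sketch. The helper name `count_words` and the sample strings are hypothetical and used only for illustration; they are not taken from the castor code base. It compares the count from whitespace splitting (`sep=None`, the same as `.split()`) with the count from splitting on a single space.

```python
def count_words(text, sep=None):
    """Count tokens in text.

    sep=None behaves like text.split(): runs of whitespace are one
    separator and leading/trailing whitespace is ignored.
    sep=" " behaves like text.split(" "): every single space is a
    separator, so empty strings can appear in the result.
    """
    return len(text.split(sep))


# Hypothetical inputs: trailing, doubled, and leading spaces.
samples = ["test a", "test a ", "test  a", " test a"]

for s in samples:
    # Print the string next to both counts to show where they diverge.
    print(repr(s), count_words(s), count_words(s, " "))
```

The two counts agree only when words are separated by exactly one space and there is no leading or trailing space, so generation and counting need to use the same separator.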