castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Generalize tokenize_json_collection to non-English languages #454

Closed: lintool closed this issue 2 years ago

lintool commented 3 years ago

Let's take this: https://github.com/castorini/pyserini/blob/master/pyserini/tokenize_json_collection.py

And make sure it works for non-English languages, e.g., with the XLM-R or mBERT tokenizers.

cc/ @keleog
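
For illustration only, here is a minimal sketch of what a language-agnostic version could look like. It is not the actual script: the names `tokenize_records` and `tokenize_json_collection` below are hypothetical, and it assumes a JSON Lines collection with a `contents` text field. The tokenizer is passed in as a plain callable, so nothing in the loop is tied to an English-only model.

```python
import json
from typing import Callable, Iterable, Iterator, List


def tokenize_records(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
    field: str = "contents",  # assumed name of the text field
) -> Iterator[str]:
    """Yield JSON lines whose `field` is replaced by space-joined tokens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record[field] = " ".join(tokenize(record[field]))
        # ensure_ascii=False keeps non-Latin scripts readable in the output
        yield json.dumps(record, ensure_ascii=False)


def tokenize_json_collection(
    input_path: str,
    output_path: str,
    tokenize: Callable[[str], List[str]],
) -> None:
    """Hypothetical file-level wrapper: read JSONL, write tokenized JSONL."""
    with open(input_path, encoding="utf-8") as fin, \
            open(output_path, "w", encoding="utf-8") as fout:
        for out_line in tokenize_records(fin, tokenize):
            fout.write(out_line + "\n")
```

With Hugging Face Transformers installed, the callable could be something like `AutoTokenizer.from_pretrained("bert-base-multilingual-cased").tokenize` or the same with `"xlm-roberta-base"`. Both the model choice and the explicit UTF-8 handling are assumptions of this sketch.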

lintool commented 2 years ago

@crystina-z can we close this issue now? It seems to have been done.

crystina-z commented 2 years ago

yea agreed