castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Support for jsonl.gz input in pyserini.encode #1908

Open ftvalentini opened 1 month ago

It would be nice to support reading jsonl.gz files when encoding a corpus with a dense encoder via python -m pyserini.encode. The corpus file is currently opened here:

https://github.com/castorini/pyserini/blob/b7e1da305dd31b195244d49321087505996260c6/pyserini/encode/_base.py#L133

Maybe with:

#...
import gzip

# gzip.open defaults to binary mode, so request text mode ("rt") explicitly
open_handle = gzip.open if filename.endswith(".gz") else open
with open_handle(filename, "rt") as f:
#...

That way, both pyserini.index.lucene and pyserini.encode would accept jsonl.gz files as input.
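
For illustration, here is a minimal standalone sketch of the idea, not the actual Pyserini code. The helper names open_maybe_gzip, read_jsonl, and demo are hypothetical, and the two sample documents are made up for the round-trip check. It shows the same extension check applied to both a plain and a gzipped JSONL file, opening both in text mode so each line comes back as str rather than bytes:

```python
import gzip
import json
import os
import tempfile


def open_maybe_gzip(filename, encoding="utf-8"):
    """Hypothetical helper: open a plain or gzip-compressed text file for reading."""
    open_handle = gzip.open if filename.endswith(".gz") else open
    # "rt" forces text mode; gzip.open would otherwise return bytes.
    return open_handle(filename, "rt", encoding=encoding)


def read_jsonl(filename):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open_maybe_gzip(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def demo():
    """Write the same sample corpus as .jsonl and .jsonl.gz, then read both back."""
    # Made-up sample documents, used only for this check.
    docs = [{"id": "d1", "contents": "hello"}, {"id": "d2", "contents": "world"}]
    payload = "".join(json.dumps(d) + "\n" for d in docs)
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "corpus.jsonl")
        zipped = os.path.join(tmp, "corpus.jsonl.gz")
        with open(plain, "w", encoding="utf-8") as f:
            f.write(payload)
        with gzip.open(zipped, "wt", encoding="utf-8") as f:
            f.write(payload)
        return list(read_jsonl(plain)), list(read_jsonl(zipped))


if __name__ == "__main__":
    plain_docs, gz_docs = demo()
    print(plain_docs == gz_docs)
```

Dispatching on the file extension keeps the change small, but it assumes compressed files are always named with a .gz suffix; checking the gzip magic bytes would be an alternative if that assumption does not hold.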