castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

get_document_vector() and get_postings_list() Stemming ? #47

Closed poulain-tim closed 4 years ago

poulain-tim commented 4 years ago

Hi @lintool! I have a new issue: I created a new index for the "DUC-2001" dataset with this command:

sh anserini/target/appassembler/bin/IndexCollection \
    -collection TrecCollection \
    -generator JsoupGenerator \
    -threads 2 \
    -input ${EXP}/ \
    -index indexes/lucene-index.XXX \
    -storePositions -storeDocvectors -storeRawDocs

I also installed the Luke toolbox to understand how the index works.

When I run this code:

for id_ in docid:
    doc_vector = index_utils.get_document_vector(id_)
    bm25_score_one_doc = {}
    for term_ in doc_vector:
        postings_list = index_utils.get_postings_list(term_)

it works for some terms but not for all:

Traceback (most recent call last):
  File "doc2index_2.py", line 50, in <module>
    postings_list = index_utils.get_postings_list(term_)
  File "/home/poulain/.local/lib/python3.6/site-packages/pyserini/index/pyutils.py", line 118, in get_postings_list
    postings_list = self.object.getPostingsList(self.reader, JString(term))
  File "jnius/jnius_export_class.pxi", line 768, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 934, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.NullPointerException

I think there are two different indexes: the first one applies stemming (the word "Cherokee" becomes "cheroke"), and the second keeps the word without stemming.

So, how can I apply stemming to the postings index?

Best regards

lintool commented 4 years ago

hi @Oulaolay - welcome!

To be clear, you'd want a variant of get_postings_list that takes an already analyzed term, right?
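
To make the request concrete, here is a minimal, self-contained sketch of why looking up an already analyzed term can miss. Everything in it is an assumption for illustration: `toy_analyze`, `ToyIndex`, and the `analyze` flag are hypothetical stand-ins, not Pyserini's or Anserini's actual API, and the toy "stemmer" only strips one trailing "e". The point is that an analyzer applied twice may not give the same result as when applied once, so a term taken from a document vector can be rewritten into something that is not in the index.

```python
# Toy illustration only: NOT Lucene's Porter stemmer and NOT Pyserini's API.

def toy_analyze(term):
    """Lowercase and strip one trailing 'e' (a crude stand-in for stemming)."""
    term = term.lower()
    return term[:-1] if term.endswith("e") else term


class ToyIndex:
    def __init__(self, docs):
        # Map each analyzed term to the set of doc ids that contain it.
        collected = {}
        for docid, text in docs.items():
            for token in text.split():
                collected.setdefault(toy_analyze(token), set()).add(docid)
        # Store postings as sorted lists for stable output.
        self.postings = {term: sorted(ids) for term, ids in collected.items()}

    def get_postings_list(self, term, analyze=True):
        # `analyze` is a hypothetical flag: when False, the term is assumed
        # to be already analyzed and is looked up as-is.
        key = toy_analyze(term) if analyze else term
        return self.postings.get(key)  # None when the term is not indexed


index = ToyIndex({"d1": "Cherokee nation", "d2": "the Cherokee language"})

stored_term = toy_analyze("Cherokee")           # the form kept in the index
missed = index.get_postings_list(stored_term)   # analyzed a second time
found = index.get_postings_list(stored_term, analyze=False)

print(stored_term, missed, found)
```

In this sketch the second lookup succeeds because it skips the analyzer; a variant of `get_postings_list` that accepts an already analyzed term would play the same role.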

There's actually already an outstanding issue: https://github.com/castorini/anserini/issues/990

I'm not sure when we'll get to it... but you're welcome to send a pull request...

lintool commented 4 years ago

haha, got to it!

poulain-tim commented 4 years ago

Thanks for all these modifications! I tried to create a new branch to contribute to this project, but it seems I don't have permission to make pull requests. Can you grant me this permission?

The errors that I found are in pyclass.py:

JEnglishStemmingAnalyzer = autoclass('io.anserini.analysis.EnglishStemmingAnalyzer')

should become

JEnglishStemmingAnalyzer = autoclass('io.anserini.analysis.DefaultEnglishAnalyzer')

I also get an error on this line:

JTokenizeOnlyAnalyzer = autoclass('io.anserini.analysis.TokenizeOnlyAnalyzer')

  File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/TokenizeOnlyAnalyzer'

This class isn't present in anserini-0.7.3-fatjar.jar.

Thanks !

Best Regards !

chriskamphuis commented 4 years ago

Hi @Oulaolay,

The errors are because of a recent change in Anserini. Pyserini needs to be changed accordingly. I already submitted a PR for this. In order to make a PR you can fork the repository and push to the fork. Then you can create a PR with your fork.
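
As an aside on the class rename itself, one way a wrapper could tolerate such a change is to try several class names in order. The sketch below is an assumption for illustration, not what Pyserini does: `load_first_available` and `fake_loader` are hypothetical names, and the loader is passed in as a parameter so the example runs without a JVM. In real use the loader would be something like jnius's `autoclass`, which raises an exception when the class is missing, as the traceback above shows.

```python
def load_first_available(loader, candidates):
    """Return loader(name) for the first candidate name that loads."""
    errors = []
    for name in candidates:
        try:
            return loader(name)
        except Exception as exc:  # the loader signals a missing class by raising
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate could be loaded: " + "; ".join(errors))


def fake_loader(name):
    # Hypothetical stand-in for a real class loader: it pretends that only
    # the newer class name exists in the jar.
    available = {"io.anserini.analysis.DefaultEnglishAnalyzer"}
    if name not in available:
        raise LookupError(f"Class not found {name}")
    return name


analyzer_class = load_first_available(
    fake_loader,
    [
        "io.anserini.analysis.EnglishStemmingAnalyzer",  # older name
        "io.anserini.analysis.DefaultEnglishAnalyzer",   # newer name
    ],
)
print(analyzer_class)
```

Pinning matching Pyserini and Anserini versions is the simpler fix; a fallback like this only helps when one code base has to work against several jar versions.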

poulain-tim commented 4 years ago

Perfect! I'll know for next time.

Have a good day @Chriskamphuis