castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0
1.03k stars 457 forks

Possible Issue when Indexing with Llama Tokenizers #2224

Closed Andrwyl closed 1 year ago

Andrwyl commented 1 year ago

I'm running into a possible issue when indexing with CodeLlama tokenizers loaded from local files. The environment is the Narval cluster on Compute Canada.

When running the following script:

python -m pyserini.index.lucene  \
        -collection MrTyDiCollection \
        -generator DefaultLuceneDocumentGenerator \
        -threads 12 \
        -input $collection_dir \
        -index $_index_dir \
        -storePositions -storeRaw -storeDocvectors \
        -analyzeWithHuggingFaceTokenizer [PATH TO LOCAL TOKENIZER] \
        -optimize

When PATH TO LOCAL TOKENIZER points to a CodeLlama tokenizer (Instruct or otherwise), I get the following error:

2023-10-13 15:41:50,345 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:348) - pool-2-thread-1: Unexpected Exception:
java.lang.NullPointerException: null
        at org.apache.lucene.document.Field.tokenStream(Field.java:486) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1122) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:690) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:575) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:242) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1817) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1470) ~[anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:315) [anserini-0.22.1-SNAPSHOT-fatjar.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2023-10-13 15:41:51,720 WARN  [main] index.IndexCollection (IndexCollection.java:575) - Unexpected difference between number of indexed documents and index maxDoc.

Past experience has shown that this error actually wraps any tokenizer-loading failure: if you pass an invalid tokenizer name (even something random), it triggers the same stack trace. So it is entirely likely that it is masking a different underlying error here.

Some notes: the local CodeLlama files themselves are fine. Loading them via Python's transformers AutoTokenizer works perfectly, and loading them via a standalone Maven project with the DJL tokenizers library also works.

The problem is also unique to CodeLlama: the same method with T5, BERT, GPT, and XLNet tokenizers succeeds with no issue.

Basically, there are two possibilities:

  1. There is something wrong with CodeLlama tokenizers that makes them the only tokenizers that fail to load (though this is strange because, as stated above, loading the same local files via DJL tokenizers in a standalone Maven project succeeds)
  2. The mysterious Unexpected difference between number of indexed documents and index maxDoc. error hides a different error, which is what actually causes CodeLlama to fail

Does anyone more familiar with the codebase have some insight on this phenomenon?

crystina-z commented 1 year ago

@theyorubayesian by any chance, do you have a better idea of the possible issue behind loading the Hgf tokenizers?

Andrwyl commented 1 year ago

I've located the real error message that triggers when trying to load CodeLlama:

    java.lang.RuntimeException: data did not match any variant of untagged enum NormalizerWrapper at line 85 column 3

which is thrown when running the line this.tokenizer = HuggingFaceTokenizer.newInstance(path, options);

Hopefully that gives more of an idea of what the issue is to anyone.
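For anyone hitting this later: that error comes from the Rust tokenizers deserializer failing on the "normalizer" section of tokenizer.json, where older parsers don't recognize newer variants. A quick way to see which variants a local tokenizer declares is a stdlib sketch like the one below (the inline sample and the normalizer_types helper are illustrative assumptions, not part of Anserini; the sample follows the Llama-family layout, which uses a Prepend normalizer):

```python
import json

# Inline sample standing in for a CodeLlama-style tokenizer.json
# (assumption: the real file follows this Llama-family layout).
sample = json.loads("""
{
  "normalizer": {
    "type": "Sequence",
    "normalizers": [
      {"type": "Prepend", "prepend": "\\u2581"},
      {"type": "Replace", "pattern": {"String": " "}, "content": "\\u2581"}
    ]
  }
}
""")

def normalizer_types(config):
    """Return the list of normalizer 'type' tags declared in a tokenizer config."""
    norm = config.get("normalizer")
    if norm is None:
        return []
    if norm.get("type") == "Sequence":
        return [n["type"] for n in norm.get("normalizers", [])]
    return [norm["type"]]

# Variants unknown to an old parser (e.g. "Prepend") show up here.
print(normalizer_types(sample))  # → ['Prepend', 'Replace']
```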

Andrwyl commented 1 year ago

Resolved! CodeLlama is incompatible with DJL tokenizers 0.21; upgrading to DJL 0.23.0 works.
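For reference, in a Maven build the fix amounts to bumping the DJL tokenizers dependency, along these lines (a sketch; the coordinates are assumed to match DJL's published artifacts):

```xml
<dependency>
  <groupId>ai.djl.huggingface</groupId>
  <artifactId>tokenizers</artifactId>
  <version>0.23.0</version>
</dependency>
```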