castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Fix tokenizer length issue with DJL upgrade #2536

Closed lintool closed 2 months ago

lintool commented 2 months ago

Another take on #2535

After digging through the source code: https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/HuggingFaceTokenizer.java

I was able to override the tokenizer's length limit so that it tokenizes arbitrarily long sequences.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 67.18%. Comparing base (98e4866) to head (6185321).

| Files | Patch % | Lines |
|---|---|---|
| ...n/java/io/anserini/analysis/CompositeAnalyzer.java | 66.66% | 0 Missing and 1 partial :warning: |
| ...nserini/analysis/HuggingFaceTokenizerAnalyzer.java | 66.66% | 0 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff              @@
##             master    #2536      +/-   ##
============================================
+ Coverage     67.14%   67.18%   +0.03%
- Complexity     1479     1481       +2
============================================
  Files           219      219
  Lines         12641    12645       +4
  Branches       1528     1528
============================================
+ Hits           8488     8495       +7
+ Misses         3625     3624       -1
+ Partials        528      526       -2
```


ToluClassics commented 2 months ago

This works. Alternatively, we could just set the truncation flag to "DO_NOT_TRUNCATE":

options.put("truncation", "DO_NOT_TRUNCATE");

Ref: https://github.com/deepjavalibrary/djl/blob/cac16a1918af9b033b62c2bd51acd0f1cabd8a30/extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/HuggingFaceTokenizer.java#L556

lintool commented 2 months ago

I've confirmed that all these affected regressions pass:

python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ar-aca > logs/log.miracl-v1.0-ar-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-bn-aca > logs/log.miracl-v1.0-bn-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-en-aca > logs/log.miracl-v1.0-en-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-es-aca > logs/log.miracl-v1.0-es-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-fa-aca > logs/log.miracl-v1.0-fa-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-fi-aca > logs/log.miracl-v1.0-fi-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-fr-aca > logs/log.miracl-v1.0-fr-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-hi-aca > logs/log.miracl-v1.0-hi-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-id-aca > logs/log.miracl-v1.0-id-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ja-aca > logs/log.miracl-v1.0-ja-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ko-aca > logs/log.miracl-v1.0-ko-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ru-aca > logs/log.miracl-v1.0-ru-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-sw-aca > logs/log.miracl-v1.0-sw-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-te-aca > logs/log.miracl-v1.0-te-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-th-aca > logs/log.miracl-v1.0-th-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-zh-aca > logs/log.miracl-v1.0-zh-aca.txt 2>&1

python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ar-aca > logs/log.mrtydi-v1.1-ar-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-bn-aca > logs/log.mrtydi-v1.1-bn-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-en-aca > logs/log.mrtydi-v1.1-en-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-fi-aca > logs/log.mrtydi-v1.1-fi-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-id-aca > logs/log.mrtydi-v1.1-id-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ja-aca > logs/log.mrtydi-v1.1-ja-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ko-aca > logs/log.mrtydi-v1.1-ko-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ru-aca > logs/log.mrtydi-v1.1-ru-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-sw-aca > logs/log.mrtydi-v1.1-sw-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-te-aca > logs/log.mrtydi-v1.1-te-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-th-aca > logs/log.mrtydi-v1.1-th-aca.txt 2>&1

python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.wp-ca > logs/log.msmarco-v1-passage.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-doc.wp-ca > logs/log.msmarco-v1-doc.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-doc-segmented.wp-ca > logs/log.msmarco-v1-doc-segmented.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.wp-ca > logs/log.dl19-passage.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc.wp-ca > logs/log.dl19-doc.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc-segmented.wp-ca > logs/log.dl19-doc-segmented.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-passage.wp-ca > logs/log.dl20-passage.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc.wp-ca > logs/log.dl20-doc.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc-segmented.wp-ca > logs/log.dl20-doc-segmented.wp-ca.txt 2>&1

python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.wp-hgf > logs/log.msmarco-v1-passage.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-doc.wp-hgf > logs/log.msmarco-v1-doc.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.wp-hgf > logs/log.dl19-passage.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc.wp-hgf > logs/log.dl19-doc.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-passage.wp-hgf > logs/log.dl20-passage.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc.wp-hgf > logs/log.dl20-doc.wp-hgf.txt 2>&1
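
The command list above follows a regular pattern, so it could be regenerated rather than maintained by hand. The following is a minimal sketch, not part of the PR: the helper names `regression_names` and `command_for` are hypothetical, and the language and collection lists are copied from the commands above. It only prints the command lines and does not run them.

```python
# Sketch: rebuild the regression command lines listed above from their parts.
# The helper names are illustrative; the regression names come from the thread.

MIRACL_LANGS = ["ar", "bn", "en", "es", "fa", "fi", "fr", "hi",
                "id", "ja", "ko", "ru", "sw", "te", "th", "zh"]
MRTYDI_LANGS = ["ar", "bn", "en", "fi", "id", "ja", "ko", "ru", "sw", "te", "th"]
WP_COLLECTIONS = ["msmarco-v1", "dl19", "dl20"]


def regression_names():
    """Return the regression names in the same order as the list above."""
    names = [f"miracl-v1.0-{lang}-aca" for lang in MIRACL_LANGS]
    names += [f"mrtydi-v1.1-{lang}-aca" for lang in MRTYDI_LANGS]
    # The wp-ca group covers passage, doc, and doc-segmented.
    names += [f"{coll}-{kind}.wp-ca"
              for coll in WP_COLLECTIONS
              for kind in ("passage", "doc", "doc-segmented")]
    # The wp-hgf group covers passage and doc only.
    names += [f"{coll}-{kind}.wp-hgf"
              for coll in WP_COLLECTIONS
              for kind in ("passage", "doc")]
    return names


def command_for(name):
    """Format one command line, redirecting stdout and stderr to a log file."""
    return ("python src/main/python/run_regression.py --index --verify --search "
            f"--regression {name} > logs/log.{name}.txt 2>&1")


if __name__ == "__main__":
    for name in regression_names():
        print(command_for(name))
```

Because each command redirects its output into `logs/`, that directory needs to exist before the printed commands are run.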