castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Tweak regression scores due to DJL upgrade #2535

Closed · lintool closed 2 months ago

lintool commented 2 months ago

The recent upgrade to DJL v0.28.0 (#2529) caused a number of score differences related to the underlying use of HuggingFace tokenizers.

This is documented in code, but extracted here for visibility:

    // Note upgrading from djl v0.21.0 to v0.28.0 (June 2024)
    //
    // In theory, since we're just tokenizing, we shouldn't be constrained by the modelMaxLength.
    // Previously, at v0.21.0, we were able to tokenize arbitrarily long sequences.
    // However, the implementation seems to have changed.
    //
    // As of the v0.28.0 upgrade, if we put a large value, we get the warning:
    // "maxLength is greater then (sic) modelMaxLength, change to: 512"
    //
    // On the other hand, if we don't set this value, we get the warning:
    // "maxLength is not explicitly specified, use modelMaxLength: 512".
    //
    // In other words, the implementation forces truncation, even for our IR application, i.e., it
    // forces FirstP retrieval.
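
For concreteness, here's a minimal sketch of how the new behavior surfaces through DJL's `HuggingFaceTokenizer` builder (the choice of bert-base-uncased and the input text are illustrative, not what Anserini actually runs):

    import ai.djl.huggingface.tokenizers.Encoding;
    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

    public class MaxLengthDemo {
      public static void main(String[] args) throws Exception {
        // Ask for an effectively unbounded maxLength. At v0.21.0 this tokenized
        // arbitrarily long sequences; at v0.28.0 it instead logs
        // "maxLength is greater then modelMaxLength, change to: 512"
        // and clamps to the model's 512-token limit.
        HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
            .optTokenizerName("bert-base-uncased")  // illustrative model choice
            .optMaxLength(100_000)
            .build();

        String longDoc = "lorem ipsum dolor ".repeat(2_000);
        Encoding encoding = tokenizer.encode(longDoc);

        // Expected to print 512 under v0.28.0, i.e., forced truncation (FirstP).
        System.out.println(encoding.getTokens().length);
      }
    }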
codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 67.17%. Comparing base (98e4866) to head (f5d624f).

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##             master    #2535      +/-   ##
============================================
+ Coverage     67.14%   67.17%   +0.02%
- Complexity     1479     1481       +2
============================================
  Files           219      219
  Lines         12641    12643       +2
  Branches       1528     1528
============================================
+ Hits           8488     8493       +5
+ Misses         3625     3624       -1
+ Partials        528      526       -2
```


theyorubayesian commented 2 months ago

This is interesting. Does this happen both for tokenizers trained & stored locally and for tokenizers attached to models on HF?

lintool commented 2 months ago

> This is interesting. Does this happen both for tokenizers trained & stored locally and for tokenizers attached to models on HF?

Tagging @ToluClassics for thoughts.

Not sure. This PR contains all the scores that changed; everything else was unaffected.

theyorubayesian commented 2 months ago

I took a look and tested it with the GPT-2 tokenizer. The DJL implementation doesn't use the modelMaxLength, even if it is set in the tokenizer_config.json.
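
For reference, a rough sketch of that kind of check (the local path is hypothetical; it assumes a GPT-2 tokenizer saved with `model_max_length` set in its `tokenizer_config.json`):

    import ai.djl.huggingface.tokenizers.Encoding;
    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

    import java.nio.file.Paths;

    public class LocalTokenizerCheck {
      public static void main(String[] args) throws Exception {
        // Hypothetical local directory containing tokenizer.json and a
        // tokenizer_config.json that sets "model_max_length": 1024.
        HuggingFaceTokenizer tokenizer =
            HuggingFaceTokenizer.newInstance(Paths.get("/path/to/gpt2-tokenizer"));

        Encoding encoding = tokenizer.encode("some long input ".repeat(500));

        // If modelMaxLength from tokenizer_config.json were honored, this would
        // be capped at 1024; per the observation above, it is not.
        System.out.println(encoding.getTokens().length);
      }
    }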

lintool commented 2 months ago

Superseded by #2536, which is the better solution.