forrestdavis / NLPScholar

Tools for training an NLP Scholar
GNU General Public License v3.0

More robustly handle max sequence length and special tokens #9

Open forrestdavis opened 1 week ago

forrestdavis commented 1 week ago

Issue

Validate that the maximum context length is handled properly.

Motivation

At the moment, for contexts longer than the maximum sequence length of a fixed-length model, the code only partially handles the required special tokens such as [CLS] and [SEP]. We should systematically test a variety of masked and causal models that require beginning-of-sequence and end-of-sequence tokens, and make sure the code handles all of them correctly.
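
As a starting point, here is a minimal sketch (assuming only the `transformers` library; the model names are illustrative examples, not a required test set) of how special-token conventions and maximum lengths differ across tokenizers:

```python
# Minimal sketch: inspect how different tokenizers report their maximum
# sequence length and which special tokens they insert around an input.
# Model names are illustrative examples only.
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "roberta-base", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("A short example.")["input_ids"]
    print(name)
    print("  model_max_length:", tok.model_max_length)
    print("  tokens with specials:", tok.convert_ids_to_tokens(ids))
```

Running this shows the variation the code needs to handle: BERT wraps the input in [CLS] ... [SEP], RoBERTa in <s> ... </s>, and GPT-2 adds no special tokens by default.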

Your contribution

Demonstrate that the current approach works for a variety of models. You should look at the `by_token_predictability` functions in `src/models/hf_causal_model.py` and `src/models/hf_masked_model.py`.
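
One possible shape for the fix, as a hedged sketch rather than the repository's implementation (`truncate_with_specials` is a hypothetical helper): truncate the raw token ids to leave room for the model's special tokens, then let the tokenizer re-add them.

```python
# Hypothetical helper, not the repository's implementation: truncate an
# over-long context while reserving room for required special tokens.
from transformers import AutoTokenizer, PreTrainedTokenizerBase

def truncate_with_specials(text: str, tokenizer: PreTrainedTokenizerBase) -> list[int]:
    # Tokenize without special tokens so we can budget for them explicitly.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Reserve space for whatever this model inserts ([CLS]/[SEP], <s>/</s>, ...).
    budget = tokenizer.model_max_length - tokenizer.num_special_tokens_to_add()
    # Re-add the model's own special tokens around the truncated span.
    return tokenizer.build_inputs_with_special_tokens(ids[:budget])

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = truncate_with_specials("word " * 1000, tok)
assert len(ids) <= tok.model_max_length
# First and last ids should be [CLS] and [SEP] for BERT-style models.
print(tok.convert_ids_to_tokens([ids[0], ids[-1]]))
```

Because `num_special_tokens_to_add` and `build_inputs_with_special_tokens` are implemented per tokenizer, this pattern adapts automatically to models with different special-token conventions, which is the property the validation here should check.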