Investigate speeding up the movie review model

In PR #773 some changes were made to the movie reviews model runner. We apparently had two versions:

1. One that runs the input sentences through the tokenizer and model one-by-one.
2. One that runs all inputs through the tokenizer, then all through the model at once.

The second version is faster, but fails on some special-character inputs so we're using the first now. It would be good if we could fix the second version and use that instead.

The errors occur because all inputs (after masking and tokenization) given to the model in one go need to have the same length. Some combinations of special characters result in a different number of tokens, and hence a crash. The differing token counts come from the tokenizer; this is best illustrated by running the tokenizer on a few (masked) inputs containing special characters. I'm not sure we can completely fix it at the tokenizer level.
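To make the length mismatch easy to spot, something like the following could be run on the tokenizer output. This is only a sketch: `group_by_length` is a hypothetical helper, and the token lists (including the `UNK` mask token) are made-up stand-ins, not actual output of the tokenizer used in DIANNA.

```python
from collections import defaultdict


def group_by_length(token_lists):
    """Map each sequence length to the indices of the inputs that have it."""
    groups = defaultdict(list)
    for index, tokens in enumerate(token_lists):
        groups[len(tokens)].append(index)
    return dict(groups)


# Hypothetical tokenizer output for three masked variants of one sentence.
# The last variant is assumed to gain an extra token from special characters.
tokenized = [
    ["a", "great", "movie", "!"],
    ["a", "UNK", "movie", "!"],
    ["a", "great", "movie", "!", "!"],
]

groups = group_by_length(tokenized)

# A batch can only be fed to the model in one go if all lengths are equal.
is_rectangular = len(groups) == 1
print(groups, is_rectangular)
```

Running this on real masked inputs would show which inputs deviate in length and by how much.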

One solution might be to pad all inputs to the length of the longest one. There is a special padding token that can be used for this. The question is how this affects the output, i.e., is running the model sentence-by-sentence equivalent to running all sentences in one go with added padding? This should be investigated before it is implemented in DIANNA. A similar padding function was used during model training, see here and here.
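A minimal sketch of the padding idea and of the equivalence check it calls for. Everything here is an assumption for illustration: `PAD_ID`, `pad_batch` and `toy_model` are hypothetical, and the toy model (a plain mean over token ids) is not the movie review model. It only shows that padding can change the output unless the model ignores the padding token.

```python
PAD_ID = 0  # hypothetical id of the padding token


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every token-id list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def toy_model(token_ids, ignore_pad=False, pad_id=PAD_ID):
    """Stand-in for the real model: the mean of the token ids."""
    if ignore_pad:
        token_ids = [t for t in token_ids if t != pad_id]
    return sum(token_ids) / len(token_ids)


sequences = [[5, 7, 9], [4, 8], [6]]
padded = pad_batch(sequences)

# Reference: run each unpadded sentence separately.
one_by_one = [toy_model(seq) for seq in sequences]

# Padded inputs, with and without ignoring the padding token.
batched_naive = [toy_model(seq) for seq in padded]
batched_masked = [toy_model(seq, ignore_pad=True) for seq in padded]

print(one_by_one, batched_naive, batched_masked)
```

For the actual model, the same comparison could be made between the current one-by-one outputs and the outputs on padded batches, using a small tolerance for floating-point differences.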

Note that the implementation lives in several locations: the LIME text tutorial, the RISE text tutorial, _movie_model.py in the dashboard, and tests/utils.py.

dianna-ai/dianna, issue #775