dianna-ai / dianna

Deep Insight And Neural Network Analysis
https://dianna.readthedocs.io
Apache License 2.0

Investigate speeding up the movie review model #775

Open loostrum opened 1 month ago

loostrum commented 1 month ago

In PR #773 some changes were made to the movie reviews model runner. We apparently had two versions:

The second version is faster but fails on some inputs containing special characters, so we're using the first for now. It would be good to fix the second version and use that instead.

The errors occur because all inputs given to the model in one go (after masking and tokenization) need to have the same length. Some combinations of special characters result in a different number of tokens, and hence a crash. The differing number of tokens comes from the tokenizer; this is best illustrated by running the tokenizer on a few (masked) inputs containing special characters. I'm not sure we can completely fix it at the tokenizer level.
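To make the length mismatch concrete, here is a minimal sketch using toy tokenizers. It is not the tokenizer used by the movie review model: the whitespace split standing in for the explainer, the regex standing in for the model tokenizer, and the `UNKWORDZ` placeholder are all assumptions for illustration. The point is only that masking a "word" that the model tokenizer would split into several tokens changes the token count.

```python
import re

MASK = "UNKWORDZ"  # hypothetical mask placeholder, assumed for this sketch


def explainer_tokens(text):
    """Toy stand-in for the explainer's tokenizer: split on whitespace."""
    return text.split()


def model_tokens(text):
    """Toy stand-in for the model's tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def masked_variants(text, mask=MASK):
    """Return the original text plus one variant per masked whitespace-separated word."""
    words = explainer_tokens(text)
    variants = [text]
    for i in range(len(words)):
        variants.append(" ".join(words[:i] + [mask] + words[i + 1:]))
    return variants


def token_counts(text):
    """Number of model tokens for the original text and for each masked variant."""
    return [len(model_tokens(variant)) for variant in masked_variants(text)]


if __name__ == "__main__":
    for variant in masked_variants("it's great :-)"):
        print(len(model_tokens(variant)), model_tokens(variant))
```

With these toy tokenizers, masking `it's` or `:-)` replaces several model tokens with a single one, so the variants can no longer be stacked into one rectangular batch.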

One solution might be to pad all inputs to the length of the longest one. There is a special padding token that can be used for this. The question is how this affects the output, i.e., is running the model sentence-by-sentence the same as running all sentences in one go with added padding? This should be investigated before it is implemented in DIANNA. A similar padding function was used during model training, see here and here.
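A rough sketch of how the padding idea and the equivalence check could look is below. Everything here is hypothetical: `PAD_ID`, `pad_batch`, `padding_changes_output`, and the two toy "models" are invented for illustration and do not reflect the actual movie review model or its padding token. The check simply compares running each sequence on its own against running the padded batch in one go.

```python
import math

PAD_ID = 0  # hypothetical id of the padding token, assumed for this sketch


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every token-id sequence to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [list(seq) + [pad_id] * (longest - len(seq)) for seq in batch]


def padding_changes_output(model, batch, pad_id=PAD_ID, tol=1e-6):
    """Return True if the padded batched run differs from the one-by-one runs.

    `model` is any callable mapping a list of equal-length sequences
    to a list of floats (one score per sequence).
    """
    one_by_one = [model([list(seq)])[0] for seq in batch]
    batched = model(pad_batch(batch, pad_id))
    return not all(
        math.isclose(a, b, abs_tol=tol) for a, b in zip(one_by_one, batched)
    )


def mean_model(batch):
    """Toy model that averages over all positions, padding included."""
    return [sum(seq) / len(seq) for seq in batch]


def pad_aware_model(batch):
    """Toy model that ignores padding positions when averaging."""
    return [sum(seq) / sum(1 for tok in seq if tok != PAD_ID) for seq in batch]


if __name__ == "__main__":
    batch = [[3, 5, 7], [4, 6]]
    print(padding_changes_output(mean_model, batch))
    print(padding_changes_output(pad_aware_model, batch))
```

The two toy models show both possible outcomes: a model that treats padding like any other token gives different scores for padded inputs, while one that ignores padding positions does not. Which case applies to the real model is exactly what would need to be checked.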

Note that the implementation lives in several locations: the LIME text tutorial, the RISE text tutorial, _movie_model.py in the dashboard, and tests/utils.py.

elboyran commented 1 month ago

I can confirm that DIANNA can now handle special characters, but the text tutorials run very slowly.