This is far more complicated than expected, because the tokenizer replaces many characters with other characters. Examples for the SoLU model include:

- `' '` -> `'Ġ'`
- `'\n'` -> `'Ċ'`
- `'\t'` -> `'ĉ'`
- `'\r'` -> `'č'`
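These substitutions match the standard GPT-2-style byte-to-unicode mapping used by byte-level BPE tokenizers (every byte is mapped to a visible character, so whitespace bytes get shifted past 255). A minimal sketch of that mapping and its inverse, which could be used to recover the original characters from raw token strings, assuming the SoLU tokenizer follows this convention:

```python
# Sketch: the GPT-2-style byte-to-unicode mapping that produces the
# substitutions listed above ('Ġ' for space, 'Ċ' for '\n', etc.),
# plus its inverse for recovering the original text.
def bytes_to_unicode() -> dict[int, str]:
    # Printable bytes map to themselves; all remaining bytes are shifted
    # past 255 so that every byte gets a visible, non-whitespace character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


unicode_to_bytes = {c: b for b, c in bytes_to_unicode().items()}


def token_str_to_text(token_str: str) -> str:
    """Map a raw token string (e.g. 'Ġthe') back to the original text (' the')."""
    return bytes(unicode_to_bytes[ch] for ch in token_str).decode("utf-8", errors="replace")


assert token_str_to_text("Ġthe") == " the"
assert token_str_to_text("Ċ") == "\n"
```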
A small experiment on 100 neuroscope neuron pages showed that it would save roughly 17% of the space required to store the texts.
Further experiments showed that storing activations as 16-bit floats instead of 32-bit floats would save an additional 11 percentage points. Storing token IDs as 16-bit integers instead of 32-bit integers had no effect, likely because the compression algorithm already removes the trailing zero bytes effectively.
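A sketch of how this kind of comparison could be run, assuming activations are stored as numpy arrays and gzip is the compressor (the actual neuroscope storage format and compression algorithm may differ, and the arrays here are dummy data):

```python
import gzip

import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal((1024, 512)).astype(np.float32)  # dummy data
token_ids = rng.integers(0, 48_000, size=1024, dtype=np.int32)     # dummy data


def compressed_size(arr: np.ndarray) -> int:
    """Size in bytes of the gzip-compressed raw buffer."""
    return len(gzip.compress(arr.tobytes()))


print("activations float32:", compressed_size(activations))
print("activations float16:", compressed_size(activations.astype(np.float16)))
# Narrowing token IDs changes little: the high-order zero bytes of small
# 32-bit integers compress away almost entirely on their own.
print("token ids int32:", compressed_size(token_ids))
print("token ids int16:", compressed_size(token_ids.astype(np.int16)))
```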
This may not be worth doing, but it is worth looking into if storage space becomes a major issue.