This is far more complicated than expected, because the tokenizer replaces many characters with other characters. Examples for the SoLU model include:

- `' '` -> `'Ġ'`
- `'\n'` -> `'Ċ'`
- `'\t'` -> `'ĉ'`
- `'\r'` -> `'č'`
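These substitutions match the standard GPT-2-style byte-to-unicode mapping used by byte-level BPE tokenizers (every byte is mapped to a visible character, so whitespace bytes get shifted past 255). A minimal sketch of that mapping and its inverse, which could be used to recover the original characters from raw token strings, assuming the SoLU tokenizer follows this convention:

```python
# Sketch: the GPT-2-style byte-to-unicode mapping that produces the
# substitutions listed above ('Ġ' for space, 'Ċ' for '\n', etc.),
# plus its inverse for recovering the original text.
def bytes_to_unicode() -> dict[int, str]:
    # Printable bytes map to themselves; all remaining bytes are shifted
    # past 255 so that every byte gets a visible, non-whitespace character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


unicode_to_bytes = {c: b for b, c in bytes_to_unicode().items()}


def token_str_to_text(token_str: str) -> str:
    """Map a raw token string (e.g. 'Ġthe') back to the original text (' the')."""
    return bytes(unicode_to_bytes[ch] for ch in token_str).decode("utf-8", errors="replace")


assert token_str_to_text("Ġthe") == " the"
assert token_str_to_text("Ċ") == "\n"
```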
A small experiment on 100 neuroscope neuron pages showed that it would save roughly 17% of the space required to store the texts.
Further experiments showed that storing activations as 16-bit floats instead of 32-bit floats would save an additional 11 percentage points. Storing token IDs as 16-bit integers instead of 32-bit integers had no effect, likely because the compression algorithm already removes the trailing zero bytes effectively.
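A sketch of how this kind of comparison could be run, assuming activations are stored as numpy arrays and gzip is the compressor (the actual neuroscope storage format and compression algorithm may differ, and the arrays here are dummy data):

```python
import gzip

import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal((1024, 512)).astype(np.float32)  # dummy data
token_ids = rng.integers(0, 48_000, size=1024, dtype=np.int32)     # dummy data


def compressed_size(arr: np.ndarray) -> int:
    """Size in bytes of the gzip-compressed raw buffer."""
    return len(gzip.compress(arr.tobytes()))


print("activations float32:", compressed_size(activations))
print("activations float16:", compressed_size(activations.astype(np.float16)))
# Narrowing token IDs changes little: the high-order zero bytes of small
# 32-bit integers compress away almost entirely on their own.
print("token ids int32:", compressed_size(token_ids))
print("token ids int16:", compressed_size(token_ids.astype(np.int16)))
```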
This may not be worth doing, but it is worth looking into if storage space becomes a major issue.