KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Consider interop with huggingface's tokenizers #48

Closed: dginev closed this issue 3 years ago

dginev commented 4 years ago

Huggingface maintains a Rust tokenization library compatible with their language model pipelines. It would be worth investigating how to interoperate with that experimental flow, and to see whether I can leverage their approach for a Python wrapper around the llamapun abstractions as well.

dginev commented 3 years ago

The right place for huggingface's tokenizers would be after we do our own math-aware preprocessing; they wouldn't really play any part in serializing a "token model" plain-text file, which is the current endpoint of using llamapun. Once that plain text is read in by a specific modeling framework, it needs to be retokenized per that model's requirements (e.g. about 2 million distinct tokens if one uses GloVe/word2vec over arXiv, but only about 30 thousand tokens with subword tokenization). So huggingface's tokenization is probably a step to apply after preprocessing with llamapun is done.
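For illustration, a minimal Rust sketch of that two-stage flow, assuming the `tokenizers` crate and a pretrained subword tokenizer serialized as a `tokenizer.json`; the file names `arxiv_token_model.txt` and `tokenizer.json` are hypothetical placeholders for llamapun's plain-text output and a downstream vocabulary:

```rust
use std::fs;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Plain text emitted by llamapun's math-aware preprocessing
    // (hypothetical file name for the "token model" endpoint).
    let text = fs::read_to_string("arxiv_token_model.txt")?;

    // Load a subword tokenizer (hypothetical tokenizer.json,
    // e.g. a ~30k-entry vocabulary trained elsewhere).
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Retokenize the preprocessed text per the downstream model's needs.
    let encoding = tokenizer.encode(text.as_str(), false)?;
    println!("{} subword tokens", encoding.get_tokens().len());
    Ok(())
}
```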

And as things stand, the huggingface/keras ecosystem (and its competitors) is so convenient that llamapun should really act as a math-aware preprocessing library and leave the actual modeling to something else.