The right place for huggingface's tokenizers would be after we do our own math-aware preprocessing; they wouldn't really play any part in serializing a "token model" plain text file, which is the current endpoint of using llamapun. Once that plain text is read in by a specific modeling framework, it needs to be retokenized to match the model's requirements (e.g. ~2 million distinct tokens if one uses GloVe/word2vec over arXiv, but only ~30 thousand if one uses subword tokenization). So huggingface's tokenization is probably a step to apply after one is done preprocessing via llamapun.
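To illustrate that split, here is a minimal sketch of the downstream step (not part of llamapun itself): it assumes llamapun has already serialized a plain-text token model file, one preprocessed sentence per line, and uses the `tokenizers` crate's `Tokenizer::from_file` / `encode` API to retokenize each line into a subword vocabulary. The file names are placeholders.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Load a pretrained subword tokenizer (e.g. a ~30k-entry BPE/WordPiece
    // vocabulary) from a serialized tokenizer.json; the path is a placeholder.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // llamapun's endpoint: a plain-text "token model" file, one preprocessed
    // sentence per line (the filename is hypothetical).
    let reader = BufReader::new(File::open("arxiv_token_model.txt")?);
    for line in reader.lines() {
        let sentence = line?;
        // Retokenize into the downstream model's subword vocabulary.
        let encoding = tokenizer.encode(sentence.as_str(), false)?;
        println!("{:?}", encoding.get_ids());
    }
    Ok(())
}
```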
And as things stand, the huggingface/keras ecosystem (and some of its competitors) is so convenient that llamapun should really act as a math-aware preprocessing library and leave the actual modeling to something else.
Huggingface maintains a Rust tokenization library (tokenizers) compatible with their language model pipelines. It would be worth investigating how to interoperate with that experimental flow, and also whether their approach to Python bindings could be leveraged for a Python wrapper around the llamapun abstractions.
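As one concrete form of that interop, training a fresh subword vocabulary directly on llamapun's plain-text output could look roughly like the sketch below, adapted from the tokenizers crate's documented training example. The file names and vocabulary size are placeholders, and the exact builder methods may shift between crate versions.

```rust
use tokenizers::models::bpe::{BpeTrainerBuilder, BPE};
use tokenizers::normalizers::unicode::NFC;
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::{AddedToken, Result, TokenizerBuilder};

fn main() -> Result<()> {
    // Subword trainer capped at ~30k entries, in contrast to the ~2 million
    // distinct whole-word tokens a GloVe/word2vec vocabulary needs for arXiv.
    let mut trainer = BpeTrainerBuilder::new()
        .vocab_size(30_000)
        .min_frequency(2)
        .special_tokens(vec![AddedToken::from("[UNK]", true)])
        .build();

    // A byte-level BPE tokenizer; the math-aware normalization has already
    // happened upstream in llamapun's preprocessing.
    let mut tokenizer = TokenizerBuilder::new()
        .with_model(BPE::default())
        .with_normalizer(Some(NFC))
        .with_pre_tokenizer(Some(ByteLevel::default()))
        .with_post_processor(Some(ByteLevel::default()))
        .with_decoder(Some(ByteLevel::default()))
        .build()?;

    // "arxiv_token_model.txt" stands in for the plain-text file llamapun
    // serializes; train on it and persist a reusable tokenizer.json.
    tokenizer
        .train_from_files(&mut trainer, vec!["arxiv_token_model.txt".to_string()])?
        .save("tokenizer.json", false)?;

    Ok(())
}
```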