The right place for huggingface's tokenizers would be after we do our own math-aware preprocessing; they wouldn't really play any part in serializing a "token model" plain text file, which is the current endpoint of using llamapun. Once that plain text is read in by a specific modeling framework, it needs to be retokenized to match the model's requirements (e.g. ~2 million distinct tokens if one uses GloVe/word2vec over arXiv, but only ~30 thousand if one uses subword tokenization). So huggingface's tokenization is probably a step to apply after one is done preprocessing via llamapun.
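To illustrate that split, here is a minimal sketch of the downstream step (not part of llamapun itself): it assumes llamapun has already serialized a plain-text token model file, one preprocessed sentence per line, and uses the `tokenizers` crate's `Tokenizer::from_file` / `encode` API to retokenize each line into a subword vocabulary. The file names are placeholders.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Load a pretrained subword tokenizer (e.g. a ~30k-entry BPE/WordPiece
    // vocabulary) from a serialized tokenizer.json; the path is a placeholder.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // llamapun's endpoint: a plain-text "token model" file, one preprocessed
    // sentence per line (the filename is hypothetical).
    let reader = BufReader::new(File::open("arxiv_token_model.txt")?);
    for line in reader.lines() {
        let sentence = line?;
        // Retokenize into the downstream model's subword vocabulary.
        let encoding = tokenizer.encode(sentence.as_str(), false)?;
        println!("{:?}", encoding.get_ids());
    }
    Ok(())
}
```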
And as things stand, the huggingface/keras ecosystem (and some of its competitors) is so convenient that llamapun should really act as a math-aware preprocessing library and leave the actual modeling to something else.
Huggingface maintains a Rust tokenization library (tokenizers) compatible with their language model pipelines. It would be worth investigating how to interoperate with that experimental flow, and also whether their approach to Python bindings could be leveraged for a Python wrapper around the llamapun abstractions.
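As one concrete form of that interop, training a fresh subword vocabulary directly on llamapun's plain-text output could look roughly like the sketch below, adapted from the tokenizers crate's documented training example. The file names and vocabulary size are placeholders, and the exact builder methods may shift between crate versions.

```rust
use tokenizers::models::bpe::{BpeTrainerBuilder, BPE};
use tokenizers::normalizers::unicode::NFC;
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::{AddedToken, Result, TokenizerBuilder};

fn main() -> Result<()> {
    // Subword trainer capped at ~30k entries, in contrast to the ~2 million
    // distinct whole-word tokens a GloVe/word2vec vocabulary needs for arXiv.
    let mut trainer = BpeTrainerBuilder::new()
        .vocab_size(30_000)
        .min_frequency(2)
        .special_tokens(vec![AddedToken::from("[UNK]", true)])
        .build();

    // A byte-level BPE tokenizer; the math-aware normalization has already
    // happened upstream in llamapun's preprocessing.
    let mut tokenizer = TokenizerBuilder::new()
        .with_model(BPE::default())
        .with_normalizer(Some(NFC))
        .with_pre_tokenizer(Some(ByteLevel::default()))
        .with_post_processor(Some(ByteLevel::default()))
        .with_decoder(Some(ByteLevel::default()))
        .build()?;

    // "arxiv_token_model.txt" stands in for the plain-text file llamapun
    // serializes; train on it and persist a reusable tokenizer.json.
    tokenizer
        .train_from_files(&mut trainer, vec!["arxiv_token_model.txt".to_string()])?
        .save("tokenizer.json", false)?;

    Ok(())
}
```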