alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License
528 stars 20 forks

Add a Python test and installation guide #3

Closed (kerighan closed this 1 year ago)

kerighan commented 1 year ago

Thanks for your work. Consider adding some Python tests and uploading a package to PyPI that works out of the box. This is crucial for potential adoption. Adding some Cython or pybind11 bindings may also make the code run faster.

alasdairforsythe commented 1 year ago

Thanks for your advice. I've finished the ungreedy version and I'm going to leave it running for a couple of days to build the vocabs. Then I'll update the JavaScript test to work with it. Unfortunately that'll be a lot of work, because it means writing a JavaScript version of capcode, which turned out to be fairly complicated. After that I'll do the tokenizing, detokenizing, and capcode encoding and decoding in Python. The actual training will stay as command-line executables written in Go. It would be too slow in Python, and it relies on a 10,000-line data structure I'd previously written in Go, so there's not much chance the training will ever come out of Go.