Add tokenizer

Koeng101 commented 4 months ago

This PR adds a tokenizer to the dnadesign library. It is primarily for tokenizing amino acid sequences for consumption by an LLM, in particular llm.c.
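To make the idea concrete, here is a minimal sketch of an amino acid tokenizer. The vocabulary layout (token 0 for end-of-sequence, tokens 1 to 20 for the standard residues) and the names `Tokenize`, `alphabet`, and `EOS` are assumptions for illustration, not necessarily what this PR implements.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical vocabulary: token 0 is end-of-sequence and tokens 1..20 are
// the 20 standard amino acids in alphabetical order of their one-letter
// codes. The tokenizer in this PR may use a different layout.
const alphabet = "ACDEFGHIKLMNPQRSTVWY"

// EOS marks the end of one protein sequence.
const EOS uint8 = 0

// Tokenize maps each residue to a uint8 token and appends EOS.
// It returns an error for any character outside the 20-letter alphabet.
func Tokenize(seq string) ([]uint8, error) {
	tokens := make([]uint8, 0, len(seq)+1)
	for i := 0; i < len(seq); i++ {
		idx := strings.IndexByte(alphabet, seq[i])
		if idx < 0 {
			return nil, fmt.Errorf("unknown residue %q at position %d", seq[i], i)
		}
		tokens = append(tokens, uint8(idx+1))
	}
	return append(tokens, EOS), nil
}

func main() {
	tokens, err := Tokenize("MKV")
	if err != nil {
		panic(err)
	}
	fmt.Println(tokens)
}
```

A one-byte token is enough here because the vocabulary is tiny, which is what makes the uint8 encoding attractive for large protein datasets.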

Koeng101 commented 3 months ago

I'd like to make the shard writer a little smaller and more specific: it should just receive tokens and write them, maybe as a concurrent process.

I want to be able to encode a Pfam family token in the lead-up to each peptide: [PFAM][AA seq][EOS]. The idea is that you could give the model a Pfam token as a prompt and have it predict the amino acid tokens that follow.
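The [PFAM][AA seq][EOS] layout could be sketched like this. Everything about the ID scheme is an assumption: token 0 is EOS, 1 to 20 are amino acids, and Pfam tokens start at 21. The sketch uses uint16 because there are far more Pfam families than fit in a uint8; `EncodeWithPfam` and the `pfamIndex` map are hypothetical names, and the two accessions are only illustrative entries.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	alphabet = "ACDEFGHIKLMNPQRSTVWY"

	eosToken   uint16 = 0  // assumed end-of-sequence token
	pfamOffset uint16 = 21 // assumed first ID reserved for Pfam tokens
)

// EncodeWithPfam builds one training example laid out as
// [PFAM][AA seq][EOS]. pfamIndex maps a Pfam accession to a dense index.
func EncodeWithPfam(pfamIndex map[string]uint16, accession, seq string) ([]uint16, error) {
	idx, ok := pfamIndex[accession]
	if !ok {
		return nil, fmt.Errorf("unknown Pfam accession %q", accession)
	}
	tokens := make([]uint16, 0, len(seq)+2)
	tokens = append(tokens, pfamOffset+idx)
	for i := 0; i < len(seq); i++ {
		aa := strings.IndexByte(alphabet, seq[i])
		if aa < 0 {
			return nil, fmt.Errorf("unknown residue %q at position %d", seq[i], i)
		}
		tokens = append(tokens, uint16(aa+1))
	}
	return append(tokens, eosToken), nil
}

func main() {
	// Illustrative index; a real one would cover every family in use.
	pfamIndex := map[string]uint16{"PF00001": 0, "PF00002": 1}
	tokens, err := EncodeWithPfam(pfamIndex, "PF00002", "MKV")
	if err != nil {
		panic(err)
	}
	fmt.Println(tokens)
}
```

At inference time, a prompt would then be just the single Pfam token, with the model left to generate residues until it emits EOS.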

Koeng101 commented 3 months ago

According to https://www.biorxiv.org/content/10.1101/2024.06.06.597716v1.full.pdf: "Using the UniParc database with 250 million protein sequences, research on ESM [72] shows that the datasets UR50/S and UR50/D, with 45M and 65M unique sequences respectively, outperform Uniref100 in perplexity (PPL) on a ~670M parameter MLM model."

If you take a look at Figure 1 from that paper, it basically shows that there are quite significant diminishing returns from using anything beyond UniRef50. It notes later that UniRef90/50 are basically the best. This is interesting for training sparser models.

In UniRef90 there are roughly 65B tokens. Encoded as uint8 (one byte per token), that's about 65 GB, or roughly 60 GiB, and I bet I could shave off a little more if I zstd-compressed it.
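The size estimate is just tokens times bytes per token. A quick check of the arithmetic, taking the rough 65B token count from the comment as given (`DatasetSize` is a hypothetical helper):

```go
package main

import "fmt"

// DatasetSize returns the uncompressed size of a token dataset in decimal
// gigabytes (1e9 bytes) and binary gibibytes (2^30 bytes).
func DatasetSize(tokens, bytesPerToken float64) (gb, gib float64) {
	total := tokens * bytesPerToken
	return total / 1e9, total / (1 << 30)
}

func main() {
	// 65e9 is the rough UniRef90 token count quoted above, not a measured value.
	gb, gib := DatasetSize(65e9, 1) // uint8: one byte per token
	fmt.Printf("%.1f GB, %.1f GiB\n", gb, gib)
}
```

The two figures differ only in units, which is why the same dataset can be described as either about 65 GB or about 60 GiB.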

Koeng101 / dnadesign

Add tokenizer #78