Koeng101 / dnadesign

A Go package for designing DNA.

Protein data #79

Closed Koeng101 closed 2 months ago

Koeng101 commented 2 months ago

According to https://www.biorxiv.org/content/10.1101/2024.06.06.597716v1.full.pdf: "Using the UniParc database with 250 million protein sequences, research on ESM [72] shows that the datasets UR50/S and UR50/D, with 45M and 65M unique sequences respectively, outperform Uniref100 in perplexity (PPL) on a ~670M parameter MLM model."

If you take a look at figure 1 from that paper, they basically show that there are significant diminishing returns from training on anything beyond UniRef50. It notes later that UniRef90/50 are basically the best. This is interesting for training sparser models.

It also implies that metagenomic data isn't actually that important. Fascinating.

Koeng101 commented 2 months ago

Specifically, I'm thinking about how to distribute tokenized protein data. But from this, it doesn't look like that is necessary. So I'm closing.