Protein data - Githubissues

According to https://www.biorxiv.org/content/10.1101/2024.06.06.597716v1.full.pdf: "Using the UniParc database with 250 million protein sequences, research on ESM [72] shows that the datasets UR50/S and UR50/D, with 45M and 65M unique sequences respectively, outperform Uniref100 in perplexity (PPL) on a ~670M parameter MLM model."
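
For reference, the perplexity (PPL) in that quote is the exponential of the average negative log-likelihood over the scored tokens (for an MLM, the masked positions). Here is a minimal sketch, assuming you already have per-token log-probabilities from some model. The function name and the uniform 20-amino-acid example are hypothetical illustrations, not taken from the paper or from ESM's code:

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean log-probability) over the scored tokens.

    token_log_probs: natural-log probabilities the model assigned to the
    true token at each scored (e.g. masked) position.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical baseline: a model that guesses uniformly over the 20
# standard amino acids assigns probability 1/20 to every residue.
uniform = [math.log(1 / 20)] * 8
print(perplexity(uniform))
```

The uniform baseline gives a perplexity of about 20, so lower values mean the model is narrowing down the plausible residues at each position. That is the sense in which one training set "outperforms" another in the quote.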

If you take a look at Figure 1 from that paper, it shows quite significant diminishing returns from using data beyond UniRef50. The paper notes later that UniRef90/50 are basically the best. This is interesting for training sparser models.

It also implies that metagenomic data isn't actually that important. Fascinating.

Koeng101 / dnadesign

Protein data #79