facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Average length of pretraining protein sequences #173

Closed Yijia-Xiao closed 2 years ago

Yijia-Xiao commented 2 years ago

Hi! Thanks for the great work. I have a question regarding the MSA Transformer. The paper provides the distribution of MSA depths for the training set; however, the length distribution of the protein sequences is not given.

So I am wondering whether it would be possible to share some statistics on the length distribution.

Thanks!

Best, Yijia Xiao

tomsercu commented 2 years ago

The lengths of the proteins can be found straightforwardly in the UniRef50 database, whose sequences were used as the seed sequences to construct the MSAs. We used this version; it is quite outdated now, but I would guess the length statistics haven't changed much: https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2018_03/uniref/
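For anyone landing here who wants the actual numbers: a minimal sketch of how one could compute the length statistics directly from the UniRef50 FASTA file in that release. The function and demo sequences below are illustrative, not part of the ESM codebase; swap in the downloaded `uniref50.fasta` for the in-memory demo.

```python
from statistics import mean, median

def fasta_lengths(lines):
    """Yield the length of each sequence in FASTA-formatted lines.

    Handles multi-line sequence records; ignores blank lines.
    """
    length = 0
    in_record = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                yield length
            in_record = True
            length = 0
        elif line:
            length += len(line)
    if in_record:
        yield length

# Tiny in-memory demo; for the real statistics, iterate over
# open("uniref50.fasta") from the 2018_03 release instead.
demo = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSH
>seq2 hypothetical protein
MSILV
""".splitlines()

lengths = list(fasta_lengths(demo))
print(lengths)                        # [18, 5]
print(mean(lengths), median(lengths)) # 11.5 11.5
```

Streaming line by line keeps memory flat even for the full multi-gigabyte FASTA, since only one running length is held at a time.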

Yijia-Xiao commented 2 years ago

Got it, thank you :-) @tomsercu