facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

How to search sequences in bulk? #341

Closed by 1185307269 2 years ago

1185307269 commented 2 years ago

Dear all, I have a question: how can I search sequences in bulk in the web version? Or can I download the data through an API?

tomsercu commented 2 years ago

Unfortunately, we don't support bulk sequence search; you should use mmseqs directly. We'll upload a FASTA file and a pre-computed mmseqs db.
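
For readers who want to see what "use mmseqs directly" could look like, here is a minimal sketch driven from Python. It assumes MMseqs2 is installed and available on PATH as `mmseqs`, and it uses the `easy-search` workflow. The function names and all file names (`queries.fasta`, `target.fasta`, `hits.m8`, `tmp`) are placeholders for illustration, not files from this thread.

```python
import shutil
import subprocess


def build_easy_search_cmd(query_fasta, target, result_m8, tmp_dir="tmp"):
    """Build the argument list for `mmseqs easy-search`.

    All paths are placeholders supplied by the caller. `target` may be a
    FASTA file or an existing MMseqs2 database.
    """
    return ["mmseqs", "easy-search", query_fasta, target, result_m8, tmp_dir]


def run_bulk_search(query_fasta, target, result_m8, tmp_dir="tmp"):
    """Search every sequence in query_fasta against target in one call."""
    if shutil.which("mmseqs") is None:
        raise RuntimeError("mmseqs not found on PATH; install MMseqs2 first")
    cmd = build_easy_search_cmd(query_fasta, target, result_m8, tmp_dir)
    # check=True raises CalledProcessError if mmseqs exits with an error.
    subprocess.run(cmd, check=True)
    return result_m8
```

A call such as `run_bulk_search("queries.fasta", "target.fasta", "hits.m8")` would put all query sequences in one multi-record FASTA file, which is what makes the search "bulk".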

1185307269 commented 2 years ago

Thank you very much for your reply! Where can I download the fasta file and the pre-computed mmseqs db?

ebetica commented 2 years ago

I uploaded the high quality fasta here:

s3://dl.fbaipublicfiles.com/esmatlas/v0/highquality_clust30/highquality_clust30.fasta

And mgnify90 here:

s3://dl.fbaipublicfiles.com/esmatlas/v0/full/mgnify90.fasta
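
The `s3://` paths above can be fetched with any S3 client. As a hedged alternative, the bucket name is also used as an HTTPS host elsewhere in this thread (`https://dl.fbaipublicfiles.com/...`), so the sketch below simply rewrites the scheme. That these two files are served at the rewritten URLs is an assumption, and `s3_to_https` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def s3_to_https(s3_url):
    """Rewrite s3://bucket/key to https://bucket/key.

    Assumes the bucket name doubles as a public HTTPS host, which is not
    guaranteed for every bucket.
    """
    parts = urlsplit(s3_url)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {s3_url}")
    return f"https://{parts.netloc}{parts.path}"


urls = [
    "s3://dl.fbaipublicfiles.com/esmatlas/v0/highquality_clust30/highquality_clust30.fasta",
    "s3://dl.fbaipublicfiles.com/esmatlas/v0/full/mgnify90.fasta",
]
for url in urls:
    print(s3_to_https(url))
```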

1185307269 commented 2 years ago

Thank you very much for your reply! But I have one more stupid question to ask: what is the difference between these two databases? If I have some sequences to search, which database should I use?

ebetica commented 2 years ago

The high-quality set is a subset of mgnify90 that is redundancy-reduced and well predicted by ESMFold (pTM and pLDDT both > 0.7).
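
To make the selection rule concrete, here is a sketch of the confidence filter only; it does not perform the redundancy reduction. The record layout and the key names `id`, `ptm`, and `plddt` are hypothetical, the example values are made up, and pLDDT is assumed to be on a 0 to 1 scale to match the 0.7 threshold quoted above.

```python
def is_high_quality(record, threshold=0.7):
    """Return True when both pTM and pLDDT exceed the threshold."""
    return record["ptm"] > threshold and record["plddt"] > threshold


# Made-up example records, not real Atlas entries.
records = [
    {"id": "a", "ptm": 0.82, "plddt": 0.91},
    {"id": "b", "ptm": 0.65, "plddt": 0.88},  # pTM too low
    {"id": "c", "ptm": 0.75, "plddt": 0.70},  # pLDDT not strictly above 0.7
]

kept = [r["id"] for r in records if is_high_quality(r)]
print(kept)
```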

tomsercu commented 1 year ago

We now provide https://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta with 617051007 records precisely matching stats.parquet.

See #366 for more context
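
One way to check a downloaded copy against the record count quoted above is to count FASTA header lines. This is a minimal sketch, assuming the file has been saved locally and uncompressed; `count_fasta_records` and the path `atlas.fasta` used in the note below are placeholders.

```python
def count_fasta_records(path):
    """Count records in a FASTA file by counting lines that start with '>'.

    Reads line by line in binary mode, so the whole file is never held in
    memory at once.
    """
    count = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                count += 1
    return count
```

If the download is complete, `count_fasta_records("atlas.fasta")` should equal 617051007. Expect this to take a while on a file of that size.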