facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

ESMAtlas api access: Rate limit? #375

Closed fteufel closed 1 year ago

fteufel commented 2 years ago

Hi,

I would like to automatically retrieve structures from ESMAtlas. Basically, I have a list of MGnify IDs for which I want to retrieve the structures.

I do that using the following:

import os
import requests

def get_esm_pdb_file(mgnify_id, out_dir):
    # Fetch the predicted structure for one MGnify ID from the ESM Atlas API.
    url = f'https://api.esmatlas.com/fetchPredictedStructure/{mgnify_id}.pdb'
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        # Save the PDB text to <out_dir>/<mgnify_id>.pdb
        pdb_file = os.path.join(out_dir, f'{mgnify_id}.pdb')
        with open(pdb_file, 'w') as f:
            f.write(response.text)
    else:
        # Log the ID, status code, and response body for failed requests.
        print(mgnify_id, response.status_code)
        print(response.text)

To speed this up, I parallelize the calls over my ID list using a process pool (a sketch follows the error output below). For some IDs I then get the following:

MGYP000547115894 403
{"message":"Forbidden"}
MGYP000526977164 403
{"message":"Forbidden"}
MGYP003555372774 403
{"message":"Forbidden"}

Could you provide some details on what rate limiting, if any, is in place for querying ESMAtlas, or whether there is another way to retrieve structure files?

Thanks!

nikita-smetanin commented 2 years ago

Hi @fteufel, please refer to https://github.com/facebookresearch/esm/blob/main/scripts/atlas/README.md for bulk download of structures. These APIs are intended for infrequent requests and are rate limited accordingly.

fteufel commented 2 years ago

Hi,

So according to the README, it's either download the full DB or be rate limited?

tomsercu commented 1 year ago

Hi @fteufel, correct, we set it up to be exactly that: a small volume of individual queries (typically used to display in the webapp), or bulk download. The reason is that the DB consists of 617M small files, which is suboptimal from a systems perspective in almost all cases (617M S3 queries; a typical HPC cluster will also break if you try to write that many small files to NFS). We didn't really envision a useful way to query subsets of the database other than searching against the DB in bulk.

But as I'm typing this, I realize there is an obvious use case for downloading a subset: run a sequence search, then download the structures for the IDs returned by the search. Is your use case something like this?

fteufel commented 1 year ago

Hi @tomsercu, that's exactly what happened: I had a search result and wanted to download the structures for all of the hits.

Note that the AlphaFold DB doesn't support batched requests either, as far as I know, but its rate limit seems more generous, so I never had any issues with this use case there.

tomsercu commented 1 year ago

This makes a lot of sense; we'll increase the rate limits for all GET requests that just fetch data from the Atlas.

vprobon commented 1 year ago

How about rate limits for predictions of relatively small datasets of short proteins? In our case it is about 2500 protein sequences of 25 amino acids each.

Would you prefer to set a daily limit, or should we rather use sleep() to send small batches of requests at regular time intervals?

Many thanks, Vasilis

tomsercu commented 1 year ago

> 2500 protein sequences of 25 amino acids each.

@vprobon We don't have control at a granular enough level to change rate limits per sequence length. All bulk folding should be done by running the code directly; e.g., a good start is the ColabFold or Hugging Face notebooks, which are easily scripted to iterate over a set of proteins (a sketch of such a loop is shown below).
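
For illustration, a minimal sketch of that kind of scripted loop, along the lines of the folding example in the main README of this repo (the sequence dict and output file names are placeholders, and a GPU with enough memory for esmfold_v1 is assumed):

import torch
import esm

# Load ESMFold once, then iterate over the sequences to fold.
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Placeholder inputs; in practice, read the ~2500 sequences from a FASTA file.
sequences = {
    'seq_0001': 'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
}

for name, seq in sequences.items():
    with torch.no_grad():
        pdb_string = model.infer_pdb(seq)
    with open(f'{name}.pdb', 'w') as f:
        f.write(pdb_string)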

vprobon commented 1 year ago

@tomsercu Thank you!

nikita-smetanin commented 1 year ago

Hi @fteufel, we've just updated the rate limiter to allow up to 100 requests per second (rps) per user for the /fetch* endpoints. I hope that covers your case.

fteufel commented 1 year ago

Hi @nikitos9000, I just ran the same code again, using one search result for which to download all structures. All that changed for me is that the output now looks like

MGYP003385626744 429
Too Many Requests
MGYP003385626744 429
Too Many Requests

Is there anything I should take care of on my side to avoid this? I'm running 20 workers in parallel and expect to hit the limit on some calls, since I have no checks in place to avoid sending more than 100 requests per second. But still, the total speed (with retrying) seems about the same as before right now; a sketch of the kind of retry I use is below.
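
For context, the retry is roughly of this form (a minimal sketch, not my exact code; the retry count and backoff delays are arbitrary):

import time
import requests

def fetch_with_retry(mgnify_id, max_retries=5, base_delay=1.0):
    # Retry on HTTP 429 (Too Many Requests), waiting longer after each attempt.
    url = f'https://api.esmatlas.com/fetchPredictedStructure/{mgnify_id}.pdb'
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        time.sleep(base_delay * (2 ** attempt))
    return response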

fteufel commented 1 year ago

Never mind, I ran a few more test cases and I'm sure it goes faster now.