Closed fteufel closed 1 year ago
Hi @fteufel please refer to https://github.com/facebookresearch/esm/blob/main/scripts/atlas/README.md for bulk download of structures. These APIs are for infrequent requests and they are rate limited accordingly.
Hi,
so according to the README, it's either download the full DB, or be rate limited?
Hi @fteufel, correct, we set it up to be exactly that: a small volume of individual queries (typically used to display in the webapp), or bulk download. The reason is that the DB is 617M small files, which is suboptimal from a systems perspective in almost all cases (617M S3 queries; your typical HPC cluster will also break if you try to write that many small files to NFS). We didn't envision a useful way to query subsets of the database other than searching against the DB in bulk.
But as I'm typing this, I see an obvious use case for downloading a subset: run a sequence search, then download the structures for the IDs the search returns. Is your use case something like this?
Hi @tomsercu, that's exactly what happened. I had a search result and wanted to download all the hit structures.
Note that the AlphaFold DB doesn't support batched requests either, as far as I know, but the rate limit there seems more generous, so I never had any issues with this use case.
This makes a lot of sense, we'll increase rate limits for all GET requests which just fetch data from the atlas.
How about rate limits for predictions on relatively small datasets of short proteins? In our case, it is about 2500 protein sequences of 25 amino acids each.
Would you prefer to set a daily limit, or should we rather use sleep() to send small batches of requests at regular intervals?
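For reference, the sleep() approach I have in mind is roughly the sketch below; the batch size and pause are placeholders I picked for illustration, not limits documented by the API:

```python
import time

def batches(items, size):
    """Split a list of sequences into fixed-size batches."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def submit_paced(sequences, submit_fn, batch_size=50, pause_s=10.0):
    """Send small batches of requests with a pause in between,
    instead of firing all 2500 at once."""
    results = []
    for batch in batches(sequences, batch_size):
        for seq in batch:
            results.append(submit_fn(seq))  # e.g. one POST per sequence
        time.sleep(pause_s)  # pause between batches to stay under any rate limit
    return results
```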
Many thanks, Vasilis
2500 protein sequences of 25 amino acids each.
@vprobon We don't have control at such a granular level to change rate limits per sequence length. All bulk folding should be done by running the code directly; a good start is the ColabFold or Hugging Face notebooks, which are easily scripted to iterate over a set of proteins.
@tomsercu Thank you!
Hi @fteufel we've just updated the rate limiter to support up to 100 rps per user for the /fetch* endpoints. I hope that covers your case.
Hi @nikitos9000, I just ran the same code again, using one search result for which to download all structures. All that changed for me is that the output now looks like:
MGYP003385626744 429
Too Many Requests
MGYP003385626744 429
Too Many Requests
Is there anything I should do on my side to avoid this? I'm running 20 workers in parallel, and I expect to hit the limit on some calls because I don't have any checks in place to avoid sending more than 100 requests at a time. Still, the total speed (with retrying) seems the same as before.
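For what it's worth, my retrying is a plain exponential backoff on 429 responses, roughly like this sketch (fetch_fn is a placeholder for a single HTTP call that returns a (status, body) pair):

```python
import time

def fetch_with_retry(fetch_fn, max_retries=5, base_delay_s=1.0):
    """Retry a fetch on HTTP 429, doubling the wait each time.
    fetch_fn must return a (status_code, body) pair."""
    delay = base_delay_s
    for _ in range(max_retries):
        status, body = fetch_fn()
        if status != 429:
            return status, body  # success (or a non-rate-limit error)
        time.sleep(delay)  # back off before retrying
        delay *= 2         # exponential backoff
    return status, body    # give up after max_retries attempts
```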
Nevermind, I ran a few more test cases and I'm sure it goes faster now.
Hi,
I would like to automatically retrieve structures from ESMAtlas. Basically, I have a list of MGnify IDs that I want to retrieve.
I do that using the following:
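A minimal sketch of the retrieval code, assuming the public fetchPredictedStructure endpoint of the ESM Atlas API (adjust the path if it differs):

```python
import urllib.request

# Assumed endpoint for fetching a predicted structure by MGnify ID.
BASE_URL = "https://api.esmatlas.com/fetchPredictedStructure/"

def structure_url(mgnify_id):
    """Build the fetch URL for one MGnify (MGYP...) identifier."""
    return f"{BASE_URL}{mgnify_id}.pdb"

def fetch_structure(mgnify_id, out_path=None):
    """Download one predicted structure as a PDB file."""
    url = structure_url(mgnify_id)
    with urllib.request.urlopen(url) as resp:
        data = resp.read().decode()
    if out_path:
        with open(out_path, "w") as fh:
            fh.write(data)
    return data
```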
To speed it up, I parallelize it over my list using a ProcessPool. For some IDs I then get 429 Too Many Requests errors.
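The parallelization is roughly the sketch below; I've written it here with a thread pool, which behaves the same as a ProcessPool for these I/O-bound downloads, and fetch_fn and the worker count are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(ids, fetch_fn, workers=20):
    """Fetch many structures in parallel. fetch_fn takes one ID and
    returns its result; any exception it raises propagates from map()."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_fn, ids))
```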
Could you provide some details if/what rate limiting is in place for querying ESMAtlas, or if there is another way to retrieve structure files?
Thanks!