How to retrieve sample_infos from multiple assembly_ids at once

HushKuo commented 2 weeks ago

Hi MGnify team,

Thanks a lot for your work!!!

But I have run into a problem during my data processing. I only need a mapping file that links assembly_ids to their sample_infos.

Specifically, I have a list of assembly_ids (prefixed with 'ERZ', e.g., ERZ1749741), and I need to trace them back to the corresponding sample_infos (prefixed with 'ERS', e.g., ERS487899). But I can't find a way to process all the accession numbers in my assembly_ids list at once.

Do you have any good scripts or ideas that can help me solve this problem?

Thank you very much and look forward to your reply!!!

SandyRogers commented 2 weeks ago

@HushKuo thanks for your question.

If you have a large number of assemblies to look up, you'll probably get a faster response for this kind of query from the ENA API. E.g.:

curl https://www.ebi.ac.uk/ena/portal/api/search\?result\=analysis\&dataPortal\=metagenome\&format\=tsv\&query\="analysis_accession=ERZ1749741"\&fields\=sample_accession
sample_accession    analysis_accession
SAMEA2619376    ERZ1749741

(ENA's portal API tends to have faster indexes for this kind of lookup, and MGnify's underlying data model is just an extension of ENA's. ENA allows quite a lot of requests per second, so you can iterate through your list and call this endpoint to build up your complete mapping.)

If you happen to know study IDs, you could also look up this mapping in fewer queries, e.g.

curl https://www.ebi.ac.uk/ena/portal/api/search\?result\=analysis\&dataPortal\=metagenome\&format\=tsv\&query\="study_accession=PRJEB22493"\&fields\=sample_accession
sample_accession    analysis_accession
SAMEA2619376    ERZ1749741
SAMEA2619879    ERZ835454
SAMEA2623826    ERZ843003
SAMEA2591122    ERZ829058
....
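
If the study-level table covers more assemblies than you need, it can be filtered against your own list locally. This is a minimal sketch, assuming the response was saved as a tab-separated file with the analysis accession in the second column (as in the output shown above) and that your list file has one accession per line. `filter_by_accessions` is a hypothetical helper name, and the column index should be adjusted if your response is laid out differently.

```shell
# Sketch only: filter_by_accessions is a hypothetical helper, not an ENA or MGnify tool.
# Keep the header plus every row of a study-level TSV whose analysis accession
# (assumed to be column 2) appears in a list file.
# Usage: filter_by_accessions ids.txt study.tsv
filter_by_accessions() {
  awk -F'\t' '
    NR == FNR { if ($1 != "") wanted[$1] = 1; next }  # first file: remember wanted accessions
    FNR == 1 || ($2 in wanted)                        # second file: print header and matches
  ' "$1" "$2"
}
```

For example, saving the study query above to `study.tsv` and running `filter_by_accessions ids.txt study.tsv > mapping.tsv` would leave only the rows for the accessions listed in `ids.txt`.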

HushKuo commented 2 weeks ago

@SandyRogers thank you for your fast reply!

Can I use a list file to query the ENA API?

I only have a large number of assembly_ids (probably a few thousand 'ERZXXXXX' accessions, with unknown study IDs) from a completed upstream analysis (blastp against the protein DB, mapping each 'MGYPXXX' to an 'ERZXXXXX'). Now I want to use these 'ERZXXXXX' accessions to trace the matching sample_ids (e.g., ERS487899), and then output the sample information (e.g., longitude, latitude, sample description, etc.). But I don't know how to do this efficiently and in batches.

Thank you for your attention and look forward to your reply!

SandyRogers commented 2 weeks ago

@HushKuo if it is only a few thousand, you can just iterate and call the ENA API endpoint individually for each one. The rate limit on the ENA API is 50 requests per second, so if you iterate through your list one at a time you shouldn't hit this limit, and you should get through your list in a matter of minutes. See the ENA API docs.
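
A minimal sketch of such a loop, assuming a plain-text file with one ERZ accession per line. The function names (`ena_url`, `map_assemblies`) and the 0.1 s pause between requests are illustrative choices, not part of any ENA or MGnify tooling. The URL is the same one used in the curl example earlier in this thread.

```shell
# Sketch only: ena_url and map_assemblies are hypothetical helper names.

# Build the ENA portal API search URL for a single analysis (ERZ) accession.
ena_url() {
  printf '%s?result=analysis&dataPortal=metagenome&format=tsv&query=analysis_accession=%s&fields=sample_accession\n' \
    "https://www.ebi.ac.uk/ena/portal/api/search" "$1"
}

# Read accessions from a file (one per line) and print the data rows returned
# for each one. The header row of every response is dropped, so the output can
# be redirected into a single mapping file.
map_assemblies() {
  while IFS= read -r acc || [ -n "$acc" ]; do
    acc=$(printf '%s' "$acc" | tr -d '[:space:]')  # strip stray whitespace or CR
    [ -z "$acc" ] && continue                      # skip blank lines
    if out=$(curl -sf "$(ena_url "$acc")"); then
      printf '%s\n' "$out" | tail -n +2            # drop the TSV header row
    else
      echo "lookup failed: $acc" >&2               # report and carry on
    fi
    sleep 0.1  # assumed pause; one request at a time stays well under 50 req/s
  done < "$1"
}
```

Running `map_assemblies assembly_ids.txt > mapping.tsv` would then collect one row per assembly. Failed lookups are reported on stderr rather than stopping the run, so they can be retried afterwards.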

HushKuo commented 2 weeks ago

@SandyRogers Thank you, I will give it a try.

EBI-Metagenomics / emgapi

How to retrieve sample_infos from multiple assembly_ids at once #367