Open HushKuo opened 2 weeks ago
@HushKuo thanks for your question.
If you have a large number of assemblies to look up, you'll probably get a faster response for this kind of query from the ENA API. E.g.:
curl https://www.ebi.ac.uk/ena/portal/api/search\?result\=analysis\&dataPortal\=metagenome\&format\=tsv\&query\="analysis_accession=ERZ1749741"\&fields\=sample_accession
sample_accession analysis_accession
SAMEA2619376 ERZ1749741
(ENA's portal API tends to have faster indexes for this kind of lookup, and MGnify's underlying data model is just an extension of ENA's. ENA allows quite a lot of requests per second, so you can iterate through your list and call this endpoint to build up your complete mapping.)
If you happen to know study IDs, you could also look up this mapping in fewer queries, e.g.
curl https://www.ebi.ac.uk/ena/portal/api/search\?result\=analysis\&dataPortal\=metagenome\&format\=tsv\&query\="study_accession=PRJEB22493"\&fields\=sample_accession
sample_accession analysis_accession
SAMEA2619376 ERZ1749741
SAMEA2619879 ERZ835454
SAMEA2623826 ERZ843003
SAMEA2591122 ERZ829058
....
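As a rough illustration of turning the study-level response above into a lookup table, here is a minimal Python sketch. It assumes the response is tab-separated with the two header columns shown above; the function names `build_study_url` and `parse_mapping` are hypothetical helpers, not part of any ENA or MGnify client.

```python
import csv
import io
import urllib.parse

# Endpoint taken from the curl examples above.
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_study_url(study_accession):
    """Build the same study-level query as the curl example (hypothetical helper)."""
    params = {
        "result": "analysis",
        "dataPortal": "metagenome",
        "format": "tsv",
        "query": f"study_accession={study_accession}",
        "fields": "sample_accession",
    }
    return ENA_SEARCH + "?" + urllib.parse.urlencode(params)


def parse_mapping(tsv_text):
    """Turn the TSV response into {analysis_accession: sample_accession}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["analysis_accession"]: row["sample_accession"] for row in reader}
```

The text passed to `parse_mapping` could be the saved output of the curl command, or the body fetched from `build_study_url("PRJEB22493")` with any HTTP client.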
@SandyRogers thank you for your fast reply!
Can I use a list file to query the ENA API?
I only have a large number of assembly_ids (probably a few thousand 'ERZXXXXX's, with unknown study IDs) from a completed upstream analysis (blastp against the protein DB, mapping each 'MGYPXXX' to an 'ERZXXXXX'). Now I want to use these 'ERZXXXXX's to trace the matching sample_ids (e.g., ERS487899), and then output the sample information (e.g., longitude, latitude, sample description, etc.). But I don't know how to do this efficiently and in batches.
Thank you for your attention. I look forward to your reply!
@HushKuo if it is only a few thousand, you can just iterate and call the ENA API endpoint individually for each one. The rate limit on the ENA API is 50 requests per second, so if you iterate through your list one at a time you shouldn't hit this limit, and you should get through your list in a matter of minutes. See the ENA API docs.
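The one-at-a-time iteration described above could look roughly like the following Python sketch. It reuses the query from the earlier curl example and assumes a tab-separated response with `sample_accession` and `analysis_accession` columns. The names `build_url`, `fetch_text` and `lookup_samples` are hypothetical, and the 0.05 s pause is just an illustrative choice that stays well below 50 requests per second.

```python
import csv
import io
import time
import urllib.parse
import urllib.request

# Endpoint taken from the curl examples earlier in the thread.
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_url(analysis_accession):
    """Build the per-assembly query used in the curl example (hypothetical helper)."""
    params = {
        "result": "analysis",
        "dataPortal": "metagenome",
        "format": "tsv",
        "query": f"analysis_accession={analysis_accession}",
        "fields": "sample_accession",
    }
    return ENA_SEARCH + "?" + urllib.parse.urlencode(params)


def fetch_text(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def lookup_samples(accessions, fetch=fetch_text, delay=0.05):
    """Query each accession in turn; return {analysis_accession: sample_accession or None}."""
    mapping = {}
    for acc in accessions:
        mapping[acc] = None  # stays None if the response has no matching row
        reader = csv.DictReader(io.StringIO(fetch(build_url(acc))), delimiter="\t")
        for row in reader:
            if row.get("analysis_accession") == acc:
                mapping[acc] = row.get("sample_accession")
        time.sleep(delay)  # pause between requests to stay under the rate limit
    return mapping
```

To drive this from a list file, read one accession per line (for example from a hypothetical `assembly_ids.txt`), pass the resulting list to `lookup_samples`, and write the returned dictionary out as a two-column table. The `fetch` argument is there only so the loop can be tried against canned responses before making real requests.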
@SandyRogers Thank you, I will give it a try.
Hi MGnify team,
Thanks a lot for your work!!!
But I have run into a problem during my data processing. I only need a mapping file that links sample_infos to assembly_ids.
Specifically, I have a list of assembly_ids (prefixed with 'ERZ', e.g., ERZ1749741), and I need to trace them back to the corresponding sample_infos (prefixed with 'ERS', e.g., ERS487899). But I can't process all the accession numbers in the assembly_ids list at once.
Do you have any good scripts or ideas that can help me solve this problem?
Thank you very much. I look forward to your reply!!!