IndexThePlanet / Logan

Logan Unitigs and Contigs

Filtering files by metadata #6

Closed by durrantmm 2 months ago

durrantmm commented 2 months ago

Hello, I was hoping to map all Logan files to their species of origin so I could do filtered downloads. Do you have this data on hand already, by chance? If not, is there a faster way than Entrez to get it for all ~26 million files?

apcamargo commented 2 months ago

I wrote some code to retrieve this sort of data from ENA using their API. I just made it available here: https://github.com/apcamargo/retrieve-ena-metadata.

The metadata you'll get includes the tax_id field. The repository contains a notebook with an example at the end showing how to parse a taxonomy ID to get the full lineage. Retrieving everything did take a few days to finish, though. The fastest solution is probably NCBI BigQuery.
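
For readers who want a feel for what such a lookup involves, here is a minimal sketch in Python. It is not the code from the repository linked above. It assumes the ENA Portal API `filereport` endpoint with the `read_run` result type and the `run_accession`, `tax_id`, and `scientific_name` fields; the function names are hypothetical, and one request per accession would be far too slow for ~26 million files without batching.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession, fields=("run_accession", "tax_id", "scientific_name")):
    """Build a filereport URL that asks for TSV output with the given fields."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


def parse_tsv(text):
    """Parse a TSV report into {run_accession: row_dict}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["run_accession"]: row for row in reader}


def filter_by_tax_id(records, wanted_tax_ids):
    """Return the sorted accessions whose tax_id is in wanted_tax_ids."""
    wanted = {str(t) for t in wanted_tax_ids}
    return sorted(acc for acc, row in records.items() if row.get("tax_id") in wanted)


def fetch_metadata(accession):
    """Download and parse the report for one accession (needs network access)."""
    with urlopen(build_filereport_url(accession), timeout=30) as resp:
        return parse_tsv(resp.read().decode("utf-8"))
```

Once the metadata is in hand, `filter_by_tax_id(records, [562])` would keep only the accessions annotated with taxonomy ID 562, which is the kind of filtered download list the original question asks about.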

durrantmm commented 2 months ago

Thanks! I decided to just go with Entrez and got all the information I needed.