EBI-Metagenomics / genomes-catalogue-pipeline

MGnify genome analysis pipeline

Can't fetch ena data with biome list #79

Open amardeepranu opened 7 months ago

amardeepranu commented 7 months ago

https://github.com/EBI-Metagenomics/genomes-pipeline/blob/853487f6dda1420fd8b6b41dd4aff5c8540c7e37/bin/fetch_ena.py#L66

The method linked above returns nothing. I believe that is because the API returns an empty metagenome_source field:

https://www.ebi.ac.uk/ena/portal/api/search?result=wgs_set&query=assembly_type%3D%22metagenome-assembled%20genome%20%28mag%29%22&fields=study_accession%2Cmetagenome_source&limit=10&format=json&download=false

[
  {
    "study_accession": "PRJEB35770",
    "metagenome_source": "",
    "accession": "CAEMXZ010000000"
  },
  ....
  {
    "study_accession": "PRJEB35770",
    "metagenome_source": "",
    "accession": "CAESAJ010000000"
  }
]

I'm not sure which field would work here to get the biome. When I hit the API to list the search fields for wgs_set, I get a 500: https://www.ebi.ac.uk/ena/portal/api/searchFields?dataPortal=metagenome&result=wgs_set&format=json

Is there an alternate field that contains the biome? Is there any workaround? Thanks!

tgurbich commented 7 months ago

Hi @amardeepranu,

Thanks for spotting and reporting this. The function returns an empty result because the ENA API recently changed the order of its columns, and this script hasn't been adjusted to handle that. We will fix it in the new year.
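To illustrate the kind of change involved, here is a minimal sketch of reading a tab-separated API response by column name instead of by position, so that a reordering of columns does not matter. This is not the code in fetch_ena.py. The function name `parse_ena_tsv` is hypothetical, and the sketch assumes the response is tab-separated text with a header row.

```python
import csv
import io


def parse_ena_tsv(text, wanted=("study_accession", "metagenome_source")):
    """Return one dict per row, keyed by column name rather than position.

    Hypothetical helper: assumes `text` is tab-separated with a header row.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Fail loudly if an expected column is absent, instead of returning nothing.
    missing = [name for name in wanted if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns in response: {missing}")
    return [{name: row[name] for name in wanted} for row in reader]
```

With this approach, two responses that differ only in column order parse to the same records.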

The metagenome_source field can be empty, but it isn't always. You can see that if you run the same query with a higher limit: https://www.ebi.ac.uk/ena/portal/api/search?result=wgs_set&query=assembly_type%3D%22metagenome-assembled%20genome%20%28mag%29%22&fields=study_accession%2Cmetagenome_source&limit=10000&format=json&download=false
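The same check can be scripted. The sketch below rebuilds the query URL from its parameters and keeps only the records whose metagenome_source is non-empty. The helper names (`build_query_url`, `fetch_records`, `with_biome`) are hypothetical and not part of the pipeline. The sketch assumes the JSON response is a list of objects shaped like the ones quoted in this issue.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_query_url(limit=10000):
    """Build the wgs_set search URL used in this thread."""
    params = {
        "result": "wgs_set",
        "query": 'assembly_type="metagenome-assembled genome (mag)"',
        "fields": "study_accession,metagenome_source",
        "limit": limit,
        "format": "json",
        "download": "false",
    }
    # quote_via=quote encodes spaces as %20, matching the URLs above.
    return f"{ENA_SEARCH}?{urlencode(params, quote_via=quote)}"


def fetch_records(limit=10000):
    """Fetch and decode the JSON response (needs network access)."""
    with urlopen(build_query_url(limit)) as response:
        return json.load(response)


def with_biome(records):
    """Keep only records that carry a non-empty metagenome_source."""
    return [r for r in records if (r.get("metagenome_source") or "").strip()]
```

Calling `with_biome(fetch_records())` would then give the subset of records that can be matched against a biome list.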

May I ask whether you are working on generating a biome-specific catalogue yourself using public data? A workaround, and perhaps an easier and more appropriate way of collecting genomes from a biome of interest, would be to search ENA/NCBI first and supply the scripts with genome accessions rather than a list of biomes. Knowing what you are trying to do would help us advise you better.

Kind regards, Tanya