Missing analyses from an assembly ids

zdk123 commented 4 months ago

Hi MGnify team.

Apologies if this is not the correct place to report this issue. I am interested in building a protein -> contig -> assembly map and figured I could rely on the mgy_assemblies.tsv file on the FTP server, which contains the protein -> ERZ relationships, and then use the API to pull the contigs from the most recently available analysis (e.g. https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ509256/analyses seems sufficient).

However, I noticed that a substantial number of the ERZ ids do not have any analysis data. For example https://www.ebi.ac.uk/metagenomics/api/v1/assemblies/ERZ1744782/analyses. By my count, 5840 out of the 33345 ids are missing from the API. Is there another way to get a complete protein -> contig map?

thanks!

SandyRogers commented 4 months ago

Hello @zdk123 , thanks for your question.

There are various reasons why assemblies you find in the protein database releases may not be accessible in the EMG API.

One of the most common reasons is where data have been "suppressed" in ENA (or more rarely, in MGnify) after the protein database snapshot was released, but before you query the EMG API.

We regularly/continuously reflect the suppression state of datasets from ENA into MGnify's live database (behind the API). (Suppression is usually at the request of data submitters.) We do not retrospectively remove proteins that were derived from suppressed assemblies from the released protein database snapshots though.

In your example the cause was slightly different, but the scenario is similar: the assembly was produced and proteins were predicted and ingested into the protein database, but that assembly analysis was never uploaded to the MGnify database behind the EMG API. This usually happens when we notice that the assembly or annotation quality is not as good as it could be. In the case you linked, we reassembled that dataset with a different assembler, and that reassembly is the one available on the MGnify API/website.

So in general, you should expect that there are assemblies referenced in the protein database that are not available on the API, and vice versa there will be assemblies/analyses on the API that are not (yet) in a protein database release, since these data products have very different release cadences.
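A minimal sketch of how a client could tolerate this mismatch, splitting a list of ERZ accessions into those that currently have analyses on the API and those that do not. The endpoint pattern is the one quoted earlier in this thread; the function names, the JSON:API-style `"data"` list in the response body, and the treatment of any non-200 status as "missing" are assumptions for illustration, not documented EMG API behaviour.

```python
import json
import urllib.error
import urllib.request

API_ROOT = "https://www.ebi.ac.uk/metagenomics/api/v1"


def fetch_analyses(erz_id):
    """Return (http_status, parsed_json_or_None) for one assembly accession."""
    url = f"{API_ROOT}/assemblies/{erz_id}/analyses"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumption: suppressed or never-uploaded assemblies answer with an
        # HTTP error status rather than an empty 200 response.
        return err.code, None


def split_by_availability(erz_ids, fetch=fetch_analyses):
    """Split accessions into (available, missing), preserving input order.

    `fetch` is injectable so the logic can be exercised without network access.
    """
    available, missing = [], []
    for erz_id in erz_ids:
        status, payload = fetch(erz_id)
        # Assumption: a successful body carries its records in a "data" list.
        has_analyses = status == 200 and bool((payload or {}).get("data"))
        (available if has_analyses else missing).append(erz_id)
    return available, missing
```

Calling `split_by_availability(["ERZ509256", "ERZ1744782"])` would query the live API; for thousands of ids it would be polite to add rate limiting and caching, which this sketch omits.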

zdk123 commented 4 months ago

Thanks for the quick reply and detailed comments. I have a followup based on what you just said here.

Do you make the retired/hidden contigs/analysis available via any other mechanism other than the API (e.g. maybe a different part of the FTP site)?
In the scenario described above, if the assembly gets retired/re-done, approximately when will the assembly map on the FTP site be updated to reflect the new protein -> assembly mapping?

SandyRogers commented 4 months ago

Do you make the retired/hidden contigs/analysis available via any other mechanism other than the API (e.g. maybe a different part of the FTP site)?

Not currently. In future, we may well enable FTP (etc.) access to analysis data products, e.g. GFF files associated with assemblies. It is still likely that we would only serve "current and public" (i.e. not suppressed or embargoed) data products in this way though.

In the scenario described above, if the assembly gets retired/re-done, approximately when will the assembly map on the FTP site be updated to reflect the new protein -> assembly mapping?

Assuming an identical MGYP was found in the first assembly (ERZ1) and the later re-assembly (ERZ2), the next release of the protein database would include ERZ2 in the mgy_assemblies.tsv. E.g. a line may go from:
MGYP1 ERZ1;ERZ99
to:
MGYP1 ERZ1;ERZ99;ERZ2
MGYPs can and do change with new assemblies though, since the contigs change.
The old ERZ1 may or may not be removed as well, depending on the reasons for reassembling.
The cadence for new protein database releases should generally be multiple times per year (it has been less frequent recently due to substantial refactoring needed to handle the increasing scale).
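To see which protein -> assembly mappings changed between two protein database releases, one could parse both copies of mgy_assemblies.tsv and diff them. The sketch below assumes the two-column layout shown in the example lines above (an MGYP accession, whitespace, then a semicolon-separated ERZ list); the function names are hypothetical and the real file may have additional columns or a header.

```python
def parse_assembly_map(lines):
    """Parse 'MGYP<whitespace>ERZ;ERZ;...' lines into {mgyp: set(erz_ids)}."""
    mapping = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        mgyp, erz_field = parts
        mapping[mgyp] = {erz for erz in erz_field.split(";") if erz}
    return mapping


def diff_releases(old, new):
    """Return {mgyp: (added_erz, removed_erz)} for proteins whose ERZ set changed.

    Proteins present in only one release appear with all of their
    assemblies listed as added or removed.
    """
    changes = {}
    for mgyp in old.keys() | new.keys():
        before = old.get(mgyp, set())
        after = new.get(mgyp, set())
        added, removed = after - before, before - after
        if added or removed:
            changes[mgyp] = (sorted(added), sorted(removed))
    return changes
```

With real releases, each file would be passed as an open file handle, e.g. `parse_assembly_map(open("mgy_assemblies.tsv"))`; for very large releases a streaming or on-disk approach may be needed instead of in-memory dictionaries.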

EBI-Metagenomics / emgapi

Missing analyses from an assembly ids #359