EBI-Metagenomics / genome_uploader

Python script to upload bins and MAGs to ENA (European Nucleotide Archive)
Apache License 2.0

Wrong link to sample #6

Closed SilasK closed 1 year ago

SilasK commented 1 year ago

Hello, I'm now at the webin validate step.

I got this error file:

ERROR: Could not find sample "ERS14576595". The sample must be owned by the submission account used for this submission or it must be private or temporarily suppressed and referenced by accession. Note that only a single sample can be referenced. Unknown sample ERS14576595 or the sample cannot be referenced by your submission account. Samples must be submitted before they can be referenced in the submission. [manifest file: /scratch/rdkiesersi1/CMMG/Batches/batch1/MAG_upload/manifests/MGG00155.manifest, line number: 2, field: SAMPLE, value: ERS14576595]

I have this manifest file:

STUDY   PRJNA646353
SAMPLE  ERS14576595
ASSEMBLYNAME    MGG00155_1675429372
ASSEMBLY_TYPE   Metagenome-Assembled Genome (MAG)
COVERAGE        100.0
PROGRAM metaSpades_v3.13
PLATFORM        Illumina HiSeq 2500
MOLECULETYPE    genomic DNA
DESCRIPTION     This is a bin derived from the primary whole genome shotgun (WGS) data set ERP022980. This sample represents a Metagenome-Assembled Genome (MAG) from the metagenomic run ERR1989821.
RUN_REF ERR1989821
FASTA   /scratch/rdkiesersi1/CMMG/genomes/MGG00155.fasta.gz

SilasK commented 1 year ago

The genome assembled from the run ERR1989821 should link to ERS1755325, but the manifest file states ERS14576595.

Apparently, there is an error in the upload script!

SilasK commented 1 year ago

There are other problems with ERS14576595.

SilasK commented 1 year ago

I checked another manifest file

ERR1989822 -> ERS1755326, but the manifest file states ERS14576600.

Ge94 commented 1 year ago

Hi Silas, no worries. The ERS accession in the manifest refers to the newly registered sample for the genome upload. The RUN_REF field, which contains the ERR accession, is what links the two: the original run and the new genome sample.

About ERS14576595, the error states it doesn't exist. My first guess is that the sample was registered in test mode; you can double-check this by looking at the script's output files. Test submissions only live for 24 hours on ENA's test server, so this might be the issue if you launched the genome_uploader yesterday or earlier. If ERS14576595 appears in registered_MAGs_test.tsv, then you will need to re-run the script in either test or live mode according to your needs.

If this is not the case, could you please share the commands you used?
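
The check described above (does the accession appear in registered_MAGs_test.tsv?) could be sketched roughly as follows. This is only an illustration, not part of the genome_uploader: the helper name `accession_in_file` is hypothetical, and the sketch assumes the output file is tab-separated. Because the column layout is not given in this thread, it scans every field rather than a specific column.

```python
from pathlib import Path


def accession_in_file(accession: str, tsv_path: str) -> bool:
    """Return True if `accession` is a whole tab-separated field in the file.

    Assumption: the file is tab-separated. Every field of every line is
    checked because the real column layout is not known here.
    """
    path = Path(tsv_path)
    if not path.is_file():
        # A missing file simply means the accession was not found there.
        return False
    with path.open() as handle:
        for line in handle:
            if accession in line.rstrip("\n").split("\t"):
                return True
    return False
```

Calling `accession_in_file("ERS14576595", "registered_MAGs_test.tsv")` from the upload directory would then return True if the sample was registered in test mode.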

SilasK commented 1 year ago

What do you mean by "new genome sample"? My MAGs are assembled from existing SRA runs and samples. Why would it create a new and wrong sample?

Ge94 commented 1 year ago

The way ENA works is that you have to register a sample for each genome to be submitted; this is part of what the genome uploader does. So a genome references the already existing sample (e.g. ERS1755325), but it needs to be associated with a new ERS accession (e.g. ERS14576595). These new accessions are the ones found in the registered_MAGs(_test).tsv and manifest files. It is not wrong, it is the way ENA links data internally. Going back to the query above, please look at what I pointed out to double-check which category you fall into. Happy to help solve this issue.
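
To see the two accessions side by side, a manifest can be read into a dictionary. The sketch below is illustrative only and is not how the genome_uploader or Webin-CLI parse manifests; the function name `parse_manifest` is hypothetical, and it assumes one field per line with the field name separated from its value by whitespace, as in the manifest posted above.

```python
def parse_manifest(text: str) -> dict[str, str]:
    """Parse 'FIELD<whitespace>value' lines into a dictionary.

    Assumption: one field per line, and field names contain no whitespace.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only on the first run of whitespace so values keep their spaces.
        parts = line.split(None, 1)
        fields[parts[0]] = parts[1] if len(parts) > 1 else ""
    return fields


# Excerpt of the manifest from this thread.
manifest_text = """\
STUDY   PRJNA646353
SAMPLE  ERS14576595
RUN_REF ERR1989821
"""

fields = parse_manifest(manifest_text)
# SAMPLE is the newly registered genome sample; RUN_REF is the original run.
print(fields["SAMPLE"], fields["RUN_REF"])
```

Here SAMPLE holds the newly registered accession, while RUN_REF still points at the original run, which is what ties the genome back to the existing data.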

SilasK commented 1 year ago

My code is here: https://github.com/SilasK/upload_genomes/blob/main/Snakefile

I didn't specify anything, so the genome uploader script would have been in test mode. I'll run it again in live mode.

Ge94 commented 1 year ago

Hi Silas, I figure that rerunning the script, either in test or live mode, should work this time. Let me know if this is not the case!