Gibbons-Lab / medi

Metagenomic Estimation of Dietary Intake and Content.
Apache License 2.0

add_existing failed #10

Open jjoropezav opened 5 months ago

jjoropezav commented 5 months ago

Hello again, sorry to bother you.

I found this error while running build_kraken.nf; I tried three times with the same result:


Apr-10 21:34:56.803 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'add_existing (1)'

Caused by: Process add_existing (1) terminated with an error exit status (25)

Command executed:

kraken2-build --download-library bacteria --db medi_db --threads 4

Command exit status: 25

Command output: (empty)

Command error:
Step 1/2: Performing rsync file transfer of requested files
rsync: link_stat "/all/GCF/037/832/925/GCF_037832925.1_ASM3783292v1/GCF_037832925.1_ASM3783292v1_genomic.fna.gz" (in genomes) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1684) [generator=3.1.3]
rsync_from_ncbi.pl: rsync error, exiting: 5888

Work dir: /scratch/home/joropeza/medi/work/b4/7605aa405f3eaed44400ee14867236


It seems the sequence has been suppressed in NCBI: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_037832925.1/
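If it helps, the suppression can also be checked from the command line, assuming the NCBI datasets CLI is installed:

# prints a JSON summary for the accession; the assembly status should show it as suppressed
datasets summary genome accession GCF_037832925.1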

Attached is the log file from Nextflow. Is there any workaround? .nextflow.log GCF_037832925.1_ASM3783292v1_genomic.fna.gz

Could we use a premade Kraken database to fix this issue? https://benlangmead.github.io/aws-indexes/k2

Thanks again!

cdiener commented 5 months ago

Hi, unfortunately this is an error in Kraken2 itself. It looks like a timing issue: Kraken2 usually does a dry run first and flags files that cannot be downloaded, so cases like this would normally be caught. However, if the genome gets suppressed between the dry run and the actual download (which can happen, especially if the download is slow and takes a while), you end up with exactly this failure. The easiest fix is to rerun the download at a later date.
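When you rerun, Nextflow's -resume flag should reuse the cached results from the steps that already finished, so only the failed download is repeated. Assuming you launch the pipeline directly from build_kraken.nf as before, something like:

nextflow run build_kraken.nf -resume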

We would love to provide prebuilt hashes; the issue is the size (~600 GB), because there is currently no public repository that lets you deposit data of that size for free. I will try to apply for the AWS program Kraken2 uses, but there is no guarantee it will be granted. We are also floating the idea of a subsampled hash (down to 128 GB), which could be uploaded to existing repositories. Since we are rebuilding the database for the revisions, it will take a bit though (expect it roughly when the paper is published).

Sorry for the inconvenience!

jjoropezav commented 5 months ago

Oh, I see the problem now.

I have space available on my Google Drive account and could keep that link open without issues for at least a year in the meantime. That could be a viable way to host the prebuilt hashes, though I don't know if that would work for you.

Thanks again for the help