apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

genomad annotate fastq file is empty or contains multiple entries #107

Open skose82 opened 1 week ago

skose82 commented 1 week ago

Hi, I am getting the following error:

sample1.fastq.gz is either empty or contains multiple entries with the same identifier. Please check your input FASTA file and execute genomad annotate again.

Not sure what's happening as the file isn't empty and works fine with other scripts/programs.

Is there a way to run genomad with paired-end FASTQ files, or do they need to be interleaved first?

Command is: genomad end-to-end --cleanup --splits 20 sample1_R1.fastq.gz genomad_output genomad_db

Thanks

apcamargo commented 1 week ago

That's because it's a FASTQ file. I should add a message explicitly stating that those are not supported.
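As an illustration of the kind of check that would tell these cases apart, here is a minimal sketch that guesses the format from the first non-blank line of the file. It is not geNomad's actual code: `sniff_format` is a hypothetical helper, and it assumes FASTA records start with `>` and FASTQ records start with `@`.

```python
import gzip


def sniff_format(path):
    """Guess the sequence format from the first non-blank line.

    Returns 'fasta', 'fastq', 'empty', or 'unknown'.
    Hypothetical helper for illustration; not part of geNomad.
    """
    # Pick a gzip-aware opener based on the file extension (an assumption).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "empty"
```

Running something like this on the input before calling geNomad would show right away whether the file is FASTQ rather than empty or duplicated.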

Can you convert it to FASTA with seqkit fq2fa? Also, are those short or long reads? geNomad is not designed to work with short sequences such as 150 bp reads.
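If seqkit is not available, the conversion can also be sketched in plain Python. This is only a rough equivalent under stated assumptions, not how seqkit works: `fastq_to_fasta` is a hypothetical helper, and it assumes strict four-line FASTQ records (sequence and quality not wrapped across lines).

```python
import gzip


def _open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode (chosen by extension)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, mode + "t")


def fastq_to_fasta(fastq_path, fasta_path):
    """Copy each four-line FASTQ record to FASTA; return the record count.

    Hypothetical helper for illustration. Assumes unwrapped, four-line records.
    """
    count = 0
    with _open_text(fastq_path, "r") as src, _open_text(fasta_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between or after records
            if not header.startswith("@"):
                raise ValueError(f"Expected '@' header, got: {header!r}")
            sequence = src.readline().rstrip("\n")
            src.readline()  # '+' separator line, ignored
            src.readline()  # quality line, dropped in FASTA
            dst.write(">" + header[1:].rstrip("\n") + "\n" + sequence + "\n")
            count += 1
    return count
```

For example, `fastq_to_fasta("sample1_R1.fastq.gz", "sample1_R1.fasta")` would write one FASTA record per read. Conversion only fixes the format; it does not make 150 bp reads a suitable input.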

skose82 commented 1 week ago

Thank you for the quick response. They are short reads. They worked with the nf-core/mag workflow (https://nf-co.re/mag/3.0.0) for metagenomes, so I was trying to use geNomad as a standalone package. I believe that worked mostly because the workflow ran it on assembled genomes. So paired-end short reads are a no-go with geNomad? Do you know of any packages that can do something similar with paired-end data?

Thanks

apcamargo commented 1 week ago

If you ran this pipeline, you should have an assembly (or multiple) for your data. The workflow assembles metagenomes prior to binning.

Regarding doing the analysis directly on reads, it depends on what you want. If you want to evaluate the presence of known viruses, you could run something like PHANTA or KMCP. For discovery of new viruses or description of virus genomes, that won't work; you will need assemblies first.