Error reading reads files

DrB-S commented 1 year ago

I am getting an error running Cecret v.3.6.20230425 on a set of gzipped paired-end reads from wastewater. Here is the command-line:
nextflow -bg run UPHL-BioNGS/Cecret -profile singularity --reads reads

Error message:
Missing fromPath parameter
FATAL : No input files were found!
No paired-end fastq files were found at /data/Sequence_analysis/Cecret/Analyses/Covid_wastewater/wastewater_25Apr2023/reads.
Set 'params.reads' to directory with paired-end reads

Here is a subset of the reads files:
-rwxrwxr-x 2 becksts becksts 836330937 Apr 25 16:42 AN20230313_1.fastq.gz
-rwxrwxr-x 2 becksts becksts 858250873 Apr 25 16:42 AN20230313_2.fastq.gz
-rwxrwxr-x 2 becksts becksts 1105000849 Apr 25 16:42 BR20230307_1.fastq.gz
-rwxrwxr-x 2 becksts becksts 1140275026 Apr 25 16:42 BR20230307_2.fastq.gz

I have tried changing the path to an absolute path, but I cannot get past this error.

erinyoung commented 1 year ago

That's a strange error. It looks like the workflow might not be able to pair your files. It might be worthwhile to create a sample sheet instead and read that in with the --sample_sheet option.

The sample sheet format is

sample,fastq_1,fastq_2
AN20230313,reads/AN20230313_1.fastq.gz,reads/AN20230313_2.fastq.gz

Etc
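For a directory with many samples, the sheet can be generated rather than typed by hand. The following is a minimal sketch, not part of Cecret: the function name `build_sample_sheet` is hypothetical, and it assumes every file follows the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming seen in the listing above.

```python
import csv
import re
from pathlib import Path

# Assumed naming convention: <sample>_1.fastq.gz and <sample>_2.fastq.gz
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def build_sample_sheet(reads_dir, out_path):
    """Write a sample,fastq_1,fastq_2 CSV for each complete read pair."""
    pairs = {}
    for path in sorted(Path(reads_dir).iterdir()):
        match = READ_PATTERN.match(path.name)
        if match:
            # Group the two mates of a pair under the shared sample name
            pairs.setdefault(match["sample"], {})[match["read"]] = path

    rows = [
        (sample, str(reads["1"]), str(reads["2"]))
        for sample, reads in sorted(pairs.items())
        if "1" in reads and "2" in reads  # skip samples missing a mate
    ]

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        writer.writerows(rows)
    return rows
```

Calling `build_sample_sheet("reads", "sample_sheet.csv")` from the analysis directory would write one row per sample that has both mates. Samples with only one file are silently skipped, so it may be worth comparing the row count against the number of expected samples.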

DrB-S commented 1 year ago

Unfortunately, the sample sheet did not help. When I use --fastas and --sample_sheet on the command-line, without referring to a config file, it runs the sarscov2 analysis:

Using the subworkflow for SARS-CoV-2
The files and directory for results is /data/Sequence_analysis/Cecret/Analyses/Covid_wastewater/wastewater_25Apr2023/cecret
Sample sheet found : /data/Sequence_analysis/Cecret/Analyses/Covid_wastewater/wastewater_25Apr2023/sample_sheet.txt
Amplicon BedFile : /app/becksts/.nextflow/assets/UPHL-BioNGS/Cecret/schema/artic_V4_SARS-CoV-2.insert.bed
Reference Genome : /app/becksts/.nextflow/assets/UPHL-BioNGS/Cecret/genomes/MN908947.3.fasta
GFF file for Reference Genome : /app/becksts/.nextflow/assets/UPHL-BioNGS/Cecret/genomes/MN908947.3.gff
Primer BedFile : /app/becksts/.nextflow/assets/UPHL-BioNGS/Cecret/schema/artic_V4_SARS-CoV-2.primer.bed
Paired-end Fastq files found : null
Paired-end Fastq files found : null
Paired-end Fastq files found : null
Paired-end Fastq files found : null
Paired-end Fastq files found : null
Paired-end Fastq files found : null
Paired-end Fastq files found : null
Paired-end Fastq files found : null

The fastq files are actually found, even though the message indicates otherwise, and the program is comparing those files against sarscov2. This does not find anything obvious and prevents analysis of other viral genomes. If I specify a config file, it fails right away.

Is there a way to use Cecret to determine which viral genomes are in the wastewater and compare against those, instead of comparing specifically for sarscov2?

erinyoung commented 1 year ago

Nextflow can't parse the names of your fastq files (see the warning under https://github.com/UPHL-BioNGS/Cecret#getting-files-from-directories)

To side-step this issue:

rename your directory from reads to something else so that the workflow doesn't try to automatically use those files
use a sample sheet, and specify that sample sheet in your config file
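The two steps above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `sidestep_reads_dir`, the new directory name `fastq`, and the config file name `cecret.config` are all hypothetical, and `params.sample_sheet` is assumed to be the config-file counterpart of the `--sample_sheet` command-line option, following Nextflow's usual convention.

```python
from pathlib import Path


def sidestep_reads_dir(workdir, sample_sheet="sample_sheet.csv",
                       new_name="fastq", config_name="cecret.config"):
    """Rename the 'reads' directory and write a config naming the sample sheet."""
    workdir = Path(workdir)
    old_dir = workdir / "reads"
    new_dir = workdir / new_name

    # Step 1: move 'reads' aside so the workflow does not pick it up automatically
    if old_dir.is_dir() and not new_dir.exists():
        old_dir.rename(new_dir)

    # Step 2: point the workflow at the sample sheet through a config file
    # (params.sample_sheet is assumed to mirror the --sample_sheet option)
    config_path = workdir / config_name
    config_path.write_text(f"params.sample_sheet = '{sample_sheet}'\n")
    return new_dir, config_path
```

The resulting file could then be passed with Nextflow's `-c` option, for example `nextflow run UPHL-BioNGS/Cecret -profile singularity -c cecret.config`. The paths inside the sample sheet would need to point at the renamed directory rather than `reads/`.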

erinyoung commented 1 year ago

Is there a way to use Cecret to determine which viral genomes are in the wastewater and compare against those, instead of comparing specifically for sarscov2?

Cecret is reference-based at its core, and expects the user to know what genome they are looking for. I think you are hoping for a more metagenomic analysis. Cecret can use Kraken2 to classify reads, but it does not attempt to bin them or align them to multiple references.

Have you tried MAG (https://nf-co.re/mag)?

DrB-S commented 1 year ago

Thanks! I’ll try MAG.

Stephen M. Beckstrom-Sternberg, PhD Bioinformatics Contractor

Arizona State Public Health Lab Arizona Department of Health Services Cell: (602) 653-5011 Email: @.***

erinyoung commented 1 year ago

Best of luck to you! If you run into issues with MAG, just ask in their Slack channel. They are a friendly, helpful bunch of people in my experience.

DrB-S commented 1 year ago

Thanks

UPHL-BioNGS / Cecret

Error reading reads files #166