CDCgov / phoenix

🔥🐦🔥PHoeNIx: A short-read pipeline for healthcare-associated and antimicrobial resistant pathogens
Apache License 2.0

SPAdes_Failure that points to empty single reads file in v2.0.2 #132

Closed TK-DPH closed 4 months ago

TK-DPH commented 5 months ago

PHoeNIx version: v2.0.2 Nextflow: 23.04.4 HPC: HiPerGator

PHoeNIx failed 4 isolates with Auto_QC_Failure_Reason: SPAdes_Failure in each *_summaryline_failure.tsv. As a result, most results in the summary files are "Unknown". SPAdes failed because of an empty single reads file "{sample}.singles.fastq.gz", which was supposed to be produced by FastP trimming.

From the nextflow log there is a line for each sample: "error [nextflow.exception.ProcessFailedException]: Process PHOENIX:PHOENIX_EXTERNAL:SPADES_WF:SPADES ({sample}) terminated with an error exit status (255)"

From the spades_wf work directory, each {sample}.synopsis file shows: "ASSEMBLY: FAILED: {sample}.scaffolds.fa.gz not found"

The .command.log files show: "== Error == file is empty: path/{sample}.singles.fastq.gz (single reads, library number: 1, library type: paired-end)"

From the documentation, ".singles.fastq.gz" files are output from fastp_trimd, so I checked that step next in the .nextflow.log (attached below). The TaskProcessor DEBUG lines all show "Skipping output binding because one or more optional files are missing: fileoutparam<6:1>"

I am trying to find out what went wrong in the FASTP_TRIMD step. From that step, the .command.out is empty and .command.err doesn't say much, so attached is the .command.sh from one isolate out of four.

Attached logs: nextflow.log fastp_trimd-command.sh.txt
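For anyone hitting the same error, a quick way to confirm which read files in a work directory are actually empty is sketched below. This is only an illustration, not part of PHoeNIx: the function names are made up, and it assumes the files are ordinary gzip files. Note that a gzip file with no content is still around 20 bytes on disk, so checking the file size alone is not enough.

```python
import gzip
from pathlib import Path


def is_empty_gzip(path):
    """Return True if the gzip file at `path` decompresses to zero bytes."""
    with gzip.open(path, "rb") as handle:
        # Reading a single byte is enough to tell empty from non-empty.
        return handle.read(1) == b""


def report_empty_reads(directory, pattern="*.fastq.gz"):
    """Return the sorted names of files matching `pattern` that hold no data."""
    return sorted(
        p.name for p in Path(directory).glob(pattern) if is_empty_gzip(p)
    )
```

Calling `report_empty_reads(".")` from inside the FASTP_TRIMD or SPADES work directory would list any empty read files, which should include the `.singles.fastq.gz` file that SPAdes complains about.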

jvhagey commented 5 months ago

Hi @TK-DPH can you check that this is still an issue with the newest v2.1.0 version? If so we will continue with troubleshooting.

TK-DPH commented 5 months ago

Hi! Thanks. I just tried v2.1.0 on a different dataset; it went through a corruption check, then skipped straight to fetching failed summaries. A test profile run was successful earlier today, but with my samples it didn't run any tools or generate results. Attached log: nextflow.log

jvhagey commented 5 months ago

What is the sample name and the full file name? I think this is related to #142; we are working on a patch.

TK-DPH commented 5 months ago

File names look like "2021LY00045_1.fastq.gz" and "2021LY00045_2.fastq.gz" for example, and the sample name is "2021LY00045". If I go into the phx_output directory for that one, the only file in there is "phx_output/2021LY00045/file_integrity/2021LY00045summary.txt" and it only states the following:

PASSED: File 2021LY00045 is not corrupt.
PASSED: File 2021LY00045_ is not corrupt.

jvhagey commented 5 months ago

@TK-DPH, yes, this is the same name parsing issue; we will have to put out a patch to address it. For now a fix would be to change the fastq file names to "2021LY00045_R1.fastq.gz" and "2021LY00045_R2.fastq.gz".
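If there are many files to rename, the workaround can be scripted. The sketch below is a hypothetical helper, not something shipped with PHoeNIx: it assumes every affected file ends in `_1.fastq.gz` or `_2.fastq.gz`, and it defaults to a dry run so the planned renames can be reviewed first.

```python
import re
from pathlib import Path

# Matches "<sample>_1.fastq.gz" or "<sample>_2.fastq.gz" (assumed naming).
_READ_SUFFIX = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def renamed(filename):
    """Return the "_R1"/"_R2" form of `filename`, or None if it does not match."""
    match = _READ_SUFFIX.match(filename)
    if match is None:
        return None
    return f"{match['sample']}_R{match['read']}.fastq.gz"


def rename_reads(directory, dry_run=True):
    """Return (old, new) name pairs; rename on disk only when dry_run is False."""
    planned = []
    # sorted() lists the directory up front, so renaming while looping is safe.
    for path in sorted(Path(directory).iterdir()):
        new_name = renamed(path.name)
        if new_name is None:
            continue
        planned.append((path.name, new_name))
        if not dry_run:
            path.rename(path.with_name(new_name))
    return planned
```

Files already named with `_R1` or `_R2` are left alone. Any samplesheet that lists the old paths would need the same change.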

TK-DPH commented 4 months ago

Thanks! After renaming the fastq files this way, v2.1.0 ran all tools, and I'm glad you are working on a patch as well. Back to the original error: I'm investigating with one E. coli sample, which was submitted with a previous pipeline before we got phoenix. The original raw reads input is at https://www.ncbi.nlm.nih.gov/sra/SRS20051145.

With v2.0.2, "Auto_QC_Failure_Reason" was "SPAdes_Failure" and many results in the file are listed as Unknown. With v2.1.0, "Auto_QC_Failure_Reason" was "smaller_than_1000000_bps(0)-coverage_below_30(0)", again with many Unknown results.

v2.1.0 has similar problems for this sample and here are some results from v2.1.0 to note.

In the FASTP_TRIMD step .command.sh, here are the fastp unpaired read names: --unpaired1 2023LY00096_1.fail.fastq.gz --unpaired2 2023LY00096_2.fail.fastq.gz

The FASTP_SINGLES step indicates that both of these files are empty. "Debugging: Emptiness of reads[0] and reads[1] Both are empty" - from debug_status.log

In the GET_TRIMD_STATS step, the 2023LY00096_summary.txt indicates:

PASSED: File 2023LY00096_R1 is not corrupt.
PASSED: File 2023LY00096_R2 is not corrupt.
PASSED: Read pairs for 2023LY00096 are equal.
PASSED: There are reads in 2023LY00096 R1/R2 after trimming.
End_of_File

From the SPADES_WF:SPADES 2023LY00096_spades_outcome.csv: "run_completed,no_scaffolds,contigs_created" Attached are some spades step files of note, zipped: SPADES_WF.zip

Thanks for working with me on this, and please let me know if you need any more information.

jvhagey commented 4 months ago

OK, so this is odd: when I run SRR27420408 through phx, pulling directly from NCBI, it runs through fine.

My command: nextflow run cdcgov/phoenix -r v2.1.0 -latest -entry SRA -profile singularity,cdcsge --kraken2db $KRAKEN_DB_v2_1 --input_sra ga_samplesheet.csv --outdir GA_bug

Any chance your local copy is different from what is on NCBI?

We run downstream analysis with scaffolds, so if only contigs are made, we label it a failure. It is therefore expected that the pipeline reports a failure when SPAdes doesn't make scaffolds.

jvhagey commented 4 months ago

@TK-DPH, I also see there is a comment "None of paired reads aligned properly" in the spades.log, which is probably why it's not making scaffolds. Again, it would be good to confirm that your local copy isn't different from NCBI. Let me know so we can move forward with a patch release and close this issue.
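One rough way to check whether a local copy differs from what is on NCBI is to compare read and base counts. The sketch below is an illustration only: `fastq_stats` is a made-up helper, and it assumes gzipped FASTQ files with exactly four lines per record. The numbers for the local R1/R2 files could then be compared against a fresh download of the same run, or against the read and base totals listed for the run on NCBI.

```python
import gzip


def fastq_stats(path):
    """Return (read_count, base_count) for a gzipped, 4-line-per-record FASTQ."""
    reads = 0
    bases = 0
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a 4-line record the sequence is the second line (index 1).
            if line_number % 4 == 1:
                reads += 1
                bases += len(line.rstrip("\r\n"))
    return reads, bases
```

If R1 and R2 of the local copy give different read counts, or the totals do not match the NCBI copy, that would point to the local files rather than the pipeline.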