`pipeline:differential_expression:deAnalysis (1)` terminated with an error exit status (1)

afazhra commented 4 days ago

I’m facing an issue with the report below.

We have a total of six samples divided into two groups: treated and control. Each sample has been processed independently up to the DE Analysis step, with no issues observed in earlier stages. All FASTQ files concatenated without errors, and the sample sheet is configured correctly, ensuring proper sample separation between groups.

The entire workflow progressed smoothly up until the DE Analysis step.

I tried running it with 20 threads.

Thank you.

This is my command:

nextflow run epi2me-labs/wf-transcriptomes \
  --fastq fastq_pass/ \
  --transcriptome_source precomputed \
  --ref_genome ../DATA/GCF_012489685.1_LjGifu_v1.2_genomic.fna.gz \
  --ref_transcriptome ../DATA/GCF_012489685.1_LjGifu_v1.2_rna.fna.gz \
  --ref_annotation ../DATA/GCF_012489685.1_LjGifu_v1.2_genomic.gtf.gz \
  --de_analysis \
  --threads 20 \
  --cdna_kit SQK-PCB114 \
  --sample_sheet sample_sheet.csv \
  -c memory.config \
  -resume

This is the log report:

`ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by: Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
de_analysis.R annotation.gtf 3 1 10 3 "sample_sheet.csv"

Command exit status: 1

Command output:
Loading counts, conditions and parameters.
Checking annotation file type.
Annotation file type is gtf.
Checking annotation file for presence of transcript_id versions.
Annotation file transcript_ids include versions.
Loading annotation database.
Filtering counts using DRIMSeq.
Building model matrix.
Sum transcript counts into gene counts.
Running differential gene expression analysis using edgeR.
Running differential transcript usage analysis using DEXSeq.

Command error:
package 'DRIMSeq' was built under R version 4.3.2
Warning messages:
1: package 'GenomicFeatures' was built under R version 4.3.2
2: package 'BiocGenerics' was built under R version 4.3.2
3: package 'S4Vectors' was built under R version 4.3.3
4: package 'IRanges' was built under R version 4.3.3
5: package 'GenomeInfoDb' was built under R version 4.3.2
6: package 'GenomicRanges' was built under R version 4.3.3
7: package 'AnnotationDbi' was built under R version 4.3.2
8: package 'Biobase' was built under R version 4.3.3
Warning messages:
1: package 'edgeR' was built under R version 4.3.3
2: package 'limma' was built under R version 4.3.3
Loading counts, conditions and parameters.
Checking annotation file type.
Annotation file type is gtf.
Checking annotation file for presence of transcript_id versions.
Annotation file transcript_ids include versions.
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Building model matrix.
Warning message:
package 'dplyr' was built under R version 4.3.3
Sum transcript counts into gene counts.
Running differential gene expression analysis using edgeR.
Warning messages:
1: package 'DEXSeq' was built under R version 4.3.3
2: package 'BiocParallel' was built under R version 4.3.3
3: package 'SummarizedExperiment' was built under R version 4.3.2
4: package 'MatrixGenerics' was built under R version 4.3.3
5: package 'matrixStats' was built under R version 4.3.3
6: package 'DESeq2' was built under R version 4.3.3
7: package 'RColorBrewer' was built under R version 4.3.3
Running differential transcript usage analysis using DEXSeq.
converting counts to integer mode
Warning message:
In DESeqDataSet(rse, design, ignoreRank = TRUE) :
  some variables in design formula are characters, converting to factors
Error in estimateSizeFactorsForMatrix(featureCounts(object), locfunc, :
  every gene contains at least one zero, cannot compute log geometric means
Calls: estimateSizeFactors ... estimateSizeFactors -> .local -> estimateSizeFactorsForMatrix
Execution halted

Container: ontresearch/wf-transcriptomes:shad8671ea3a8ed52f2c0f40355e8eb5c6f00d2cbda

Tip: when you have fixed the problem you can continue the execution adding the option -resume to the run command line

-- Check '.nextflow.log' file for details
WARN: Killing running tasks (1) `
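
For readers following the log: the final error is raised by DESeq2's size-factor estimation, which the DEXSeq step calls. The median-of-ratios method takes, for each feature, the geometric mean of its counts across all samples, and that mean is only usable for features with a nonzero count in every sample. If no feature qualifies, the step stops with this message. Below is a minimal Python sketch of the same arithmetic on made-up counts; the function name and the toy matrices are illustrative assumptions and are not taken from the pipeline.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a features x samples count matrix."""
    # Log geometric mean per feature; -inf when any sample has a zero count.
    log_geo_means = [
        sum(math.log(c) for c in row) / len(row)
        if all(c > 0 for c in row) else -math.inf
        for row in counts
    ]
    if all(m == -math.inf for m in log_geo_means):
        raise ValueError(
            "every feature contains at least one zero, "
            "cannot compute log geometric means"
        )
    factors = []
    for j in range(len(counts[0])):
        # Only features with a finite geometric mean contribute.
        log_ratios = [
            math.log(row[j]) - m
            for row, m in zip(counts, log_geo_means)
            if m != -math.inf
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: rows are features, columns are samples (hypothetical values).
usable = [[10, 20], [5, 10], [0, 3]]
print(size_factors(usable))

# Every row has a zero somewhere, so no feature can be used.
sparse = [[0, 4], [7, 0], [0, 0]]
try:
    size_factors(sparse)
except ValueError as err:
    print("fails:", err)
```

With only a handful of samples and low read depth, a single zero per transcript is enough to leave nothing usable, so it may be worth checking the per-sample counts that reach the DE step.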

fgponce commented 2 days ago

Some samples? Do you mean some experiments? I think DE analysis needs multiple samples to calculate differences. Alternatively, perhaps fastcat is combining multiple samples into one? It will do that if the directory structure isn't what it's expecting, i.e. it merges demultiplexed files into one sample if the files are all in the same folder.
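
One way to rule this out is to compare the folders under the `--fastq` directory with the `barcode` column of the sample sheet. The sketch below assumes the usual demultiplexed layout of one sub-directory per barcode (for example `fastq_pass/barcode01/`); that layout and the helper name `check_barcode_dirs` are assumptions for illustration, not something the workflow provides.

```python
import csv
from pathlib import Path

FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def check_barcode_dirs(fastq_root, sample_sheet):
    """Compare barcode sub-directories under fastq_root with the sample sheet."""
    with open(sample_sheet, newline="") as handle:
        expected = {row["barcode"] for row in csv.DictReader(handle)}
    root = Path(fastq_root)
    found = {p.name for p in root.iterdir() if p.is_dir()}
    # FASTQ files sitting directly in the root are not separated by barcode.
    loose = sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.name.endswith(FASTQ_SUFFIXES)
    )
    return {
        "missing_dirs": sorted(expected - found),
        "unlisted_dirs": sorted(found - expected),
        "loose_fastq_files": loose,
    }
```

Calling `check_barcode_dirs("fastq_pass", "sample_sheet.csv")` should return three empty lists if every barcode in the sheet has its own folder and no reads sit loose at the top level.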

afazhra commented 2 days ago

Apologies for the confusion, @fgponce. To clarify, we have a total of six samples divided into two groups: treated and control. Each sample has been processed independently up to the DE Analysis step, with no issues observed in earlier stages. All FASTQ files concatenated without errors, and the sample sheet is configured correctly, ensuring proper sample separation between groups.

The entire workflow progressed smoothly up until the DE Analysis step, so we don’t suspect any problems related to sample handling or the sample sheet format.

fgponce commented 1 day ago

That's great news, @afazhra. In my case it caused problems because most of the count columns ended up empty, so DE couldn't run. I just noticed that your error mentions lots of zero entries, and that's how my mistake broke the pipeline. I also had a problem where one of my sample names was numeric. The code alters such names for the R steps by adding an x to the column header during processing, and removes it again afterwards. However, at a check step that makes sure the sample sheet and the counts file have the same column names, it errors out saying they are different. They aren't, hahaha, but it must be using the column headers from an earlier step (with the x) instead of the actual column headers in the files it's about to merge.
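
The added prefix matches what R does by default when it reads a table: column names are passed through `make.names()`, which replaces invalid characters with `.` and prepends `X` when a name does not start with a letter. Whether the workflow's R script relies on exactly this mechanism is an assumption here. The following Python sketch approximates that rule (ASCII only, reserved words ignored) so that sample names can be screened before a run; `r_safe_name` and `renamed_aliases` are hypothetical helpers.

```python
import re


def r_safe_name(name):
    """Approximate R's make.names() for ASCII names (reserved words ignored)."""
    # Characters other than letters, digits, '.' and '_' become '.'.
    cleaned = re.sub(r"[^A-Za-z0-9._]", ".", name)
    # A valid name starts with a letter, or with a dot not followed by a digit.
    if not re.match(r"^([A-Za-z]|\.(?![0-9]))", cleaned):
        cleaned = "X" + cleaned
    return cleaned


def renamed_aliases(aliases):
    """Return only the aliases that R would rewrite, mapped to the new name."""
    return {a: r_safe_name(a) for a in aliases if r_safe_name(a) != a}


# Two names from this thread plus two made-up problem cases.
print(renamed_aliases(["Wild_1", "H2_1", "123", "my-sample"]))
# → {'123': 'X123', 'my-sample': 'my.sample'}
```

An empty result means none of the names would be changed under this approximation.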

afazhra commented 1 day ago

Thank you for the insights, @fgponce! My sample sheet currently has the following format:

barcode,sample_id,alias,condition
barcode04,Wild_1,Wild_1,control
barcode05,Wild_2,Wild_2,control
barcode06,Wild_3,Wild_3,control
barcode01,H2_1,H2_1,treated
barcode02,H2_2,H2_2,treated
barcode03,H2_3,H2_3,treated

Each sample name contains an underscore, and none of them is purely numeric. Do you think this format could still cause any issues? Also, were there specific steps or settings that helped you prevent empty count columns from impacting the DE analysis? I just want to make sure I understand fully before re-running the pipeline.

epi2me-labs/wf-transcriptomes, issue #127