epi2me-labs / wf-transcriptomes

Other
64 stars 30 forks source link

Error in Differential expression analysis #67

Closed KatrinMoller closed 3 months ago

KatrinMoller commented 4 months ago

Operating System

Other Linux (please specify below)

Other Linux

No response

Workflow Version

v1.0.0

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

OUTPUT=~/output; ./nextflow run epi2me-labs/wf-transcriptomes \ -r prerelease \ -profile singularity \ --fastq /hpcdata/Mimir/shared/km100/all_libs \ --de_analysis \ --ref_genome Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz\ --ref_annotation Homo_sapiens.GRCh38.110.gtf.gz \ --ref_transcriptome Homo_sapiens.GRCh38.cdna.all.fa.gz\ --sample_sheet sample_sheet.csv \ --cdna_kit "SQK-PCS111" \ --isoform_table_nrows 10000 \ --out_dir outdir -w workspace_dir \ --threads 64

Workflow Execution - CLI Execution Profile

singularity

What happened?

I was trying out the whole pipeline, using the differential expression analysis, using 10 samples (5 ctrl, 5 mut). The run went on for 3 days since the data is large but finished most of the processing steps before deAnalysis. But at that point it gave an error and stopped the run. The strange thing is that I got some intermediate files (.bam, .bai, .fas) for some of the samples, but not all, folders with _pychopper_output for some and folders with _gffcompare for some. I am not managing to debug this myself, would appreciate if someone could have a look. I attached the .nextflow log file and the trace file, if it helps trace.txt

.nextflow.log

Relevant log output

Workflow execution completed unsuccessfully!
The exit status of the task that caused the workflow execution to fail was: 1.

The full error message was:

Error executing process > 'pipeline:differential_expression:deAnalysis'

Caused by:
  Process `pipeline:differential_expression:deAnalysis` terminated with an error exit status (1)

Command executed:

  mkdir merged
  mkdir de_analysis
  mv de_transcript_counts.tsv merged/all_counts.tsv
  mv sample_sheet.csv de_analysis/coldata.tsv
  de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
  1

Command output:
  Loading counts, conditions and parameters.
  Loading annotation database.
  Filtering counts using DRIMSeq.

Command error:
  Loading counts, conditions and parameters.
  Loading annotation database.
  Import genomic features from the file as a GRanges object ... OK
  Prepare the 'metadata' data frame ... OK
  Make the TxDb object ... OK
  Warning message:
  In .get_cds_IDX(mcols0$type, mcols0$phase) :
    The "phase" metadata column contains non-NA values for features of type
    stop_codon. This information was ignored.
  'select()' returned 1:many mapping between keys and columns
  Filtering counts using DRIMSeq.
  Error in .local(x, ...) : 
    min_samps_gene_expr >= 0 && min_samps_gene_expr <= ncol(x@counts) is not TRUE
  Calls: dmFilter -> dmFilter -> .local -> stopifnot
  Execution halted

Work dir:
  /hpcdata/Mimir/shared/km100/workspace_dir/68/312a7c64be4d80177172c2499883cf

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Application activity log entry

No response

sarahjeeeze commented 4 months ago

Oh no sorry to hear that! Would you be able to share the sample_sheet.csv ?

Also in the word dir folder work/68/312a7c64be4d80177172c2499883cf is there a file merged/all_counts.tsv and does it contain data?

On a side note, you may already be aware but because you have added a ref_transcriptome parameter it will be using that for DE_analysis. If you want it to use the transcriptome it generates from the workflow you need to leave this parameter out. There is a note in the log Reference Transcriptome provided will be used for differential expression. We are looking to make this clearer in a future release.

KatrinMoller commented 4 months ago

@sarahjeeeze Thanks for getting back to me. Here is the sample sheet. Also I looked in that particular folder, and there is indeed a file called all_counts.tsv that contains transcript data for the samples. sample_sheet.csv

I did not realise I should not put in the reference transcriptome, I somehow understood that I HAD to put that in if I wanted to use the DE_analysis part. Thank you for pointing that out, I will try that next time!

sarahjeeeze commented 4 months ago

Thanks, I will try to recreate your error and get back to you shortly. No sorry its only if you put 'precomputed' as the 'transcriptome_source' parameter that you have to include the 'ref_transcriptome' - I admit its not that clear in the docs, will improve in a future release.

sarahjeeeze commented 4 months ago

Hi, thanks for sharing your sample sheet, it helped us find a bug that occurs when you use 10 or more samples - Will let you know when the fix is ready.

phillip-richmond-alamya commented 4 months ago

I have a different issue but it's also with the de_analysis so I'm adding here.

Launching from nextflow tower.

Ref annotation: gencode.v44.annotation.gtf

The exit status of the task that caused the workflow execution to fail was: 1

Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
  Essential container in task exited

Command executed:

  mkdir merged
  mkdir de_analysis
  mv de_transcript_counts.tsv merged/all_counts.tsv
  mv Samplesheet_BCH_LRS_RNAseq.csv de_analysis/coldata.tsv
  de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
  1

Command output:
  Loading counts, conditions and parameters.
  Loading annotation database.

Command error:
  Import genomic features from the file as a GRanges object ... OK
  Prepare the 'metadata' data frame ... OK
  Loading counts, conditions and parameters.
  Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) : 
  Loading annotation database.
    values in 'transcripts$tx_strand' must be "+" or "-"
  Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
  Execution halted
  7:01PM INF shutdown filesystem start
  7:01PM INF shutdown filesystem done

Work dir:
  s3://alamya-btg-production-processing/nextflow_processing/78/2ad99ad4f03f08b1fbc9a36d262f68

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

This is my sample sheet

barcode,alias,condition
barcode01,A23-BCH-001-F-LRS-RNA,control
barcode02,A23-BCH-001-M-LRS-RNA,control
barcode03,A23-BCH-001-P-LRS-RNA,case
barcode04,A23-BCH-002-F-LRS-RNA,control
barcode05,A23-BCH-002-M-LRS-RNA,control
barcode06,A23-BCH-002-P-LRS-RNA,case
barcode07,A23-BCH-003-F-LRS-RNA,control
barcode08,A23-BCH-003-M-LRS-RNA,control
barcode09,A23-BCH-003-P-LRS-RNA,case
sarahjeeeze commented 4 months ago

Hey, that looks like an older version of the workflow - would you be able to update to the latest and let me know if the issue persists?

sarahjeeeze commented 4 months ago

Also would you be able to share the reference annotation or check it doesnt contain any non stranded annotations with

grep -P '\t\.\t\.\t' annotation.gtf
# remove them with 
grep -v -P '\t\.\t\.\t' annotation.gtf > tmp.gtf
mv tmp.gtf annotation.gtf
sarahjeeeze commented 3 months ago

Closing through lack of response.

KatrinMoller commented 3 months ago

Hi @sarahjeeeze I am curious if your last two comments were intended for me or for phillip-richmond-alamya? I thought they were directed to phillip, which is why I didn't respond, as your last reply to me was the issue is because of numbers of samples and you would let me know when you had a fix. I used the prerelease since v.1.1.0 wasn't available yet, but I can surely try that also!

eiwai81 commented 3 months ago

HI @sarahjeeeze I am also interested in this.

I have not been able to successfully run this workflow since I started trying from last year. I just tried with the latest release using this command:

nextflow run epi2me-labs/wf-transcriptomes \
--fastq $FASTQ_DIR \
--ref_genome $REF \
--ref_annotation $GTF \
--transcriptome-source reference-guided \
--minimap2_index_opts '-k 15' \
--pychopper_opts '-k PCS111' \
--out_dir $OUTPUT \
--sample_sheet $SAMPLESHEET \
--de_analysis \
--cdna_kit "SQK-PCS111" \
--isoform_table_nrows 10000 \
-profile singularity --threads 16 

Here is what my sample_sheet.csv file looks like:

barcode,sample_id,alias,condition
barcode01,b01,control_rep1,control
barcode02,b02,control_rep2,control
barcode03,b03,control_rep3,control
barcode04,b04,control_rep4,control
barcode05,b05,control_rep5,control
barcode06,b06,extra_late_rep1,extra_late
barcode07,b07,extra_late_rep2,extra_late
barcode08,b08,extra_late_rep3,extra_late
barcode09,b09,extra_late_rep4,extra_late
barcode10,b10,extra_late_rep5,extra_late

This time I got the error below:

Command output:
  Loading counts, conditions and parameters.
  Checking annotation file type.
  Annotation file type is gtf.
  Checking annotation file for presence of transcript_id versions.
  Annotation file transcript_ids include versions.
  Loading annotation database.

Command error:
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  Loading counts, conditions and parameters.
  Checking annotation file type.
  Annotation file type is gtf.
  Checking annotation file for presence of transcript_id versions.
  Annotation file transcript_ids include versions.
  Loading annotation database.
  Import genomic features from the file as a GRanges object ... OK
  Prepare the 'metadata' data frame ... OK
  Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) : 
    values in 'transcripts$tx_strand' must be "+" or "-"
  Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
  In addition: Warning messages:
  1: In for (i in seq_along(specs)) { :
    closing unused connection 3 (annotation.gtf)
  2: In for (i in seq_along(defined)) { :
    closing unused connection 4 (annotation.gtf)
  Execution halted

I would really appreciate any input in this.