genomic-medicine-sweden / tomte

A Nextflow pipeline for analysing expression and splicing in RNA-seq data from rare disease patients
MIT License

Crash in OUTRIDER in DROP in `DROP_CONFIG_RUN_AE` when running single sample #136

Closed Jakob37 closed 2 months ago

Jakob37 commented 2 months ago

Description of the bug

Running the Tomte pipeline on a single paired RNA-seq dataset, I run into a crash in the DROP modules.

I am running on the master branch.

Error pasted in the box below.

I have poked around a bit in the results. It is a bit tricky to debug, as it looks like a Snakemake pipeline inside the Nextflow pipeline. But I have started getting a grip on what is happening.

The crash seems to happen inside the filterCounts.R script in the DROP_CONFIG_RUN_AE process (in the aberrantExpression Snakemake workflow), at this call:

ods <- filterExpression(ods, gtfFile=txdb, filter=FALSE,
                        fpkmCutoff=fpkmCutoff, addExpressedGenes=TRUE)

Running this locally, I got the following stacktrace (and the same error message as before).

> ods <- filterExpression(ods, gtfFile=txdb, filter=FALSE,
+                         fpkmCutoff=fpkmCutoff, addExpressedGenes=TRUE)
Error in colSums(cutoffPassedMatrix) : 
  'x' must be an array of at least two dimensions
> traceback()
8: stop("'x' must be an array of at least two dimensions")
7: colSums(cutoffPassedMatrix)
6: data.table(sampleID = colnames(cutoffPassedMatrix), expressedGenes = colSums(cutoffPassedMatrix))
5: computeExpressedGenes(fpkm, cutoff = fpkmCutoff, percentile = percentile)
4: filterExp(object, fpkmCutoff = fpkmCutoff, percentile = percentile, 
       filterGenes = filterGenes, savefpkm = savefpkm, addExpressedGenes = addExpressedGenes)
3: .local(object, ...)
2: filterExpression(ods, gtfFile = txdb, filter = FALSE, fpkmCutoff = fpkmCutoff, 
       addExpressedGenes = TRUE)
1: filterExpression(ods, gtfFile = txdb, filter = FALSE, fpkmCutoff = fpkmCutoff, 
       addExpressedGenes = TRUE)

Looking inside it, the preceding code looks as follows:

cutoffPassedMatrix <- cutoffPassedMatrix[rowSums(cutoffPassedMatrix) > 0,]

# Make a data.table with the expressed genes
expGenesDt <- data.table(sampleID = colnames(cutoffPassedMatrix), 
        expressedGenes = colSums(cutoffPassedMatrix))

R has an unfortunate tendency to drop a matrix to a vector when a subset leaves only a single row or column.

> matrix(c(1,2,3,4), ncol=2)[, 1]
[1] 1 2
> matrix(c(1,2,3,4), ncol=2)[, c(1,2)]
     [,1] [,2]
[1,]    1    3
[2,]    2    4

This would crash colSums with the same error:

> colSums(matrix(c(1,2,3,4), ncol=2)[, c(1,2)])
[1] 3 7
> colSums(matrix(c(1,2,3,4), ncol=2)[, c(1)])
Error in colSums(matrix(c(1, 2, 3, 4), ncol = 2)[, c(1)]) :
  'x' must be an array of at least two dimensions
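
For reference, the usual way to avoid this in R is to pass `drop = FALSE` when subsetting. The sketch below uses a made-up one-sample matrix standing in for `cutoffPassedMatrix`; it only illustrates the base R behaviour and is not a tested patch for OUTRIDER.

```r
# Toy stand-in for cutoffPassedMatrix: 3 genes x 1 sample (values are made up).
m <- matrix(c(TRUE, FALSE, TRUE), ncol = 1,
            dimnames = list(c("geneA", "geneB", "geneC"), "sample_id"))

# Default subsetting drops the single column, leaving a plain vector,
# which is what makes colSums() fail.
dropped <- m[rowSums(m) > 0, ]
print(is.null(dim(dropped)))

# drop = FALSE keeps the result a matrix, so colSums() and colnames() still work.
kept <- m[rowSums(m) > 0, , drop = FALSE]
print(dim(kept))
print(colSums(kept))
```

With `drop = FALSE` the one-sample case keeps its column name, so the `data.table(sampleID = colnames(...), ...)` call shown earlier would receive a sample ID instead of `NULL`.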

So my hypothesis is that the filterExpression command isn't built to run on only a single sample. Should this be possible, or am I using the pipeline the wrong way? Let me know if I am off the mark here!

Command used and terminal output

The error:

ERROR ~ Error executing process > 'GENOMICMEDICINESWEDEN_TOMTE:TOMTE:ANALYSE_TRANSCRIPTS:DROP_CONFIG_RUN_AE (DROP_CONFIG_RUN_AE)'

Caused by:
  Process `GENOMICMEDICINESWEDEN_TOMTE:TOMTE:ANALYSE_TRANSCRIPTS:DROP_CONFIG_RUN_AE (DROP_CONFIG_RUN_AE)` terminated with an error exit status (1)

Command executed:

  TMPDIR=$PWD
  HOME=$PWD

  drop init

  /fs1/jakob/src/tomte/bin/drop_config.py \
      --genome_fasta GCA_000001405.15_GRCh38_no_alt_analysis_set_chr_masked.fna \
      --gtf gencode.v33.annotation.gtf \
      --drop_module AE \
      --genome_assembly GRCh38 \
      --drop_group_samples outrider \
      --padjcutoff 1 \
      --zscorecutoff 2.5 \
      --output config.yaml

  snakemake aberrantExpression --cores 56 --rerun-triggers mtime 

  cp output/processed_results/aberrant_expression/*/outrider/outrider/OUTRIDER_results_all.Rds .
  cp output/processed_data/preprocess/*/gene_name_mapping_*.tsv .

  cat <<-END_VERSIONS > versions.yml
  "GENOMICMEDICINESWEDEN_TOMTE:TOMTE:ANALYSE_TRANSCRIPTS:DROP_CONFIG_RUN_AE":
      drop_config: $($baseDir/bin/drop_config.py --version )
      drop: v$(echo $(drop --version) |  sed -n 's/drop, version //p')
  END_VERSIONS

Command exit status:
  1

Command output:
  TxDb object:
  # Db type: TxDb
  # Supporting package: GenomicFeatures
  # Data source: gencode.v33.annotation.gtf
  # Organism: NA
  # Taxonomy ID: NA
  # miRBase build ID: NA
  # Genome: NA
  # Nb of transcripts: 227912
  # Db created by: GenomicFeatures package from Bioconductor
  # Creation time: 2024-06-18 09:57:30 +0200 (Tue, 18 Jun 2024)
  # GenomicFeatures version at creation time: 1.52.1
  # RSQLite version at creation time: 2.3.1
  # DBSCHEMAVERSION: 1.2
  [1] TRUE
  [1] "2.351195 secs"
  [1] 36

Command error:
  rule AberrantExpression_pipeline_Counting_mergeCounts_R:
      input: output/processed_data/aberrant_expression/gencode.v33.annotation/counts/sample_id.Rds, output/processed_data/aberrant_expression/gencode.v33.annotation/count_ranges.Rds, output/processed_data/aberrant_expression/gencode.v33.annotation/params/merge/outrider_mergeParams.csv, Scripts/AberrantExpression/pipeline/Counting/mergeCounts.R
      output: output/processed_data/aberrant_expression/gencode.v33.annotation/outrider/outrider/total_counts.Rds
      log: .drop/tmp/AE/gencode.v33.annotation/outrider/merge.Rds
      jobid: 4
      reason: Missing output files: output/processed_data/aberrant_expression/gencode.v33.annotation/outrider/outrider/total_counts.Rds; Input files updated by another job: output/processed_data/aberrant_expression/gencode.v33.annotation/count_ranges.Rds, output/processed_data/aberrant_expression/gencode.v33.annotation/counts/sample_id.Rds
      wildcards: annotation=gencode.v33.annotation, dataset=outrider
      threads: 30
      resources: tmpdir=/mnt/beegfs/jakob_tmp/tomte/f2/6ed420caf4360c38e6e73a0ee4d1ef

  read 1 files
  [Tue Jun 18 09:58:47 2024]
  Finished job 4.
  7 of 16 steps (44%) done
  Select jobs to execute...

  [Tue Jun 18 09:58:47 2024]
  rule AberrantExpression_pipeline_Counting_filterCounts_R:
      input: output/processed_data/aberrant_expression/gencode.v33.annotation/outrider/outrider/total_counts.Rds, output/processed_data/preprocess/gencode.v33.annotation/txdb.db, Scripts/AberrantExpression/pipeline/Counting/filterCounts.R
      output: output/processed_results/aberrant_expression/gencode.v33.annotation/outrider/outrider/ods_unfitted.Rds
      log: .drop/tmp/AE/gencode.v33.annotation/outrider/filter.Rds
      jobid: 3
      reason: Missing output files: output/processed_results/aberrant_expression/gencode.v33.annotation/outrider/outrider/ods_unfitted.Rds; Input files updated by another job: output/processed_data/aberrant_expression/gencode.v33.annotation/outrider/outrider/total_counts.Rds, output/processed_data/preprocess/gencode.v33.annotation/txdb.db
      wildcards: annotation=gencode.v33.annotation, dataset=outrider
      resources: tmpdir=/mnt/beegfs/jakob_tmp/tomte/f2/6ed420caf4360c38e6e73a0ee4d1ef

  Warning messages:
  1: In DESeqDataSet(se, design = ~1, ...) :
    all genes have equal values for all samples. will not be able to perform differential analysis
  2: In OutriderDataSet(counts) :
    No sampleID was specified. We will generate a generic one.
  Error in colSums(cutoffPassedMatrix) : 
    'x' must be an array of at least two dimensions
  Calls: filterExpression ... filterExp -> computeExpressedGenes -> data.table -> colSums
  Execution halted
  [Tue Jun 18 09:59:06 2024]
  Error in rule AberrantExpression_pipeline_Counting_filterCounts_R:
      jobid: 3
      input: output/processed_data/aberrant_expression/gencode.v33.annotation/outrider/outrider/total_counts.Rds, output/processed_data/preprocess/gencode.v33.annotation/txdb.db, Scripts/AberrantExpression/pipeline/Counting/filterCounts.R
      output: output/processed_results/aberrant_expression/gencode.v33.annotation/outrider/outrider/ods_unfitted.Rds
      log: .drop/tmp/AE/gencode.v33.annotation/outrider/filter.Rds (check log file(s) for error details)
RuleException:
  CalledProcessError in file tmpt6cjtdrl, line 48:
  Command 'set -euo pipefail;  Rscript --vanilla .snakemake/scripts/tmp3ph21lgz.filterCounts.R' returned non-zero exit status 1.
    File "tmpt6cjtdrl", line 48, in __rule_AberrantExpression_pipeline_Counting_filterCounts_R
    File "/opt/conda/lib/python3.11/concurrent/futures/thread.py", line 58, in run
  Shutting down, this might take some time.
  Exiting because a job execution failed. Look above for error message
  Complete log: .snakemake/log/2024-06-18T095601.943298.snakemake.log


Lucpen commented 2 months ago

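Hi Jakob, I'm glad that you guys are trying to use tomte :) I will need some further information to help you debug:

* What version of the pipeline are you running?
* Is the material from the DROP database you are using the same as that from your sample?
* How many samples do you have in the database?
* Could you check if the count matrix is empty?
* Could you check if the genome build, paired-end and strand configurations of the sample are the ones indicated in the sample annotation?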

Lucpen commented 2 months ago

Could you also try executing this from the folder where AE ran? `snakemake --cores 1 sampleAnnotation`

Jakob37 commented 2 months ago

> Hi Jakob, I'm glad that you guys are trying to use tomte :) I will need some further information to help you debug:
>
> * What version of the pipeline are you running?

Looking at the CHANGELOG, the latest version with a filled in date is 2.0.1

Also to note, this is my first time poking around in DROP, so you might need to guide me further to get the information you want.

But, I'll do my best attempt first.

> * Is the material from the DROP database you are using the same as that from your sample?

The txdb data used in the script at the top is indeed the same as the one used in the Nextflow run, copied directly from the work folder. Let me know if I misunderstood your question.

> * How many samples do you have in the database?
> * Could you check if the count matrix is empty?

Hmm. There is one column in the counts variable, if that answers your question (not sure whether I can see the number of samples in the txdb).

> counts
class: RangedSummarizedExperiment 
dim: 60662 1 
metadata(0):
assays(1): counts
rownames(60662): ENSG00000000003.15 ENSG00000000005.6 ... ENSG00000288587.1 ENSG00000288588.1
rowData names(0):
colnames(1): sample_id
colData names(14): RNA_ID RNA_BAM_FILE ... GENE_ANNOTATION GENOME

For reference here is also the txdb:

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gencode.v33.annotation.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 227912
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2024-06-18 09:57:30 +0200 (Tue, 18 Jun 2024)
# GenomicFeatures version at creation time: 1.52.1
# RSQLite version at creation time: 2.3.1
# DBSCHEMAVERSION: 1.2

> * Could you check if the genome build, paired-end and strand configurations of the sample are the ones indicated in the sample annotation?

Can you specify what sample annotation you point to here, and how to access it?

Jakob37 commented 2 months ago

> Could you also try executing this from the folder where AE ran? `snakemake --cores 1 sampleAnnotation`

Hmm. In the same working directory, I get the following error.

I am guessing this is a basic mistake on my part somewhere, due to not being used to Snakemake.

$ singularity exec /<path>/docker.io-clinicalgenomics-drop-1.3.3.img snakemake --cores 1 sampleAnnotation
FileExistsError in file /<work>/f2/6ed420caf4360c38e6e73a0ee4d1ef/Snakefile, line 12:
sample_annotation.tsv
  File "/<work>/f2/6ed420caf4360c38e6e73a0ee4d1ef/Snakefile", line 12, in <module>
  File "/opt/conda/lib/python3.11/site-packages/drop/config/DropConfig.py", line 31, in __init__
  File "/opt/conda/lib/python3.11/site-packages/drop/config/DropConfig.py", line 127, in setDefaults
  File "/opt/conda/lib/python3.11/site-packages/drop/utils.py", line 44, in checkKeys

This error persists even after moving the softlinked sample_annotation.tsv, which resides in the same folder as the Snakefile, and after renaming the file it links to. Not sure where it is finding it.

I'll have to leave this for now, but will come back to Tomte soon.

Lucpen commented 2 months ago

Hi again, yes, it always takes some time to get used to :)

Jakob37 commented 2 months ago

This error is still there for me when using a proper DROP reference dataset. I'll dig in a bit more and see if I can learn more about where the error is coming from.

Jakob37 commented 2 months ago

Running a set of five samples, it did not crash in the same place (I expect it has to do with me running only one sample). But it crashes later instead, without a clear error status.

I see that version 1.3.3 of DROP is used. Would it be possible to update this to the latest version, 1.4.0, before I spend time debugging an outdated version?

Also, have you guys managed to run real samples through DROP in Tomte? I would be interested in seeing how your input files look, if you are able to share.

Jakob37 commented 2 months ago

Here is the new error, still from the aberrant expression module:

Command output:
  # miRBase build ID: NA
  # Genome: NA
  # Nb of transcripts: 227912
  # Db created by: GenomicFeatures package from Bioconductor
  # Creation time: 2024-06-25 16:00:08 +0200 (Tue, 25 Jun 2024)
  # GenomicFeatures version at creation time: 1.52.1
  # RSQLite version at creation time: 2.3.1
  # DBSCHEMAVERSION: 1.2
  [1] TRUE
  [1] "38.86742 mins"
  [1] 61754677
  [1] "39.26456 mins"
  [1] 83414094
  [1] "1.135816 hours"
  [1] 140632607
  [1] "1.227175 hours"
  [1] 107577339
  [1] "1.515051 hours"
  [1] 173628357
  [1] TRUE
  1/13
  2/13 [unnamed-chunk-1]
  3/13
  4/13 [unnamed-chunk-2]
  5/13
  6/13 [unnamed-chunk-3]
  7/13 [meanCounts]     
  8/13
  9/13 [unnamed-chunk-4]
  10/13 [expressedGenes] 
  11/13
  12/13 [unnamed-chunk-5]
  [1] "Tue Jun 25 17:40:12 2024: Initial PCA loss: 7.69369834812004"
  [1] "Tue Jun 25 17:40:15 2024: Iteration: 1 loss: 6.64874291425983"
  [1] "Tue Jun 25 17:40:16 2024: Iteration: 2 loss: 6.62153902339786"
  [1] "Tue Jun 25 17:40:17 2024: Iteration: 3 loss: 6.60042599653787"
  [1] "Tue Jun 25 17:40:19 2024: Iteration: 4 loss: 6.58522274600442"
  [1] "Tue Jun 25 17:40:20 2024: Iteration: 5 loss: 6.57475038989527"
  [1] "Tue Jun 25 17:40:21 2024: Iteration: 6 loss: 6.56773156058167"
  [1] "Tue Jun 25 17:40:23 2024: Iteration: 7 loss: 6.56299199071875"
  [1] "Tue Jun 25 17:40:25 2024: Iteration: 8 loss: 6.55979920912951"
  [1] "Tue Jun 25 17:40:26 2024: Iteration: 9 loss: 6.55764272057436"
  [1] "Tue Jun 25 17:40:28 2024: Iteration: 10 loss: 6.55614316988807"
  [1] "Tue Jun 25 17:40:29 2024: Iteration: 11 loss: 6.5550629135103"
  [1] "Tue Jun 25 17:40:30 2024: Iteration: 12 loss: 6.55443956741798"
  [1] "Tue Jun 25 17:40:31 2024: Iteration: 13 loss: 6.55400965973263"
  [1] "Tue Jun 25 17:40:33 2024: Iteration: 14 loss: 6.5537021669489"
  [1] "Tue Jun 25 17:40:34 2024: Iteration: 15 loss: 6.55350416866071"
  Time difference of 21.12652 secs
  [1] "Tue Jun 25 17:40:34 2024: 15 Final nb-AE loss: 6.55350416866071"

Command error:
  Tue Jun 25 17:38:51 2024: Controlling for confounders ...
  Tue Jun 25 17:38:51 2024: Using the autoencoder implementation for controlling.
  [1] "Tue Jun 25 17:38:54 2024: Initial PCA loss: 7.63807922394072"
  [1] "Tue Jun 25 17:39:05 2024: Iteration: 1 loss: 5.19196309491786"
  [1] "Tue Jun 25 17:39:09 2024: Iteration: 2 loss: 5.19196303307466"
  Tue Jun 25 17:39:09 2024: the AE correction converged with:5.19196303307466
  Time difference of 8.645406 secs
  [1] "Tue Jun 25 17:39:09 2024: 2 Final nb-AE loss: 5.19196303307466"
  Tue Jun 25 17:39:10 2024: Used the autoencoder implementation for controlling.
  Tue Jun 25 17:39:10 2024: P-value calculation ...
  Tue Jun 25 17:39:10 2024: Zscore calculation ...
  [1] "Evaluation loss: 0.0145868508237345 for q=5"

  Tue Jun 25 17:38:51 2024: SizeFactor estimation ...
  Tue Jun 25 17:38:52 2024: Controlling for confounders ...
  Tue Jun 25 17:38:52 2024: Using the autoencoder implementation for controlling.
  [1] "Tue Jun 25 17:38:54 2024: Initial PCA loss: 7.7262947121687"
  [1] "Tue Jun 25 17:39:06 2024: Iteration: 1 loss: 6.70590027912551"
  [1] "Tue Jun 25 17:39:11 2024: Iteration: 2 loss: 6.68004689941878"
  [1] "Tue Jun 25 17:39:15 2024: Iteration: 3 loss: 6.66278146893879"
  [1] "Tue Jun 25 17:39:20 2024: Iteration: 4 loss: 6.64749462141839"
  [1] "Tue Jun 25 17:39:25 2024: Iteration: 5 loss: 6.63635757021486"
  [1] "Tue Jun 25 17:39:29 2024: Iteration: 6 loss: 6.62818792786937"
  [1] "Tue Jun 25 17:39:34 2024: Iteration: 7 loss: 6.62239444944122"
  [1] "Tue Jun 25 17:39:39 2024: Iteration: 8 loss: 6.61818368093265"
  [1] "Tue Jun 25 17:39:44 2024: Iteration: 9 loss: 6.61509587110696"
  [1] "Tue Jun 25 17:39:49 2024: Iteration: 10 loss: 6.61291339770892"
  [1] "Tue Jun 25 17:39:53 2024: Iteration: 11 loss: 6.611330508688"
  [1] "Tue Jun 25 17:39:56 2024: Iteration: 12 loss: 6.6100949529569"
  [1] "Tue Jun 25 17:40:00 2024: Iteration: 13 loss: 6.60930078454305"
  [1] "Tue Jun 25 17:40:03 2024: Iteration: 14 loss: 6.60877931537462"
  [1] "Tue Jun 25 17:40:07 2024: Iteration: 15 loss: 6.60838510676507"
  Time difference of 1.108333 mins
  [1] "Tue Jun 25 17:40:07 2024: 15 Final nb-AE loss: 6.60838510676507"
  Tue Jun 25 17:40:07 2024: Used the autoencoder implementation for controlling.
  Tue Jun 25 17:40:07 2024: P-value calculation ...
  Tue Jun 25 17:40:08 2024: Zscore calculation ...
  [1] "Evaluation loss: 0.0270543485425258 for q=2"

  Tue Jun 25 17:40:08 2024: SizeFactor estimation ...
  Tue Jun 25 17:40:08 2024: Controlling for confounders ...
  Tue Jun 25 17:40:08 2024: Using the autoencoder implementation for controlling.
  Tue Jun 25 17:40:34 2024: Used the autoencoder implementation for controlling.
  outrider fitting finished
  [Tue Jun 25 17:40:35 2024]
  Finished job 21.
  17 of 24 steps (71%) done
  Shutting down, this might take some time.
  Exiting because a job execution failed. Look above for error message
  Complete log: .snakemake/log/2024-06-25T155815.901121.snakemake.log

Work dir:
  /<work>/tomte/a1/e3ca6945218579ebc8ac0a6ecc61ee

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

Jakob37 commented 2 months ago

My sample_annotation.tsv as seen in the work dir looks like the following (with many more control samples).

It might very well be that I have gotten something wrong here. Pointers are welcome.

RNA_ID  RNA_BAM_FILE    DNA_VCF_FILE    DNA_ID  DROP_GROUP      PAIRED_END      COUNT_MODE      COUNT_OVERLAPS  SPLICE_COUNTS_DIR       STRAND  HPO_TERMS       GENE_COUNTS_FILE        GENE_ANNOTATION GENOME  INDIVIDUAL_ID   TISSUE  SEX     STRAND_SPECIFIC
sample_1        sample_1.Aligned.sortedByCoord.out.bam  NA      NA      outrider,fraser True    IntersectionStrict      True    NA      reverse NA      NA      NA      NA      NA      NA      NA      NA
sample_2        sample_2.Aligned.sortedByCoord.out.bam  NA      NA      outrider,fraser True    IntersectionStrict      True    NA      reverse NA      NA      NA      NA      NA      NA      NA      NA
sample_5        sample_5.Aligned.sortedByCoord.out.bam  NA      NA      outrider,fraser True    IntersectionStrict      True    NA      reverse NA      NA      NA      NA      NA      NA      NA      NA
sample_4        sample_4.Aligned.sortedByCoord.out.bam  NA      NA      outrider,fraser True    IntersectionStrict      True    NA      reverse NA      NA      NA      NA      NA      NA      NA      NA
sample_3        sample_3.Aligned.sortedByCoord.out.bam  NA      NA      outrider,fraser True    IntersectionStrict      True    NA      reverse NA      NA      NA      NA      NA      NA      NA      NA
GTEX-111YS-0006-SM-5NQBE        NA      NA      NA      outrider,fraser True    IntersectionStrict      True    Whole_Blood--GRCh38--gencode29  no      NA      geneCounts.tsv.gz       NA      NA      GTEX-111YS      Whole_Blood     Male    False
GTEX-1122O-0005-SM-5O99J        NA      NA      NA      outrider,fraser True    IntersectionStrict      True    Whole_Blood--GRCh38--gencode29  no      NA      geneCounts.tsv.gz       NA      NA      GTEX-1122O      Whole_Blood     Female  False
GTEX-1128S-0005-SM-5P9HI        NA      NA      NA      outrider,fraser True    IntersectionStrict      True    Whole_Blood--GRCh38--gencode29  no      NA      geneCounts.tsv.gz       NA      NA      GTEX-1128S      Whole_Blood     Female  False
GTEX-113IC-0006-SM-5NQ9C        NA      NA      NA      outrider,fraser True    IntersectionStrict      True    Whole_Blood--GRCh38--gencode29  no      NA      geneCounts.tsv.gz       NA      NA      GTEX-113IC      Whole_Blood     Male    False
GTEX-113JC-0006-SM-5O997        NA      NA      NA      outrider,fraser True    IntersectionStrict      True    Whole_Blood--GRCh38--gencode29  no      NA      geneCounts.tsv.gz       NA      NA      GTEX-113JC      Whole_Blood     Female  False
...700+ more samples

Lucpen commented 2 months ago

Hi Jakob, did it run through AS but fail in AE or did it fail in both? Here is the documentation for running DROP within Tomte. At first sight, remove the last column "STRAND_SPECIFIC" and see if it goes any better.

Yes, we have now run quite a few samples through the pipeline with Tomte. Here are the top 5 lines of the annotation file sampleAnnotation.txt we use; samples 1-5 are from our local database (these are just the first lines). The samples run through Tomte are not in the annotation file because they will be added automatically by the pipeline.

We want to wait for a bit before changing to 1.4.0, as there might be patches in the coming weeks.

Jakob37 commented 2 months ago

> Hi Jakob, did it run through AS but fail in AE or did it fail in both?

Not sure yet, I had a timeout after 16h for AS and will try to rerun. But it seems to get pretty far into AS at least.

> Here is the documentation for running DROP within Tomte. At first sight, remove the last column "STRAND_SPECIFIC" and see if it goes any better.

Thank you! Sorry about not reading this carefully before starting issues 🙈 I'll try removing the STRAND_SPECIFIC column (this one is present in the originally downloaded data).

> Yes, we have now run quite a few samples through the pipeline with Tomte. Here are the top 5 lines of the annotation file sampleAnnotation.txt we use; samples 1-5 are from our local database (these are just the first lines). The samples run through Tomte are not in the annotation file because they will be added automatically by the pipeline.

OK, great to know that you have managed to get a bunch of samples through. Then it sounds like some basic input mistakes on my side.

> We want to wait for a bit before changing to 1.4.0, as there might be patches in the coming weeks.

OK I understand!

Jakob37 commented 2 months ago

Another question. When running, do you typically run several of your samples together, or have you also tried running single samples (i.e. one sample together with the group of control samples)?

Lucpen commented 2 months ago

We have, it should work

Jakob37 commented 2 months ago

OK, I think I figured it out. I didn't correctly assign annotation files to the control samples. So when running, it retrieved only samples with the same annotation from existing samples, yielding only my actual sample. It started processing this one, but crashed in the steps above as it only had one sample to work with.

It would be helpful for DROP to throw an early error if it detects that there is only one sample ... But anyway, I got a good tour of the documentation and source code now. Thank you for the help along the way!
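
A minimal sketch of what such an early check could look like, in base R. The helper names `count_samples_per_group` and `check_group_sizes` and the `min_samples = 2` threshold are my own assumptions for illustration, not anything DROP or Tomte provides, and the check would have to run on the rows that DROP actually selects for the analysis.

```r
# Hypothetical pre-flight helpers; not part of DROP or Tomte.
count_samples_per_group <- function(annotation) {
  # DROP_GROUP can list several comma-separated groups per sample.
  groups <- strsplit(as.character(annotation$DROP_GROUP), ",", fixed = TRUE)
  table(trimws(unlist(groups)))
}

check_group_sizes <- function(annotation, min_samples = 2) {
  counts <- count_samples_per_group(annotation)
  too_small <- names(counts)[counts < min_samples]
  if (length(too_small) > 0) {
    stop("Too few samples in DROP group(s): ", paste(too_small, collapse = ", "))
  }
  invisible(counts)
}

# Toy annotation (made-up row): a single sample, as in the failing run.
single <- data.frame(RNA_ID = "sample_1", DROP_GROUP = "outrider,fraser")
result <- tryCatch(check_group_sizes(single),
                   error = function(e) conditionMessage(e))
print(result)
```

On a real run the data frame could come from `read.delim("sample_annotation.tsv")`, filtered down to the samples that end up in the analysis.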

The AE module is still running after 16h though. Not sure if normal, but I think at least this issue is resolved.

Lucpen commented 2 months ago

Hi Jakob, good to hear that it is working :). If you are running whole blood, the long runtime is likely due to high haemoglobin content. I would advise you to remove it; you can do so by providing tomte with a BED file containing all haemoglobin positions via `--subsample_bed`.