gagneurlab / drop

Pipeline to find aberrant events in RNA-Seq data, useful for diagnosis of rare disorders
MIT License

Error in: AberrantExpression_pipeline_Counting_filterCounts_R #270

Closed · SathyaDarmalinggam closed this issue 2 years ago

SathyaDarmalinggam commented 2 years ago

Hi there, I am trying to learn how to implement and run DROP by reproducing the results from the 100-sample Geuvadis dataset. However, I keep running into an error in one of the steps of the Aberrant Expression module. I have read previous issues suggesting this could be a memory issue and have therefore tried running it with various numbers of cores, but to no avail. Here is some additional information:

DROP version: 1.1.0, installed as a Singularity container.

Command run: snakemake --cores 5 --verbose --keep-going

Error message:

Error in rule AberrantExpression_pipeline_Counting_filterCounts_R:
    jobid: 29
    output: /DROP_Geuvadis/Output/processed_results/aberrant_expression/v38/outrider/all/ods_unfitted.Rds
    log: /DROP_Geuvadis/.drop/tmp/AE/v38/all/filter.Rds (check log file(s) for error message)

Full Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/snakemake/executors/__init__.py", line 593, in _callback
    raise ex
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/dist-packages/snakemake/executors/__init__.py", line 579, in cached_or_run
    run_func(*args)
  File "/usr/local/lib/python3.8/dist-packages/snakemake/executors/__init__.py", line 2460, in run_wrapper
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/snakemake/executors/__init__.py", line 2417, in run_wrapper
    run(
  File "/tmp/tmp_qjb73r3", line 130, in __rule_AberrantExpression_pipeline_Counting_filterCounts_R
    "/DROP_Geuvadis/Output/html/Scripts_AberrantExpression_pipeline_aberrant_expression_readme.html",
  File "/usr/local/lib/python3.8/dist-packages/snakemake/script.py", line 1369, in script
    executor.evaluate()
  File "/usr/local/lib/python3.8/dist-packages/snakemake/script.py", line 381, in evaluate
    self.execute_script(fd.name, edit=edit)
  File "/usr/local/lib/python3.8/dist-packages/snakemake/script.py", line 716, in execute_script
    self._execute_cmd("Rscript --vanilla {fname:q}", fname=fname)
  File "/usr/local/lib/python3.8/dist-packages/snakemake/script.py", line 414, in _execute_cmd
    return shell(
  File "/usr/local/lib/python3.8/dist-packages/snakemake/shell.py", line 265, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  Rscript --vanilla /DROP_Geuvadis/.snakemake/scripts/tmpajccybq0.filterCounts.R' returned non-zero exit status 1.

I have also attached the log file in case it helps. As I am a beginner with bioinformatics and debugging, any help is much appreciated. 2021-11-04T182704.765588.snakemake.log

vyepez88 commented 2 years ago

Hi, from what I can see, the small group ran through, but the all group did not. Besides providing enough cores, sufficient memory is also needed. Make sure that the cluster where you're running it has enough memory (at least 100 GB).

SathyaDarmalinggam commented 2 years ago

Hi vyepez88, thanks for looking into the issue so promptly. I'm running this pipeline on a single machine that has > 1 TB of memory, so total memory shouldn't be a problem. Is there a way to allocate more memory to the job, perhaps somewhere in the config files?

nickhsmith commented 2 years ago

Just to be clear (for the sake of all of our sanity): when we talk about memory, we mean RAM as opposed to storage space. The fact that the process worked for small but not for all makes me think it is a memory issue.

So just for clarity, you can run the following commands from the command line (assuming you're on a Linux operating system):

free -m -h
lscpu

These show how much free memory you have available and how many CPUs you have.

If that is indeed 1 TB of RAM and it's still failing, we will try to figure out why the counting step seems to fail. I hope this helps!

SathyaDarmalinggam commented 2 years ago

Hi @nickhsmith, I ran your commands and these are the outputs:

free -m -h
              total        used        free      shared  buff/cache   available
Mem:           3.0T        141G        1.7T         69M        1.1T        2.8T

lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              176

Hope this helps.

nickhsmith commented 2 years ago

Thanks! That should be more than enough memory!

One thing that would be useful for debugging would be to try to run this job independently of the snakemake pipeline.
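
In DROP, a minimal sketch of that looks like the following, run in R from the project directory (the .Rds path is the "log" file reported in the error above, and the script path matches the RScript input of the rule):

```r
# Load the snakemake object that DROP saves for this rule under .drop/tmp,
# then run the rule's script directly so the real error surfaces in the console.
snakemake <- readRDS("/DROP_Geuvadis/.drop/tmp/AE/v38/all/filter.Rds")
snakemake@input   # inspect the inputs the rule receives
source("/DROP_Geuvadis/Scripts/AberrantExpression/pipeline/Counting/filterCounts.R")
```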

Thanks for your patience and quick responses.

SathyaDarmalinggam commented 2 years ago

Hi @nickhsmith, thanks for your prompt replies and help! Here are the outputs, as requested:

snakemake@input
[[1]]
[1] "DROP_Geuvadis/Output/processed_data/aberrant_expression/v38/outrider/all/total_counts.Rds"

[[2]]
[1] "/DROP_Geuvadis/Output/processed_data/preprocess/v38/txdb.db"

[[3]]
[1] "Scripts/AberrantExpression/pipeline/Counting/filterCounts.R"

$counts
[1] "/DROP_Geuvadis/Output/processed_data/aberrant_expression/v38/outrider/all/total_counts.Rds"

$txdb
[1] "/DROP_Geuvadis/Output/processed_data/preprocess/v38/txdb.db"

$RScript
[1] "Scripts/AberrantExpression/pipeline/Counting/filterCounts.R"
source("/DROP_Geuvadis/Scripts/AberrantExpression/pipeline/Counting/filterCounts.R")
Error in .local(object, ...) :
  Every gene contains at least one zero, cannot compute log geometric means
In addition: Warning message:
In OutriderDataSet(counts) :
  No sampleID was specified. We will generate a generic one.

vyepez88 commented 2 years ago

Hi @SathyaDarmalinggam, we've now reached the error:

  Every gene contains at least one zero, cannot compute log geometric means

This should not be the case for the GEUVADIS samples. First, check that the columns relevant for counting in the sample annotation have the following values (a quick check is sketched after the table):

PAIRED_END    COUNT_MODE            COUNT_OVERLAPS    STRAND
TRUE          IntersectionStrict    TRUE              no
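
One way to check this in R (a sketch; the sample annotation file name below is a placeholder, use whichever file your DROP config points to):

```r
# Show the distinct combinations of the counting-related columns in the
# sample annotation; every row should read TRUE / IntersectionStrict / TRUE / no.
library(data.table)

sa <- fread("/DROP_Geuvadis/sample_annotation.tsv")   # placeholder path
unique(sa[, .(PAIRED_END, COUNT_MODE, COUNT_OVERLAPS, STRAND)])
```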

Maybe there was a problem with the counting of one sample. Please execute the following:

library(SummarizedExperiment)   # provides assay()
counts <- readRDS("/lustre/groups/itg/teams/kim-hellmuth/users/sathya.darmalinggam/DROP_Geuvadis/Output/processed_data/aberrant_expression/v38/outrider/all/total_counts.Rds")
plot(sort(colSums(assay(counts))), ylab = 'Total Counts', xlab = 'Sorted samples')

It should look like the plot below. If not, try to find the sample with low counts (the columns of the count matrix correspond to each sample). If you find the culprit, delete its counts (processed_data/aberrant_expression/{annotation}/counts/{sample}.Rds) and rerun the pipeline.

[attached plot: sorted total counts per sample]

SathyaDarmalinggam commented 2 years ago

Hi @vyepez88, thanks for the detailed explanation. While doing these checks, I realised I had made a rookie mistake and used the wrong genome build. So I re-ran the pipeline on the Geuvadis dataset using hg19, but obtained similar errors to the ones above. Here is what the count plot looks like:

[attached plot: sorted total counts per sample, with several samples at or near zero]

It looks like there is more than one sample with low/zero counts. Should I be removing all of them?

To be more precise, I downloaded the BAM files from https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/files/processed/?ref=E-GEUV-1. I am using the hg19 genome and the gencode.v19.annotation.gtf annotation.

vyepez88 commented 2 years ago

It seems to be a couple of samples with 0 total counts. For the analysis, I would remove all samples with fewer than 10M total counts. We also downloaded the samples from there, but possibly a different set of 100. We used this script: https://github.com/gagneurlab/RNAseq-ASHG19/blob/master/src/prepareData/download_data.R. Try simply removing those 20 samples to see whether DROP runs through.
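
A sketch of how to spot them (the total_counts.Rds path follows the pattern shown earlier; the annotation folder, assumed here to be v19 for the hg19 run, depends on your config):

```r
# List the samples whose total counts are below 10 million. Remove these from the
# sample annotation and delete their per-sample count files under
# processed_data/aberrant_expression/{annotation}/counts/{sample}.Rds before rerunning.
suppressPackageStartupMessages(library(SummarizedExperiment))

counts <- readRDS("/DROP_Geuvadis/Output/processed_data/aberrant_expression/v19/outrider/all/total_counts.Rds")
totals <- colSums(assay(counts))
sort(totals[totals < 10e6])   # named vector: sample IDs and their total counts
```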

SathyaDarmalinggam commented 2 years ago

Hi @vyepez88, I was able to get the aberrant expression module running by removing the low-count samples as suggested. However, I am now facing errors with the aberrant splicing module. I used the same debugging method as described above and got this:

> snakemake <- readRDS("<path>/.drop/tmp/AS/small--v37/FRASER_summary.Rds")
> snakemake@input
[[1]]
[1] "<path>/processed_results/aberrant_splicing/datasets/savedObjects/small--v37/fds-object.RDS"

[[2]]
[1] "<path>/processed_results/aberrant_splicing/results/v37/fraser/small/results.tsv"

[[3]]
[1] "Scripts/AberrantSplicing/pipeline/FRASER/Summary.R"

$fdsin
[1] "<path>/processed_results/aberrant_splicing/datasets/savedObjects/small--v37/fds-object.RDS"

$results
[1] "<path>/processed_results/aberrant_splicing/results/v37/fraser/small/results.tsv"

$RScript
[1] "Scripts/AberrantSplicing/pipeline/FRASER/Summary.R"

> source("<path>/Scripts/AberrantSplicing/pipeline/FRASER/Summary.R")
Load packages
Loading assay: rawCountsJ
Loading assay: psi5
Loading assay: psi3
Loading assay: rawOtherCounts_psi5
Loading assay: rawOtherCounts_psi3
Loading assay: delta_psi5
Loading assay: delta_psi3
Loading assay: predictedMeans_psi5
Loading assay: predictedMeans_psi3
Loading assay: zScores_psi5
Loading assay: pvaluesBetaBinomial_junction_psi5
Loading assay: pvaluesBetaBinomial_psi5
Loading assay: padjBetaBinomial_psi5
Loading assay: zScores_psi3
Loading assay: pvaluesBetaBinomial_junction_psi3
Loading assay: pvaluesBetaBinomial_psi3
Loading assay: padjBetaBinomial_psi3
Loading assay: rawCountsSS
Loading assay: theta
Loading assay: rawOtherCounts_theta
Loading assay: delta_theta
Loading assay: predictedMeans_theta
Loading assay: zScores_theta
Loading assay: pvaluesBetaBinomial_junction_theta
Loading assay: pvaluesBetaBinomial_theta
Loading assay: padjBetaBinomial_theta
Error in fread(snakemake@input$results) :
  Input is either empty, fully whitespace, or skip has been set after the last non-whitespace.

Any help/advice is much appreciated.

vyepez88 commented 2 years ago

Does this file exist: <path>/processed_results/aberrant_splicing/results/v37/fraser/small/results.tsv? If not, you can regenerate it by executing snakemake <path>/processed_results/aberrant_splicing/results/v37/fraser/small/results.tsv

SathyaDarmalinggam commented 2 years ago

Hi @vyepez88, I regenerated the file using your command but got this warning: "Warning message: The aberrant splicing pipeline gave 0 results for the small dataset." I proceeded to run the module again anyway, and it throws the same error as above.

I've attached the log file from executing this: snakemake <path>/processed_results/aberrant_splicing/results/v37/fraser/small/results.tsv 2021-11-15T232136.144857.snakemake.log

vyepez88 commented 2 years ago

This means that there were no significant results for the small group using the cutoffs that you defined in the config file. If you change the padjCutoff to 1, the pipeline should run through.
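
If you want to confirm this before editing the config, a rough sketch in R (the loadFraserDataSet()/results() argument names are assumptions based on the FRASER version used by DROP 1.1; the paths come from the FRASER_summary.Rds inputs above):

```r
# Reload the fitted FRASER dataset for the small group and pull results with the
# FDR cutoff relaxed to 1, mirroring the padjCutoff change suggested for the config.
# More rows than with the default cutoffs indicates the fit itself is fine and the
# empty results.tsv simply reflects strict filtering.
library(FRASER)

fds <- loadFraserDataSet(
    dir  = "<path>/processed_results/aberrant_splicing/datasets",
    name = "small--v37"
)
res <- results(fds, padjCutoff = 1)
length(res)
```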

SathyaDarmalinggam commented 2 years ago

Hi @vyepez88 and @nickhsmith, the pipeline runs through now. Thanks for your support!