cnobles / iGUIDE

Bioinformatic pipeline for identifying dsDNA breaks by marker based incorporation, such as breaks induced by designer nucleases like Cas9.
https://iguide.readthedocs.io/en/latest/
GNU General Public License v3.0
20 stars 9 forks source link

Job execution fails at iguide_evaluation rule because no reads in incorp_sites.rds? #83

Closed ShanSabri closed 3 years ago

ShanSabri commented 3 years ago

I'm running into an issue where my iGUIDE processing job fails at the very last step. I've run into a similar issue before and solved by removing samples with no reads assigned (see this issue). But, this processing run does have reads assigned to each included library, see demux log below:

  sampleName      bc1              bc2 read_counts
1    IL26296 TAAGGCGA TAGATCGCNNWNNWNN      267763
2    IL26297 CGTACTAG CTCTCTATNNWNNWNN      321768
3    IL26298 AGGCAGAA TATCCTCTNNWNNWNN      260155
4    IL26299 TCCTGAGC AGAGTAGANNWNNWNN      303865

Ambiguous reads: 0
Degenerate reads: 316
Unassigned reads: 212137
Reads to be written to files: 1366004

Error in Snakemake log:

[Fri Jul  9 00:11:19 2021]
rule iguide_evaluation:
    input: /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/incorp_sites.IP280_revcomp_bc2.rds
    output: /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/iguide.eval.IP280_revcomp_bc2.rds, /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/IP280_revcomp_bc2.eval.stat
    log: /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/logs/IP280_revcomp_bc2.eval.log
    jobid: 46
    resources: mem_mb=48000

Job counts:
        count   jobs
        1       iguide_evaluation
        1
[Fri Jul  9 00:11:27 2021]
Error in rule iguide_evaluation:
    jobid: 0
    output: /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/iguide.eval.IP280_revcomp_bc2.rds, /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/IP280_revcomp_bc2.eval.stat
    log: /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/logs/IP280_revcomp_bc2.eval.log (check log file(s) for error message)

RuleException:
CalledProcessError in line 107 of /home/ubuntu/iGUIDE/rules/process.rules:
Command 'set -euo pipefail;  Rscript /home/ubuntu/iGUIDE/tools/rscripts/evaluate_incorp_data.R /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/config.yml -o /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/iguide.eval.IP280_revcomp_bc2.rds -s /home/ubuntu/iGUIDE_runs/IP280/sampleInfo/supp.csv --stat /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/IP280_revcomp_bc2.eval.stat > /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/logs/IP280_revcomp_bc2.eval.log 2>&1' returned non-zero exit status 1.
  File "/home/ubuntu/iGUIDE/rules/process.rules", line 107, in __rule_iguide_evaluation
  File "/home/ubuntu/.Anaconda3/envs/iguide/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ubuntu/iGUIDE/.snakemake/log/2021-07-09T000025.010458.snakemake.log

When I read in /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/incorp_sites.IP280_revcomp_bc2.rds I find that the reads slots is empty:

...
...
$config$pileUpMin
[1] 3

$config$recoverMultihits
[1] FALSE

$config$suppFile
[1] TRUE

$config$tables
[1] FALSE

$config$figures
[1] TRUE

$config$reportData
[1] FALSE

$config$infoGraphic
[1] TRUE

$config$signature
[1] "Shan Sabri"

$reads
 [1] seqnames   start      end        width      strand     sampleName
 [7] ID         type       contrib    umitag
<0 rows> (or 0-length row.names)

I wonder if all these reads are somehow being filtered out during processing/alignment? I've checked the assim.log and it seems to be showing no aligned reads for each Specimen whereas my previous iGUIDE processing runs had this information populated:

Assimilate Inputs:
 Variables    Values
 uniqSites :  /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/unique_sites.IP280_revcomp_bc2.csv
 output :     /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/output/incorp_sites.IP280_revcomp_bc2.rds
 config :     /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/config.yml
 umitags :    /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/process_data/indices
 multihits :
 stat :       /home/ubuntu/iGUIDE/analysis/IP280_revcomp_bc2/process_data/stats/IP280_revcomp_bc2.assim.stat
 iguide_dir : /home/ubuntu/iGUIDE

Tabulation of aligned reads per specimen:
[1] Aligned_Reads
<0 rows> (or 0-length row.names)
Successfully completed script.

I've also checked the eval.log and there also appears to be an error after beginning the analysis:

Starting analysis...

Versioning:
[1] run_set       soft.version  build.version
<0 rows> (or 0-length row.names)
Error in mutate_impl(.data, dots) :
  Evaluation error: must be type logical, not double.
Calls: %>% ... <Anonymous> -> <Anonymous> -> mutate.tbl_df -> mutate_impl
Execution halted

Any insight would be greatly appreciated!

ShanSabri commented 3 years ago

The output of stats.core.csv suggest that no reads were aligned?

sampleName,align.multihit.clusters,align.multihit.lengths,align.multihit.reads,demulti.reads,filt.reads,R1.trim.reads,R2.primer.trim.reads,R2.trim.reads,umitags.reads
ambiguous_reads,NA,NA,NA,0,NA,NA,NA,NA,NA
degenerate_reads,NA,NA,NA,316,NA,NA,NA,NA,NA
IL26296,0,0,0,267763,0,232297,0,0,267763
IL26297,0,0,0,321768,0,271878,0,0,321768
IL26298,0,0,0,260155,0,223134,0,0,260155
IL26299,0,0,0,303865,0,255801,0,0,303865
unassigned_reads,NA,NA,NA,212137,NA,NA,NA,NA,NA
ShanSabri commented 3 years ago

Of note, these particular samples were sequenced on the a MiSeq. The MiSeq reads the i5 differently than other sequencers, the i5 is the reverse complement. So, I've specified this in the sampleInfo/sampleInfo.csv:

sampleName,barcode1,barcode2,gRNA
IL26296,TAAGGCGA,TAGATCGCNNWNNWNN,sgRNA_A
IL26297,CGTACTAG,CTCTCTATNNWNNWNN,sgRNA_B
IL26298,AGGCAGAA,TATCCTCTNNWNNWNN,sgRNA_C
IL26299,TCCTGAGC,AGAGTAGANNWNNWNN,sgRNA_D
ShanSabri commented 3 years ago

I wonder if the mappability is due to read quality? I have FastQC reports posted.

R1: Screen Shot 2021-07-08 at 7 57 03 PM

R2: Screen Shot 2021-07-08 at 7 57 33 PM

ShanSabri commented 3 years ago

There might be a concern that the guides assigned to each sample may be swapped. I wonder if this is the case then would it prohibit the alignment?

cnobles commented 3 years ago

Looks more like you are getting all your reads filtered out before alignment. Notice in your comment from the stats.core.csv file that you have demultiplexed reads (demux.reads) but filtered reads (filt.reads) is equal to zero for all the samples. When that gets broken down, it looks like R1 filtering is successful, but R2 seems to fail, returning zero reads. I would take a look at your primer sequences and the beginning of your R2 reads. The qualities are low, but you should still expect something to come through. It could very well be a mistaken sequence that is included during sequence trimming that isn't actually on your reads, or maybe you've pasted the reverse comp for the sequence. Possibly you have "N" sequences in your FASTQ files and none of the reads match the primer region within the error tolerance. Let me know if any of those pan out or if they don't. I'm sure we'll figure it out.

MatthewPace98 commented 1 year ago

Hi, I am having this same exact issue - had you found a solution to this, please?

Edit: Issue(s) solved. I had two main ones:

1) All my reads were getting filtered prior to alignment. To make filtering more lenient, I went into tools/rscripts/trim.R and reduced --minSeqLength to 20 and increased --badQualBases to 20.

2) My run name included a full stop character . which was preventing the sampleInfo and supp files from being read correctly.

cnobles commented 1 year ago

Hi Matthew,

Great that you were able to solve this on your own, I have a bit of a backlog with this repo and little time to develop currently. Just wanted to offer my opinions / suggestions, but just take them as such since there is a lot of wiggle room for informatic processing of DNA sequence data.

  1. Solid solution to dropping the minimal read length, but in practice mapping down to 20nt sequences may need to be accompanied by an increase in stringency for alignment mapping. For this you may want to adjust your alignment parameters in the config file (not sure if you are using BWA or BLAT). Additionally, I think this had little impact on your output if you didn't additionally change the minTempLength : 30 parameter in your config file. Maybe it got past the error of not having any reads, but where you able to process to completion? If you have a number of cycles though in your sequencing run that are just terrible quality, I can see how upping the --badQualBases would permit more reads to progress. I've ran into this issue a number of times with index reads having terrible quality and reviewing the fastq outputs from demux shows the single cycle across all reads with N. Not sure if that's the case here, but could be a reason for why your first solution worked.
  2. Breaking the naming rules messes with the way Snakemake manages the files and jobs, so that's also explainable. Other delimiters approaches work well, such as camel formatting names or using "_". Additionally the "-" can have different interpretation in the post-alignment processing, so check the documentation under the User Guide (Nomenclature and Semantics) as well as Sample Information Files. These have some good suggestions for how to optimize your naming to take advantage of the downstream analysis.

Best of luck, feel free to reach out again if you run into any issues.

Best, Chris