crisprVerse / crisprDesign

Comprehensive design of CRISPR gRNAs for nucleases and base editors
MIT License

Validating Existing gRNA Libraries - Error in Off-Target Characterization (addSpacerAlignment) #19

Closed stefanusbernard closed 1 year ago

stefanusbernard commented 1 year ago

Hi, I really appreciate the tools provided by the crisprVerse team. I tried to score different sgRNA libraries following the "Validating Existing gRNA Libraries" tutorial. First, I used the Avana library (70018 rows) and successfully generated the on- and off-target scores. However, when I used the Cellecta library (150076 rows), an error occurred in the addSpacerAlignments function (off-target characterization).

[runCrisprBowtie] Using BSgenome.Hsapiens.UCSC.hg38 
[runCrisprBowtie] Searching for SpCas9 protospacers 

reads processed: 149545
reads with at least one alignment: 149545 (100.00%)
reads that failed to align: 0 (0.00%)
Reported 6177820 alignments

Error in METHOD(x, i) : 
  Subsetting operation on CompressedGRangesList object 'x'
  produces a result that is too big to be represented as a
  CompressedList object. Please try to coerce 'x' to a SimpleList
  object first (with 'as(x, "SimpleList")').

The ensuing alignment generates a large dataset (614520 rows); after the subsequent filtering and construction of the GuideSet as described in the tutorial, the resulting GuideSet consists of 231660 rows. I also noticed that this error is similar to ones reported for other packages in #312 and #328. Could you kindly assist with this issue? Any suggestions and advice would be appreciated.

This is my session info:

R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_IE.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_IE.UTF-8   
 [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reshape_0.8.9                     ggfortify_0.4.16                 
 [3] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.66.3                  
 [5] Biostrings_2.66.0                 XVector_0.38.0                   
 [7] crisprDesignData_0.99.28          crisprViz_1.0.0                  
 [9] crisprDesign_1.0.0                crisprScore_1.2.0                
[11] crisprScoreData_1.2.0             ExperimentHub_2.6.0              
[13] AnnotationHub_3.6.0               BiocFileCache_2.6.1              
[15] dbplyr_2.3.2                      crisprBowtie_1.2.0               
[17] crisprBase_1.2.0                  crisprVerse_1.0.0                
[19] splitstackshape_1.4.8             rtracklayer_1.58.0               
[21] GenomicRanges_1.50.2              GenomeInfoDb_1.34.9              
[23] IRanges_2.32.0                    S4Vectors_0.36.2                 
[25] BiocGenerics_0.44.0               geno2proteo_0.0.6                
[27] patchwork_1.1.2                   hgnc_0.1.2                       
[29] data.table_1.14.8                 lubridate_1.9.2                  
[31] forcats_1.0.0                     stringr_1.5.0                    
[33] dplyr_1.1.1                       purrr_1.0.1                      
[35] readr_2.1.4                       tidyr_1.3.0                      
[37] tibble_3.2.1                      ggplot2_3.4.2                    
[39] tidyverse_2.0.0                   UniprotR_2.2.2                   

loaded via a namespace (and not attached):
  [1] utf8_1.2.3                    reticulate_1.28              
  [3] R.utils_2.12.2                RUnit_0.4.32                 
  [5] tidyselect_1.2.0              RSQLite_2.3.1                
  [7] AnnotationDbi_1.60.2          htmlwidgets_1.6.2            
  [9] grid_4.2.3                    BiocParallel_1.32.6          
 [11] airr_1.4.1                    munsell_0.5.0                
 [13] codetools_0.2-19              interp_1.1-4                 
 [15] withr_2.5.0                   colorspace_2.1-0             
 [17] Biobase_2.58.0                filelock_1.0.2               
 [19] knitr_1.42                    rstudioapi_0.14              
 [21] ggsignif_0.6.4                MatrixGenerics_1.10.0        
 [23] GenomeInfoDbData_1.2.9        bit64_4.0.5                  
 [25] basilisk_1.10.2               vctrs_0.6.1                  
 [27] generics_0.1.3                xfun_0.38                    
 [29] biovizBase_1.46.0             timechange_0.2.0             
 [31] randomForest_4.7-1.1          R6_2.5.1                     
 [33] AnnotationFilter_1.22.0       bitops_1.0-7                 
 [35] cachem_1.0.7                  DelayedArray_0.24.0          
 [37] vroom_1.6.1                   promises_1.2.0.1             
 [39] BiocIO_1.8.0                  networkD3_0.4                
 [41] scales_1.2.1                  nnet_7.3-18                  
 [43] gtable_0.3.3                  ensembldb_2.22.0             
 [45] rlang_1.1.0                   rstatix_0.7.2                
 [47] lazyeval_0.2.2                dichromat_2.0-0.1            
 [49] checkmate_2.1.0               broom_1.0.4                  
 [51] BiocManager_1.30.20           yaml_2.3.7                   
 [53] abind_1.4-5                   GenomicFeatures_1.50.4       
 [55] backports_1.4.1               httpuv_1.6.9                 
 [57] Hmisc_5.0-1                   tools_4.2.3                  
 [59] ellipsis_0.3.2                RColorBrewer_1.1-3           
 [61] Rcpp_1.0.10                   plyr_1.8.8                   
 [63] base64enc_0.1-3               progress_1.2.2               
 [65] zlibbioc_1.44.0               RCurl_1.98-1.12              
 [67] basilisk.utils_1.10.0         prettyunits_1.1.1            
 [69] deldir_1.0-6                  rpart_4.1.19                 
 [71] ggpubr_0.6.0                  cluster_2.1.4                
 [73] SummarizedExperiment_1.28.0   magrittr_2.0.3               
 [75] magick_2.7.4                  alakazam_1.2.1               
 [77] ProtGenerics_1.30.0           matrixStats_0.63.0           
 [79] evaluate_0.20                 hms_1.1.3                    
 [81] mime_0.12                     xtable_1.8-4                 
 [83] XML_3.99-0.14                 jpeg_0.1-10                  
 [85] gridExtra_2.3                 compiler_4.2.3               
 [87] biomaRt_2.54.1                crayon_1.5.2                 
 [89] R.oo_1.25.0                   htmltools_0.5.5              
 [91] later_1.3.0                   tzdb_0.3.0                   
 [93] Formula_1.2-5                 qdapRegex_0.7.5              
 [95] Rbowtie_1.38.0                DBI_1.1.3                    
 [97] gprofiler2_0.2.1              MASS_7.3-58.2                
 [99] rappdirs_0.3.3                data.tree_1.0.0              
[101] Matrix_1.5-3                  ade4_1.7-22                  
[103] car_3.1-2                     cli_3.6.1                    
[105] R.methodsS3_1.8.2             parallel_4.2.3               
[107] Gviz_1.42.1                   igraph_1.4.2                 
[109] pkgconfig_2.0.3               GenomicAlignments_1.34.1     
[111] dir.expiry_1.6.0              foreign_0.8-84               
[113] plotly_4.10.1                 xml2_1.3.3                   
[115] VariantAnnotation_1.44.1      digest_0.6.31                
[117] rmarkdown_2.21                htmlTable_2.4.1              
[119] restfulr_0.0.15               curl_5.0.0                   
[121] shiny_1.7.4                   Rsamtools_2.14.0             
[123] rjson_0.2.21                  lifecycle_1.0.3              
[125] nlme_3.1-162                  jsonlite_1.8.4               
[127] carData_3.0-5                 seqinr_4.2-30                
[129] viridisLite_0.4.1             fansi_1.0.4                  
[131] pillar_1.9.0                  ggsci_3.0.0                  
[133] lattice_0.20-45               KEGGREST_1.38.0              
[135] fastmap_1.1.1                 httr_1.4.5                   
[137] interactiveDisplayBase_1.36.0 glue_1.6.2                   
[139] png_0.1-8                     BiocVersion_3.16.0           
[141] bit_4.0.5                     stringi_1.7.12               
[143] blob_1.2.4                    latticeExtra_0.6-30          
[145] memoise_2.0.1                 ape_5.7-1

Jfortin1 commented 1 year ago

Thanks @stefanusbernard for reporting this! Would you be able to share your GuideSet object for the Cellecta library to give us a jump start? @ltHobbes, would you be able to help with this?

stefanusbernard commented 1 year ago

Hi, is there any update on this issue? Kindly let me know if there is one.

Jfortin1 commented 1 year ago

@stefanusbernard We are working on it

Jfortin1 commented 1 year ago

@stefanusbernard The problem comes from the fact that many of the spacer sequences are repeated in the GuideSet (e.g. CACCTGTAATCCCAGCTACT), and those sequences have thousands of alignments. This results in a final alignment table that has more than 3 billion rows, which causes the error. I suggest using addSpacerAlignmentsIterative (this worked for me), as it stops early when a given gRNA has hundreds of off-targets.
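
To make the size argument concrete, here is a minimal base-R sketch; it is not code from crisprDesign. It assumes that the limit being hit is R's 32-bit integer maximum (`.Machine$integer.max`) and that each copy of a repeated spacer is paired with every alignment of that sequence. The spacer vector and the per-spacer alignment counts (`hits`) are made up for illustration.

```r
# Illustrative sketch only: toy spacers and hypothetical alignment counts.
spacers <- c("CACCTGTAATCCCAGCTACT", "CACCTGTAATCCCAGCTACT",
             "GGTGCTAGCCTTGCGTTCCG", "CACCTGTAATCCCAGCTACT")

# Count how many times each spacer sequence occurs in the library.
copies <- table(spacers)
repeated <- names(copies)[copies > 1]

# Hypothetical number of genomic alignments per unique spacer.
hits <- c(CACCTGTAATCCCAGCTACT = 50000, GGTGCTAGCCTTGCGTTCCG = 1)

# Assumption: every copy of a spacer is paired with every alignment,
# so the table grows as (copies x alignments) for each sequence.
total_rows <- sum(as.numeric(copies[names(hits)]) * hits)

# Assumed limit: the largest value a 32-bit R integer can hold.
limit <- .Machine$integer.max
exceeds_limit <- 3e9 > limit
```

In this toy case a single promiscuous spacer present three times accounts for almost all rows, which is the same pattern, at a much smaller scale, as the one described above.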

stefanusbernard commented 1 year ago

Hi @Jfortin1, thanks for your help; addSpacerAlignmentsIterative works well. However, when I go on to add the on-target (addOnTargetScores) and off-target (addOffTargetScores) scores, I get the same error as before. I understand the point about repeated spacer sequences in the GuideSet that you mentioned, and I'd like to hear any suggestions you have, as I am trying to score the whole library. I really appreciate the assistance from the crisprVerse team, thanks again.

Jfortin1 commented 1 year ago

Hi @stefanusbernard, a simple solution here is to remove those promiscuous sgRNAs from the GuideSet upfront; there is little value in further annotating those sgRNAs, knowing that they map to thousands of loci.
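
A minimal sketch of that filtering step in base R, with a plain data frame standing in for the GuideSet. The column name `n_alignments`, the cutoff of 100 and the values are hypothetical; on a real GuideSet the equivalent would be subsetting on whatever alignment-count columns the alignment step added.

```r
# Illustrative sketch only: a data frame stands in for the GuideSet.
guides <- data.frame(
  spacer = c("CACCTGTAATCCCAGCTACT", "GGTGCTAGCCTTGCGTTCCG",
             "CACCTGTAATCCCAGCTACT", "TTGACCGATAGCTTACGGAA"),
  n_alignments = c(50000, 1, 50000, 3),  # hypothetical counts
  stringsAsFactors = FALSE
)

max_alignments <- 100  # hypothetical cutoff for calling a guide promiscuous

# Keep guides at or below the cutoff; of any repeated spacer,
# keep only the first copy.
keep <- guides$n_alignments <= max_alignments & !duplicated(guides$spacer)
filtered <- guides[keep, ]
removed <- guides[!keep, ]
```

Keeping `removed` alongside `filtered` makes it easy to report which guides were excluded from scoring and why.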

stefanusbernard commented 1 year ago

Hi @Jfortin1, thanks for your assistance. I managed to solve this issue, so I will close this thread.