WubingZhang / MAGeCKFlute

Integrative analysis pipeline for pooled CRISPR functional genetic screens
https://github.com/WubingZhang/MAGeCKFlute

Error in `[.data.frame`(dd, , symbol) : undefined columns selected #20

Open adbeggs opened 1 year ago

adbeggs commented 1 year ago

Hi

Great tool - thanks for writing it. I am running an RRA analysis and it crashes out with:

> FluteRRA(gene_summary = "SOX2.gene_summary.txt",sgrna_summary = "SOX2.sgrna_summary.txt",keytype = "Symbol",organism = "hsa",incorporateDepmap = TRUE)
2023-05-06 16:21:22 # Create output dir and pdf file ...
2023-05-06 16:21:22 # Read RRA result ...
2083 genes fail to convert into Entrez IDs: hsa-mir-3164, hsa-mir-26a-1, hsa-mir-4639, NonTargetingControlGuideForHuman_0175, hsa-mir-4254, hsa-mir-3943, hsa-mir-4635, hsa-mir-92a-1, NonTargetingControlGuideForHuman_0427, hsa-mir-299, hsa-mir-147b, hsa-mir-194-2, hsa-mir-3180-1, hsa-mir-3138, hsa-mir-3679, hsa-mir-520f, hsa-mir-3180-4, NonTargetingControlGuideForHuman_0743, 
117 genes have duplicate Entrez IDs: BRINP3, STRA13, FAM21B, TMEM48, CTSL1, GTDC2, C10orf12, BTBD8, HNRNPLL, CXorf30, CT45A4, KMT2D, PALM2-AKAP2, CSB-PGBD3, DBC1, C11orf93, PALM2, SPANXF1, C9orf47, DPH7, ACKR4, PION, MISP, HIST1H2BD, FOXD4L4, TXNRD3, ERMARD, SPATA31A1, EFTUD1, DPH6, NOV, NARR, PLEKHG7, CHDC2, MLL, EPRS, BET3L, CXCR7, C3orf37, MICALCL, LSMD1, MNF1, NEBL, PHYKPL, SMCR7, SPATA31A7, B3GNT2, KIAA1967, PRAMEF20, DUS2, FBXL19, C10orf68, ETNPPL, KIAA1704, DUSP27, HYAL1, GTPBP5, HIST1H2BI, C10orf114, HIST1H2BG, RBAK-LOC389458, SLA2, PRAC, RTCB, MLL5, CRAMP1L, CTSV, MLL2, C2orf47, WWTR1, SLC35E2B, CNIH, SPANXE, HIST1H2BE, POMK, BRINP2, C8orf42, CCBP2, NADK2, SEPT6, CARF, KIAA0317, ZBED6CL, BHLHE40, MYCL, DSCR6, GPER, FTSJD1, DMTN, C2orf48, HIST1H2BF, MUM1, LOR, CENPC1, NRROS, C10orf131, ANXA8L2, SMCR7L, SEPT4, ZFP106, SOLH, OSER1, MIA2, SLITRK2, SPIDR, MLL3, SPATA33, FTSJD2, SLC8B1, SHFM1, hsa-mir-548ba, HYKK, TBC1D31, TMEM261, GIF, DXO, GATSL2
snapshotDate(): 2022-10-31
see ?depmap and browseVignettes('depmap') for documentation
downloading 1 resources
retrieving 1 resource
  |======================================================================================| 100%

loading from cache
see ?depmap and browseVignettes('depmap') for documentation
downloading 1 resources
retrieving 1 resource
  |======================================================================================| 100%

loading from cache
2023-05-06 16:22:17 # Enrichment analysis of 9 Square grouped genes ...
2023-05-06 16:22:17 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    1835 genes are mapped ...
2023-05-06 16:22:24 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    313 genes are mapped ...
2023-05-06 16:22:28 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    166 genes are mapped ...
2023-05-06 16:22:31 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    198 genes are mapped ...
2023-05-06 16:22:35 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    2148 genes are mapped ...
2023-05-06 16:22:42 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    2001 genes are mapped ...
2023-05-06 16:22:48 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    511 genes are mapped ...
2023-05-06 16:22:53 # Running KEGG+REACTOME+GOBP+Complex enrichment analysis
    364 genes are mapped ...
Error in `[.data.frame`(dd, , symbol) : undefined columns selected
In addition: Warning message:
ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps 

Any suggestions? I can't see anything obvious I am doing wrong?

hzuzu commented 1 year ago

Hi,

I am running into the same error with the following command:

FluteRRA("demo.gene_summary.txt", "demo.sgrna_summary.txt", proj="demo", outdir = "MAGeCKFlute_results")

snapshotDate(): 2022-10-31
see ?depmap and browseVignettes('depmap') for documentation
loading from cache
snapshotDate(): 2022-10-31
see ?depmap and browseVignettes('depmap') for documentation
loading from cache
snapshotDate(): 2022-10-31
see ?depmap and browseVignettes('depmap') for documentation
loading from cache
snapshotDate(): 2022-10-31
see ?depmap and browseVignettes('depmap') for documentation
loading from cache
Error in `[.data.frame`(dd, , symbol) : undefined columns selected

However, if I set organism to "mmu", it runs without errors:

FluteRRA("demo.gene_summary.txt", "demo.sgrna_summary.txt", organism = "mmu", proj="demo", outdir = "MAGeCKFlute_results")

But my data is from human!

yeminlan commented 11 months ago

Same here. With "mmu" it runs through, but with "hsa" it fails.

felixm3 commented 8 months ago

I'm having the same issue. Has anyone been able to figure out a workaround?

My command:

## path to the gene summary file (required)
file1 = 'run4_3572.gene_summary.txt'
## path to the sgRNA summary file (optional)
file2 = 'run4_3572.sgrna_summary.txt'
# Run FluteRRA with both gene summary file and sgRNA summary file
FluteRRA(file1, file2, proj="shif122601", organism="hsa", outdir = "./")

The error message:

2023-12-26 13:37:06.356662 # Create output dir and pdf file ...

2023-12-26 13:37:06.362029 # Read RRA result ...

see ?depmap and browseVignettes('depmap') for documentation

loading from cache

see ?depmap and browseVignettes('depmap') for documentation

loading from cache

see ?depmap and browseVignettes('depmap') for documentation

loading from cache

see ?depmap and browseVignettes('depmap') for documentation

loading from cache

Error in `[.data.frame`(dd, , symbol): undefined columns selected
Traceback:

1. FluteRRA(file1, file2, proj = "shif122601", organism = "hsa", 
 .     outdir = "./")
2. OmitCommonEssential(dd.sgrna, symbol = "HumanGene")
3. dd[!(dd[, symbol] %in% lethal_genes), ]
4. `[.data.frame`(dd, !(dd[, symbol] %in% lethal_genes), )
5. dd[, symbol] %in% lethal_genes
6. dd[, symbol]
7. `[.data.frame`(dd, , symbol)
8. stop("undefined columns selected")

sessionInfo()


R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/fmbuga/.conda/envs/mageck2/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: US/Pacific
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] depmap_1.16.0          dplyr_1.1.4            ggplot2_3.4.4         
[4] clusterProfiler_4.10.0 MAGeCKFlute_2.6.0     

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3            jsonlite_1.8.8               
  [3] magrittr_2.0.3                farver_2.1.1                 
  [5] fs_1.6.3                      zlibbioc_1.48.0              
  [7] vctrs_0.6.5                   memoise_2.0.1                
  [9] RCurl_1.98-1.13               ggtree_3.10.0                
 [11] base64enc_0.1-3               htmltools_0.5.7              
 [13] AnnotationHub_3.10.0          curl_5.2.0                   
 [15] gridGraphics_0.5-1            plyr_1.8.9                   
 [17] cachem_1.0.8                  uuid_1.1-1                   
 [19] igraph_1.6.0                  mime_0.12                    
 [21] lifecycle_1.0.4               pkgconfig_2.0.3              
 [23] gson_0.1.0                    Matrix_1.6-4                 
 [25] R6_2.5.1                      fastmap_1.1.1                
 [27] GenomeInfoDbData_1.2.11       shiny_1.8.0                  
 [29] digest_0.6.33                 aplot_0.2.2                  
 [31] enrichplot_1.22.0             colorspace_2.1-0             
 [33] patchwork_1.1.3               AnnotationDbi_1.64.1         
 [35] S4Vectors_0.40.2              pathview_1.42.0              
 [37] ExperimentHub_2.10.0          RSQLite_2.3.4                
 [39] org.Hs.eg.db_3.18.0           filelock_1.0.3               
 [41] fansi_1.0.6                   httr_1.4.7                   
 [43] polyclip_1.10-6               compiler_4.3.1               
 [45] bit64_4.0.5                   withr_2.5.2                  
 [47] BiocParallel_1.36.0           viridis_0.6.4                
 [49] DBI_1.2.0                     ggforce_0.4.1                
 [51] MASS_7.3-60                   rappdirs_0.3.3               
 [53] HDO.db_0.99.1                 tools_4.3.1                  
 [55] scatterpie_0.2.1              ape_5.7-1                    
 [57] interactiveDisplayBase_1.40.0 httpuv_1.6.13                
 [59] glue_1.6.2                    nlme_3.1-164                 
 [61] GOSemSim_2.28.0               promises_1.2.1               
 [63] shadowtext_0.1.2              grid_4.3.1                   
 [65] pbdZMQ_0.3-10                 reshape2_1.4.4               
 [67] fgsea_1.28.0                  generics_0.1.3               
 [69] gtable_0.3.4                  tidyr_1.3.0                  
 [71] data.table_1.14.10            tidygraph_1.3.0              
 [73] utf8_1.2.4                    XVector_0.42.0               
 [75] BiocGenerics_0.48.1           ggrepel_0.9.4                
 [77] BiocVersion_3.18.1            pillar_1.9.0                 
 [79] stringr_1.5.1                 yulab.utils_0.1.2            
 [81] IRdisplay_1.1                 later_1.3.2                  
 [83] splines_4.3.1                 tweenr_2.0.2                 
 [85] treeio_1.26.0                 BiocFileCache_2.10.1         
 [87] lattice_0.22-5                bit_4.0.5                    
 [89] tidyselect_1.2.0              GO.db_3.18.0                 
 [91] Biostrings_2.70.1             gridExtra_2.3                
 [93] IRanges_2.36.0                stats4_4.3.1                 
 [95] graphlayouts_1.0.2            Biobase_2.62.0               
 [97] KEGGgraph_1.62.0              stringi_1.8.3                
 [99] lazyeval_0.2.2                ggfun_0.1.3                  
[101] yaml_2.3.8                    evaluate_0.23                
[103] codetools_0.2-19              ggraph_2.1.0                 
[105] tibble_3.2.1                  qvalue_2.34.0                
[107] Rgraphviz_2.46.0              BiocManager_1.30.22          
[109] graph_1.80.0                  ggplotify_0.1.2              
[111] cli_3.6.2                     IRkernel_1.3.2               
[113] xtable_1.8-4                  repr_1.1.6                   
[115] munsell_0.5.0                 Rcpp_1.0.11                  
[117] GenomeInfoDb_1.38.2           dbplyr_2.4.0                 
[119] png_0.1-8                     XML_3.99-0.16                
[121] parallel_4.3.1                ellipsis_0.3.2               
[123] blob_1.2.4                    DOSE_3.28.2                  
[125] bitops_1.0-7                  tidytree_0.4.6               
[127] viridisLite_0.4.2             scales_1.3.0                 
[129] purrr_1.0.2                   crayon_1.5.2                 
[131] rlang_1.1.2                   cowplot_1.1.2                
[133] fastmatch_1.1-4               KEGGREST_1.42.0             

hzuzu commented 7 months ago

I found a workaround for this issue by skipping a few steps. I believe the error comes from two steps wrapped inside the 'FluteRRA' function: 'Computing the similarity between the CRISPR screen with Depmap screens' and 'Omit common essential genes from the data'. For my purposes I did not need the common essential genes removed, so I followed the MAGeCKFlute R package documentation step by step, skipped the functions in steps 2.3.1 and 2.3.2, and the rest of the steps worked for me. The documentation describes the 'FluteRRA' wrapper in detail. If you would rather use the wrapper function, the call below might work, although I haven't tried it. Note that it will not use the Depmap screens and will not omit the essential genes.

FluteRRA(file1, proj="Pmel1", organism="mmu", incorporateDepmap = FALSE, omitEssential = FALSE, outdir = "./")
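
For human data, the same idea might look like the sketch below. It assumes that setting incorporateDepmap = FALSE and omitEssential = FALSE is enough to skip the Depmap steps when organism = "hsa", which nobody in this thread has confirmed. The file name and project label are placeholders.

```r
# Arguments for a FluteRRA() run that skips the Depmap-based steps.
# File name and project label are placeholders; swap in your own.
flute_args <- list(
  gene_summary      = "demo.gene_summary.txt",
  proj              = "demo",
  organism          = "hsa",
  incorporateDepmap = FALSE,  # no comparison against Depmap screens
  omitEssential     = FALSE,  # common essential genes are kept
  outdir            = "./"
)

# Only run when the package and the input file are actually available.
if (requireNamespace("MAGeCKFlute", quietly = TRUE) &&
    file.exists(flute_args$gene_summary)) {
  do.call(MAGeCKFlute::FluteRRA, flute_args)
}
```

Keeping the arguments in a list makes it easy to flip one flag at a time and see which step triggers the error.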

sgt1796 commented 1 month ago

TL;DR: It's because the "Gene" column in sgrna_summary.txt (output from mageck test) doesn't match the expected name "id" that is hardcoded in the FluteRRA script.

I did some investigation and found this part of the FluteRRA() function, which might be causing the error:

if (omitEssential) {
    dd = OmitCommonEssential(dd, symbol = "HumanGene")
    dd.sgrna = OmitCommonEssential(dd.sgrna, symbol = "HumanGene")
    write.table(dd, file.path(outdir, paste0("RRA/", proj, 
                                             "_omit_essential.txt")), sep = "\t", row.names = FALSE, 
                quote = FALSE)
  }

I then looked into the OmitCommonEssential function, which obtains data from Depmap and processes it. At the end of OmitCommonEssential(), these lines filter the input data frame:

  lethal_genes = Selector(Depmap, -0.5, select = 0.9)$sig
  dd = dd[!(dd[, symbol] %in% lethal_genes), ]
  return(dd)

This filter keeps only the rows considered non-lethal. It looks up the column named by the symbol argument, here "HumanGene" (symbol = "HumanGene"). That column is expected to exist because dd.sgrna is supposed to have an id column that is later copied into HumanGene:

dd.sgrna$Symbol = dd.sgrna$id
...
...
dd.sgrna$HumanGene = dd.sgrna$Symbol

However, that is not the case. The sgrna_summary.txt output from mageck test generally has this column named "Gene" instead of "id", and that causes the problem.
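
The failure can be reproduced in base R without MAGeCKFlute. This is a minimal sketch under the assumption that the table returned by ReadsgRRA() has the columns sgrna, Gene, LFC and FDR and no id column; the gene names, values and lethal_genes list are made up for illustration.

```r
# Toy stand-in for the table ReadsgRRA() returns: it has "Gene" but no "id".
dd.sgrna <- data.frame(
  sgrna = c("sg1", "sg2", "sg3"),
  Gene  = c("GENE_A", "GENE_B", "GENE_C"),
  LFC   = c(-1.5, 0.2, 0.9),
  FDR   = c(0.01, 0.70, 0.30),
  stringsAsFactors = FALSE
)

# dd.sgrna$id is NULL, and assigning NULL to a new column creates nothing,
# so neither "Symbol" nor "HumanGene" ends up in the data frame.
dd.sgrna$Symbol    <- dd.sgrna$id
dd.sgrna$HumanGene <- dd.sgrna$Symbol

has_human_gene <- "HumanGene" %in% colnames(dd.sgrna)

# Same indexing pattern as in OmitCommonEssential(); capture the error text.
lethal_genes <- c("GENE_B")  # made-up list, only for this sketch
err <- tryCatch(
  {
    dd.sgrna[!(dd.sgrna[, "HumanGene"] %in% lethal_genes), ]
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

print(has_human_gene)
print(err)
```

The two assignments fail silently, which is why the problem only surfaces later, inside OmitCommonEssential(), as the "undefined columns selected" error.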

sgt1796 commented 1 month ago

What I did to resolve this issue was to manually override the ReadsgRRA() function so that its output includes an id column:

library(MAGeCKFlute)
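# Replacement for ReadsgRRA() that also returns an "id" column
ReadsgRRA = function (sgRNA_summary) 
{
  # Accept either a file path or an already-loaded data frame
  if (is.null(dim(sgRNA_summary))) {
    sgRNA_summary = read.table(file = sgRNA_summary, sep = "\t", 
                               header = TRUE, quote = "", comment.char = "", check.names = FALSE, 
                               stringsAsFactors = FALSE)
  }
  dd = sgRNA_summary[, c("sgrna", "Gene", "LFC", "FDR")]
  # Copy "Gene" into the "id" column that FluteRRA() looks for
  dd$id = dd$Gene
  return(dd)
}
# Swap the patched function into the package namespace
assignInNamespace("ReadsgRRA", ReadsgRRA, ns = "MAGeCKFlute")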
# Then FluteRRA() should work
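
To see why the extra column helps, here is a small sketch that runs without MAGeCKFlute. The function read_sgrna_with_id is a hypothetical standalone copy of the override above, and the toy file contains only the four columns the override selects, with made-up values.

```r
# Write a tiny sgRNA summary with the header described above (toy values).
tmp <- tempfile(fileext = ".sgrna_summary.txt")
toy <- data.frame(
  sgrna = c("sg1", "sg2", "sg3"),
  Gene  = c("GENE_A", "GENE_B", "GENE_C"),
  LFC   = c(-1.5, 0.2, 0.9),
  FDR   = c(0.01, 0.70, 0.30)
)
write.table(toy, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Hypothetical standalone copy of the override, so it runs without MAGeCKFlute.
read_sgrna_with_id <- function(path) {
  dd <- read.table(path, sep = "\t", header = TRUE, quote = "",
                   comment.char = "", check.names = FALSE,
                   stringsAsFactors = FALSE)
  dd <- dd[, c("sgrna", "Gene", "LFC", "FDR")]
  dd$id <- dd$Gene
  dd
}

# With "id" present, the Symbol -> HumanGene chain creates real columns.
dd.sgrna <- read_sgrna_with_id(tmp)
dd.sgrna$Symbol    <- dd.sgrna$id
dd.sgrna$HumanGene <- dd.sgrna$Symbol

# The filter from OmitCommonEssential() now has a column to index.
lethal_genes <- c("GENE_B")  # made-up list, only for this sketch
kept <- dd.sgrna[!(dd.sgrna[, "HumanGene"] %in% lethal_genes), ]
print(kept$Gene)
unlink(tmp)
```

Here the filter drops the row for the made-up lethal gene and keeps the other two, instead of stopping with an error.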

Divery-cn commented 3 weeks ago

It worked, Thanks!!