HelenaLC / CATALYST

Cytometry dATa anALYsis Tools
66 stars 30 forks source link

filterSCE() version 1.22.0 #366

Open 83years opened 11 months ago

83years commented 11 months ago

Hi Helena,

I hope this finds you well.

I've just updated to the latest version and I think I may have spotted a bug. My dataset is made up of 4 conditions and I want to filter out the "Control" condition before running the clustering and DR.

#remove Controls
table(sce$condition)

       Group1       Control Healthy_Donor        Group2
      1332146        300000        600000        305666 

sce1 <- CATALYST::filterSCE(sce, condition != "Control")
table(sce1$condition)

 Group1 
1332146 

No matter what I negatively select for I always get the group1 data. The length of the data also remains the same when using sce1@colData with non-Group1 data being replaced by NAs.

When I positively select condition == "Control" I get table of extent 0 with all the data replaced by NAs.

Is this a bug or am I doing something wrong?

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggcyto_1.26.4               flowWorkspace_4.10.1        ncdfFlow_2.44.0             BH_1.81.0-1                 ComplexHeatmap_2.14.0       dplyr_1.1.2                
 [7] slingshot_2.6.0             TrajectoryUtils_1.6.0       princurve_2.1.6             ggplot2_3.4.3               RColorBrewer_1.1-3          cytofWorkflow_1.23.0       
[13] cowplot_1.1.1               uwot_0.1.16                 Matrix_1.6-1                HDCytoData_1.18.0           flowCore_2.10.0             ExperimentHub_2.6.0        
[19] AnnotationHub_3.6.0         BiocFileCache_2.6.1         dbplyr_2.3.3                diffcyt_1.18.0              CATALYST_1.22.0             SingleCellExperiment_1.20.1
[25] SummarizedExperiment_1.28.0 Biobase_2.58.0              GenomicRanges_1.50.2        GenomeInfoDb_1.34.9         IRanges_2.32.0              S4Vectors_0.36.0           
[31] BiocGenerics_0.44.0         MatrixGenerics_1.10.0       matrixStats_1.0.0           readxl_1.4.3                knitr_1.43                  BiocStyle_2.26.0           

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3                ragg_1.2.5                    tidyr_1.3.0                   bit64_4.0.5                   irlba_2.3.5.1                
  [6] multcomp_1.4-25               DelayedArray_0.24.0           data.table_1.14.8             KEGGREST_1.38.0               RCurl_1.98-1.12              
 [11] doParallel_1.0.17             generics_0.1.3                ScaledMatrix_1.6.0            TH.data_1.1-2                 RSQLite_2.3.1                
 [16] bit_4.0.5                     httpuv_1.6.11                 viridis_0.6.4                 xfun_0.40                     evaluate_0.21                
 [21] promises_1.2.1                fansi_1.0.4                   Rgraphviz_2.42.0              igraph_1.5.1                  DBI_1.1.3                    
 [26] purrr_1.0.2                   ellipsis_0.3.2                ggnewscale_0.4.9              ggpubr_0.6.0                  backports_1.4.1              
 [31] cytolib_2.10.1                ROI_1.0-1                     sparseMatrixStats_1.10.0      vctrs_0.6.3                   abind_1.4-5                  
 [36] cachem_1.0.8                  withr_2.5.0                   ggforce_0.4.1                 checkmate_2.2.0               cluster_2.1.4                
 [41] crayon_1.5.2                  drc_3.0-1                     edgeR_3.40.2                  pkgconfig_2.0.3               slam_0.1-50                  
 [46] labeling_0.4.2                tweenr_2.0.2                  nlme_3.1-160                  vipor_0.4.5                   rlang_1.1.1                  
 [51] lifecycle_1.0.3               sandwich_3.0-2                registry_0.5-1                filelock_1.0.2                rsvd_1.0.5                   
 [56] cellranger_1.1.0              polyclip_1.10-4               graph_1.76.0                  carData_3.0-5                 boot_1.3-28                  
 [61] zoo_1.8-12                    beeswarm_0.4.0                ggridges_0.5.4                GlobalOptions_0.1.2           pheatmap_1.0.12              
 [66] png_0.1-8                     viridisLite_0.4.2             rjson_0.2.21                  bitops_1.0-7                  ConsensusClusterPlus_1.62.0  
 [71] Biostrings_2.66.0             blob_1.2.4                    DelayedMatrixStats_1.20.0     shape_1.4.6                   stringr_1.5.0                
 [76] rstatix_0.7.2                 ggsignif_0.6.4                beachmat_2.14.2               scales_1.2.1                  memoise_2.0.1                
 [81] magrittr_2.0.3                plyr_1.8.8                    hexbin_1.28.3                 zlibbioc_1.44.0               compiler_4.2.2               
 [86] plotrix_3.8-2                 clue_0.3-64                   lme4_1.1-34                   cli_3.6.1                     XVector_0.38.0               
 [91] FlowSOM_2.6.0                 MASS_7.3-58.1                 tidyselect_1.2.0              stringi_1.7.12                RProtoBufLib_2.10.0          
 [96] textshaping_0.3.6             yaml_2.3.7                    BiocSingular_1.14.0           locfit_1.5-9.8                ggrepel_0.9.3                
[101] tools_4.2.2                   parallel_4.2.2                circlize_0.4.15               rstudioapi_0.15.0             foreach_1.5.2                
[106] gridExtra_2.3                 farver_2.1.1                  Rtsne_0.16                    digest_0.6.33                 BiocManager_1.30.22          
[111] shiny_1.7.5                   Rcpp_1.0.11                   car_3.1-2                     broom_1.0.5                   scuttle_1.8.4                
[116] BiocVersion_3.16.0            later_1.3.1                   httr_1.4.7                    AnnotationDbi_1.60.2          colorspace_2.1-0             
[121] XML_3.99-0.14                 splines_4.2.2                 scater_1.26.1                 systemfonts_1.0.4             xtable_1.8-4                 
[126] nloptr_2.0.3                  R6_2.5.1                      pillar_1.9.0                  htmltools_0.5.6               mime_0.12                    
[131] nnls_1.4                      glue_1.6.2                    fastmap_1.1.1                 minqa_1.2.5                   BiocParallel_1.32.6          
[136] BiocNeighbors_1.16.0          interactiveDisplayBase_1.36.0 codetools_0.2-18              mvtnorm_1.2-2                 utf8_1.2.3                   
[141] lattice_0.20-45               tibble_3.2.1                  numDeriv_2016.8-1.1           curl_5.0.2                    ggbeeswarm_0.7.2             
[146] colorRamps_2.3.1              gtools_3.9.4                  survival_3.4-0                limma_3.54.2                  rmarkdown_2.24               
[151] munsell_0.5.0                 GetoptLong_1.0.5              GenomeInfoDbData_1.2.9        iterators_1.0.14              reshape2_1.4.4               
[156] gtable_0.3.3   
83years commented 11 months ago

I have a work-around from the tidySingleCellExperiment package.

> sce <- filter(sce, condition != "Control")
> table(sce$condition)

       Group1       Control Healthy_Donor        Group2
      1332146             0        600000        305666 
purisuto commented 11 months ago

I have noticed that filterSCE() may drop some data if there are some empty entries in the initial metadata spreadsheet.

HelenaLC commented 8 months ago

Can't reproduce this currently... Could you post your metadata table? I.e., metadata(sce)$experiment_info? As opposed to the tidy solution, which only subsets based on the colData, filterSCE is specifically designed to work within the CATALYST framework. E.g., it will also assure metadata are in synch. For example, the experiment_info will be recomputed after filtering (using internal CATALYST:::.get_ei()), and cluster_codes will also be adjusted. The only thing going wrong here, I can image, is something with the sample_ids, which are needed for the experiment_info to be set up correctly.

purisuto commented 8 months ago

Sorry it's been a while since I encountered the issue. But one thing that seems to work for me is to replace ei() in filterSCE() with this modified function, which still relies on sample_ids().

ei2 <- function(sce, meta = c("patient_id"))

{
# Make a data frame of sample_ids
ns <- table(sample_id = sample_ids(sce))
df <- as.data.frame(ns)
m <- match(df$sample_id, sce$sample_id)

# Extract metadata
for (i in meta) df[[i]] <- sce[[i]][m]

return(df)
}
HelenaLC commented 8 months ago

Yeah, sure, that works in certain cases- the issue is that match will return the 1st matching entry, and there is no guarantee that sample_ids match uniquely to what's in the SCE's colData. This could result in discrepancies that may go unnoticed. E.g., cells from a given sample_id will have different cluster_ids- that's what's "special" with how .get_ei() constructs the metadata: only variables that match uniquely to sample_ids will be retained.