ShixiangWang / sigminer

🌲 An easy-to-use and scalable toolkit for genomic alteration signature (a.k.a. mutational signature) analysis and visualization in R https://shixiangwang.github.io/sigminer/reference/index.html
https://shixiangwang.github.io/sigminer/
Other
147 stars 19 forks source link

NULL Row Names in NMF Matrix When Using sig_tally() with Wang Method #465

Open selkamand opened 2 months ago

selkamand commented 2 months ago

Hi @ShixiangWang,

Hope you're well.

Just noticed that when using the sig_tally() function with the Wang method, the row names in the resulting NMF matrix (tally_wang$nmf_matrix) are unexpectedly NULL. This behavior is inconsistent with the expected outcome, where the row names would ideally be the sample IDs, as observed when using the S (Steele) method. Reprex below:

# Load the sigminer library
library(sigminer)
#> Registered S3 method overwritten by 'sigminer':
#>   method      from
#>   print.bytes Rcpp
#> sigminer version 2.3.2
#> - Star me at https://github.com/ShixiangWang/sigminer
#> - Run hello() to see usage and citation.

# Load a toy segmentation table included with the sigminer package
load(system.file("extdata", "toy_segTab.RData",
                 package = "sigminer", mustWork = TRUE
))

# Set a seed for reproducibility
set.seed(1234)

# Add a new column 'minor_cn' with random values of either 0 or 1
segTabs$minor_cn <- sample(c(0, 1), size = nrow(segTabs), replace = TRUE)

# Subset the segmentation table for a single sample
singleSampleSegTabs <- subset(segTabs, sample == "TCGA-A8-A07S-01A-11D-A036-01")

# Read the copy number data for the single sample
cn <- read_copynumber(singleSampleSegTabs,
                      seg_cols = c("chromosome", "start", "end", "segVal"),
                      genome_measure = "wg", complement = TRUE, add_loh = TRUE
)
#> ℹ [2024-09-05 14:58:13.990813]: Started.
#> ℹ [2024-09-05 14:58:13.995483]: Genome build  : hg19.
#> ℹ [2024-09-05 14:58:13.996023]: Genome measure: wg.
#> ℹ [2024-09-05 14:58:13.996513]: When add_loh is TRUE, use_all is forced to TRUE.
#> Please drop columns you don't want to keep before reading.
#> ✔ [2024-09-05 14:58:14.002538]: Chromosome size database for build obtained.
#> ℹ [2024-09-05 14:58:14.003157]: Reading input.
#> ✔ [2024-09-05 14:58:14.003672]: A data frame as input detected.
#> ✔ [2024-09-05 14:58:14.004327]: Column names checked.
#> ✔ [2024-09-05 14:58:14.005043]: Column order set.
#> ✔ [2024-09-05 14:58:14.006148]: Chromosomes unified.
#> ✔ [2024-09-05 14:58:14.008141]: Value 2 (normal copy) filled to uncalled chromosomes.
#> ✔ [2024-09-05 14:58:14.009723]: Data imported.
#> ℹ [2024-09-05 14:58:14.010245]: Segments info:
#> ℹ [2024-09-05 14:58:14.010774]:     Keep - 45
#> ℹ [2024-09-05 14:58:14.011272]:     Drop - 0
#> ✔ [2024-09-05 14:58:14.011938]: Segments sorted.
#> ℹ [2024-09-05 14:58:14.012445]: Adding LOH labels...
#> ℹ [2024-09-05 14:58:14.013251]: Joining adjacent segments with same copy number value. Be patient...
#> ✔ [2024-09-05 14:58:14.018461]: 39 segments left after joining.
#> ✔ [2024-09-05 14:58:14.019082]: Segmental table cleaned.
#> ℹ [2024-09-05 14:58:14.019575]: Annotating.
#> ✔ [2024-09-05 14:58:14.024584]: Annotation done.
#> ℹ [2024-09-05 14:58:14.02513]: Summarizing per sample.
#> ✔ [2024-09-05 14:58:14.033565]: Summarized.
#> ℹ [2024-09-05 14:58:14.034175]: Generating CopyNumber object.
#> ✔ [2024-09-05 14:58:14.034964]: Generated.
#> ℹ [2024-09-05 14:58:14.03548]: Validating object.
#> ✔ [2024-09-05 14:58:14.036014]: Done.
#> ℹ [2024-09-05 14:58:14.036588]: 0.046 secs elapsed.

# Tally signatures using the Wang method
tally_wang <- sigminer::sig_tally(cn, method = "Wang", keep_only_matrix = FALSE)
#> ℹ [2024-09-05 14:58:14.03747]: Started.
#> ℹ [2024-09-05 14:58:14.038619]: Step: getting copy number features.
#> ℹ [2024-09-05 14:58:14.076979]: Getting breakpoint count per 10 Mb...
#> ℹ [2024-09-05 14:58:14.104202]: Getting breakpoint count per chromosome arm...
#> ℹ [2024-09-05 14:58:14.111208]: Getting copy number...
#> ℹ [2024-09-05 14:58:14.112797]: Getting change-point copy number change...
#> ℹ [2024-09-05 14:58:14.118523]: Getting length of chains of oscillating copy number...
#> ℹ [2024-09-05 14:58:14.124469]: Getting (log10 based) segment size...
#> ℹ [2024-09-05 14:58:14.125993]: Getting the minimal number of chromosome with 50% CNV...
#> ℹ [2024-09-05 14:58:14.133015]: Getting burden of chromosome...
#> ✔ [2024-09-05 14:58:14.140732]: Gotten.
#> ℹ [2024-09-05 14:58:14.14139]: Step: generating copy number components.
#> ✔ [2024-09-05 14:58:14.141898]: `feature_setting` checked.
#> ℹ [2024-09-05 14:58:14.143294]: Step: counting components.
#> ✔ [2024-09-05 14:58:14.250663]: Counted.
#> ℹ [2024-09-05 14:58:14.251405]: Step: generating components by sample matrix.
#> ✔ [2024-09-05 14:58:14.252393]: Matrix generated.
#> ℹ [2024-09-05 14:58:14.252925]: 0.215 secs elapsed.

# Tally signatures using the Steele method
tally_steele <- sigminer::sig_tally(cn, method = "S", keep_only_matrix = FALSE)
#> ℹ [2024-09-05 14:58:14.253776]: Started.
#> ℹ [2024-09-05 14:58:14.25461]: When you use method 'S', please make sure you have set 'join_adj_seg' to FALSE and 'add_loh' to TRUE in 'read_copynumber() in the previous step!
#> ✔ [2024-09-05 14:58:14.260849]: Matrix generated.
#> ℹ [2024-09-05 14:58:14.261393]: 0.008 secs elapsed.

# ISSUE: The row names of the NMF matrix for the Wang tally are NULL,
# but they are expected to be the sample IDs.
rownames(tally_wang$nmf_matrix)
#> NULL

# EXPECTED BEHAVIOR: The row names for the Steele tally include the sample names as expected.
rownames(tally_steele$nmf_matrix)
#> [1] "TCGA-A8-A07S-01A-11D-A036-01"

# SessionInfo
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.4.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> time zone: Australia/Sydney
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] purrr_1.0.2         sigminer_2.3.2      Biobase_2.64.0     
#> [4] BiocGenerics_0.50.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.3      tidyr_1.3.1        future_1.34.0      utf8_1.2.4        
#>  [5] generics_0.1.3     stringi_1.8.4      listenv_0.9.1      digest_0.6.36     
#>  [9] magrittr_2.0.3     evaluate_0.24.0    grid_4.4.0         RColorBrewer_1.1-3
#> [13] iterators_1.0.14   fastmap_1.2.0      R.oo_1.26.0        foreach_1.5.2     
#> [17] doParallel_1.0.17  R.cache_0.16.0     plyr_1.8.9         R.utils_2.12.3    
#> [21] fansi_1.0.6        scales_1.3.0       codetools_0.2-20   cli_3.6.3         
#> [25] registry_0.5-1     rlang_1.1.4        R.methodsS3_1.8.2  parallelly_1.38.0 
#> [29] munsell_0.5.1      reprex_2.1.0       withr_3.0.1        yaml_2.3.8        
#> [33] tools_4.4.0        parallel_4.4.0     reshape2_1.4.4     dplyr_1.1.4       
#> [37] colorspace_2.1-1   ggplot2_3.5.1      rngtools_1.5.2     NMF_0.27          
#> [41] globals_0.16.3     vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4   
#> [45] stringr_1.5.1      fs_1.6.4           furrr_0.3.1        cluster_2.1.6     
#> [49] pkgconfig_2.0.3    pillar_1.9.0       gtable_0.3.5       data.table_1.15.4 
#> [53] glue_1.7.0         Rcpp_1.0.13        xfun_0.45          tibble_3.2.1      
#> [57] tidyselect_1.2.1   rstudioapi_0.16.0  knitr_1.47         htmltools_0.5.8.1 
#> [61] rmarkdown_2.27     compiler_4.4.0     gridBase_0.4-7

Created on 2024-09-05 with reprex v2.1.0

ShixiangWang commented 2 months ago

@selkamand Thanks for your potential bug report. I will take a look and fix it in a few days.