bhklab / genefu

R package providing various functions relevant for gene expression analysis with emphasis on breast cancer.

Centroids of PAM50 and Different result for TCGA-BRCA RNA-seq data #27

Closed calisirkubra closed 2 years ago

calisirkubra commented 2 years ago

I am using the genefu package to predict subtypes for my breast cancer RNA-seq data, with the pam50.robust model. However, I noticed that the centroid values have changed. I saved the centroids when I first tried this (almost one month ago), and those values are not the same as the current ones. I do not understand why, since as far as I know the package should use the official centroids from Parker, 2009.

I also have another question. I subtyped the TCGA-BRCA data with pam50.robust, and it did not give the same result as TCGAbiolinks, whose subtypes were also derived with PAM50 (without the genefu package). Does the prediction depend on the data? Or does the genefu package introduce some differences compared to PAM50 alone?

ChristopherEeles commented 2 years ago

Hi @calisirkubra,

Without inclusion of a reproducible example demonstrating the change in centroid values or cluster assignments, it is difficult for me to help you with that issue. We have not made changes to the included package data in the last year nor have we made any changes to code implemented in the package. I would also need to know information about your R environment, as returned by the sessionInfo function.

Regarding TCGAbiolinks, I am not familiar with that package or the methods it employs for its subtyping analysis. If you think there is an issue with their implementation, I would recommend opening an issue on their repository. I suspect they will also request a minimal reproducible example to investigate any bugs you may have found.

As noted in #22 and #26, the genefu package was designed primarily for subtyping Affymetrix microarray data. While it may produce useful results from RNA sequencing data, doing so requires justification and careful consideration of the differences between the microarray platform used to derive the original signature and the RNA sequencing experiment used for your specific data. We do not recommend simply plugging data into the package and expecting useful results; you will need to apply your best scientific judgment as to whether the results are correct or useful outside the original microarray context this package was designed for.

Best, Christopher Eeles Software Developer Haibe-Kains Lab | PM-Research University Health Network

calisirkubra commented 2 years ago

Dear @ChristopherEeles ,

You can find the centroids and session info below. I also have two further questions.

1) Shouldn't these centroids be the same whatever data I use, since they are the official centroids from Parker 2009?
2) What are your suggestions for adjusting RNA-seq data for the genefu package?

pam50.robust$centroids (now)

| Gene | Basal | Her2 | LumA | LumB | Normal |
| --- | --- | --- | --- | --- | --- |
| ACTR3B | 0.71833189 | -0.48166568 | 0.009981070 | -0.19055133 | 0.46572287 |
| ANLN | 0.53737230 | 0.26693161 | -0.579245716 | 0.09880418 | -0.83693959 |
| BAG1 | -0.57450687 | -0.47607287 | 0.758221161 | -0.40545862 | 0.31655297 |
| BCL2 | -0.11876043 | -0.15791396 | 0.287487440 | -0.44133950 | 0.53397887 |
| BIRC5 | 0.30048864 | 0.40573310 | -0.881434366 | 0.60385078 | -0.87663642 |
| BLVRA | -0.64267751 | 0.33533604 | 0.042042017 | 0.69120496 | -0.16341281 |
| CCNB1 | 0.19120814 | 0.13547665 | -0.491662114 | 0.50317636 | -0.54526931 |
| CCNE1 | 0.56027103 | 0.06687223 | -0.430291227 | -0.01666143 | -0.25547606 |
| CDC20 | 0.39969524 | 0.00835552 | -0.469044010 | -0.07041247 | -0.04550481 |
| CDC6 | 0.15941828 | 0.58900682 | -0.612824305 | 0.51089597 | -0.59575217 |
| CDCA1 | 0.47240017 | -0.02381921 | -0.712520819 | 0.58962688 | -0.37053337 |
| CDH3 | 0.50836201 | 0.21088969 | -0.513649344 | -1.41913444 | 0.75792062 |
| CENPF | 0.48297629 | -0.02926616 | -0.543740234 | 0.27822856 | -0.07058307 |
| CEP55 | 0.56774889 | 0.27638102 | -0.746721735 | 0.46001576 | -1.16237419 |
| CXXC5 | -0.92038581 | -0.24155061 | 0.467411571 | 0.32133502 | 0.05090144 |
| EGFR | -0.03041685 | -0.09638262 | 0.009162963 | -0.41240126 | 0.34163708 |
| ERBB2 | -0.80835398 | 1.75984423 | 0.608191264 | 0.15965187 | -0.87023846 |
| ESR1 | -2.74651309 | -1.51311125 | 2.161411882 | 1.60589991 | -0.41828235 |
| EXO1 | 0.42809036 | 0.04929719 | -0.567474505 | 0.14124128 | -0.45078054 |
| FGFR4 | -0.27123802 | 0.82177815 | 0.170811925 | -0.24703604 | 0.85747278 |
| FOXA1 | -2.62694672 | 0.02282715 | 1.017457421 | 0.36075779 | -0.78281211 |
| FOXC1 | 1.49045147 | -0.94717419 | -0.174957960 | -1.56485496 | 1.11154786 |
| GPR160 | -1.05497467 | 0.58319483 | 0.685489973 | 0.71440760 | -0.42356847 |
| GRB7 | -0.27612859 | 1.03065778 | 0.041568986 | 0.08775089 | 0.24171099 |
| KIF2C | 0.20357258 | -0.16510205 | -0.505394668 | -0.18289071 | -0.39001448 |
| KNTC2 | 0.60035617 | 0.04254679 | -0.588220989 | 0.38670684 | -1.06962886 |
| KRT14 | 0.09682672 | -0.44364614 | 0.368375943 | -0.63944697 | 1.73568631 |
| KRT17 | 0.48256553 | -0.33783710 | 0.014209862 | -1.46374293 | 1.75959844 |
| KRT5 | 0.50664042 | -0.42826178 | 0.215320068 | -0.91160727 | 1.78511690 |
| MAPT | -0.42582927 | -0.35750654 | 0.700622718 | -0.19034057 | 0.11782850 |
| MDM2 | -0.25136621 | -0.10672868 | 0.141957430 | -0.13377904 | 0.27421401 |
| MELK | 0.52303387 | 0.19801311 | -0.582088108 | 0.44793463 | -0.74376468 |
| MIA | 1.57827637 | -0.90489862 | -0.165258584 | -1.42292627 | 2.03885956 |
| MKI67 | 0.47653745 | 0.06566236 | -0.501871622 | -0.14521787 | -0.16600406 |
| MLPH | -0.33997246 | -0.19522866 | 0.339304418 | -0.45614992 | 0.75075837 |
| MMP11 | -0.55603767 | 0.50675876 | -0.006255090 | 0.33419931 | -2.32698512 |
| MYBL2 | 0.38989345 | 0.20526358 | -0.843569931 | 0.46728199 | -0.60170475 |
| MYC | 0.17876381 | -1.04683283 | -0.090830821 | 0.01526440 | 1.02917620 |
| NAT1 | -0.93684895 | -0.08998849 | 2.922786792 | 0.47078804 | -0.36327376 |
| ORC6L | 0.21630480 | 0.20440245 | -0.352220667 | 0.11062765 | -0.25587949 |
| PGR | -0.42913339 | -0.27940992 | 0.445785003 | -0.44883984 | 0.12601148 |
| PHGDH | 0.63451887 | -0.18662586 | -0.398682234 | -1.03013932 | 0.66043775 |
| PTTG1 | 0.26413189 | 0.05580989 | -0.634468270 | 0.24972528 | -0.54978126 |
| RRM2 | 0.15620468 | 0.68272489 | -0.950760200 | 0.35066384 | -1.12105493 |
| SFRP1 | 0.98798846 | -1.04820267 | 0.131566364 | -1.72045826 | 2.43628867 |
| SLC39A6 | -1.05112505 | -0.69573646 | 2.061459075 | 1.65330302 | 0.11688969 |
| TMEM45B | -1.10945818 | 1.33063617 | 0.446242045 | 0.37568823 | 0.03620891 |
| TYMS | 0.44980090 | 0.05294490 | -0.644602075 | 0.49260652 | -0.72698945 |
| UBE2C | 0.21853415 | 0.06108060 | -0.519818399 | 0.29279931 | -0.40889468 |
| UBE2T | 0.38990890 | 0.28453681 | -0.539259391 | 0.73895213 | -0.95238101 |

pam50.robust$centroids (one month ago)

| Gene | Basal | Her2 | LumA | LumB | Normal |
| --- | --- | --- | --- | --- | --- |
| ACTR3B | 0.718331891 | -0.481665675 | 0.00998107 | -0.190551328 | 0.465722871 |
| ANLN | 0.537372301 | 0.266931609 | -0.579245716 | 0.098804179 | -0.836939593 |
| BAG1 | -0.574506867 | -0.476072868 | 0.758221161 | -0.405458622 | 0.316552973 |
| BCL2 | -0.11876043 | -0.157913959 | 0.28748744 | -0.441339498 | 0.533978871 |
| BIRC5 | 0.300488641 | 0.405733099 | -0.881434366 | 0.603850777 | -0.876636424 |
| BLVRA | -0.642677513 | 0.335336041 | 0.042042017 | 0.691204962 | -0.163412812 |
| CCNB1 | 0.191208143 | 0.135476652 | -0.491662114 | 0.503176358 | -0.545269312 |
| CCNE1 | 0.560271028 | 0.066872232 | -0.430291227 | -0.01666143 | -0.255476058 |
| CDC20 | 0.399695242 | 0.00835552 | -0.46904401 | -0.070412466 | -0.04550481 |
| CDC6 | 0.159418279 | 0.58900682 | -0.612824305 | 0.510895969 | -0.595752175 |
| CDCA1 | 0.472400168 | -0.023819207 | -0.712520819 | 0.589626883 | -0.370533365 |
| CDH3 | 0.508362012 | 0.210889692 | -0.513649344 | -1.419134437 | 0.757920624 |
| CENPF | 0.482976288 | -0.02926616 | -0.543740234 | 0.278228556 | -0.070583075 |
| CEP55 | 0.567748894 | 0.276381022 | -0.746721735 | 0.460015762 | -1.162374186 |
| CXXC5 | -0.920385813 | -0.241550612 | 0.467411571 | 0.32133502 | 0.050901436 |
| EGFR | -0.030416849 | -0.096382621 | 0.009162963 | -0.412401259 | 0.341637082 |
| ERBB2 | -0.80835398 | 1.759844231 | 0.608191264 | 0.159651874 | -0.870238456 |
| ESR1 | -2.746513086 | -1.513111253 | 2.161411882 | 1.605899914 | -0.418282349 |
| EXO1 | 0.428090356 | 0.049297194 | -0.567474505 | 0.141241282 | -0.450780537 |
| FGFR4 | -0.271238025 | 0.821778152 | 0.170811925 | -0.247036038 | 0.857472777 |
| FOXA1 | -2.626946721 | 0.022827151 | 1.017457421 | 0.360757795 | -0.782812106 |
| FOXC1 | 1.490451472 | -0.947174192 | -0.17495796 | -1.564854964 | 1.111547864 |
| GPR160 | -1.054974672 | 0.583194826 | 0.685489973 | 0.714407601 | -0.423568467 |
| GRB7 | -0.276128586 | 1.03065778 | 0.041568986 | 0.087750887 | 0.241710991 |
| KIF2C | 0.20357258 | -0.165102048 | -0.505394668 | -0.182890713 | -0.390014484 |
| KNTC2 | 0.600356167 | 0.042546792 | -0.588220989 | 0.386706844 | -1.06962886 |
| KRT14 | 0.096826723 | -0.443646142 | 0.368375943 | -0.639446966 | 1.73568631 |
| KRT17 | 0.482565528 | -0.337837101 | 0.014209862 | -1.463742934 | 1.759598437 |
| KRT5 | 0.506640416 | -0.428261778 | 0.215320068 | -0.91160727 | 1.785116895 |
| MAPT | -0.425829273 | -0.357506541 | 0.700622718 | -0.190340574 | 0.117828498 |
| MDM2 | -0.251366205 | -0.106728681 | 0.14195743 | -0.133779037 | 0.274214011 |
| MELK | 0.523033872 | 0.198013115 | -0.582088108 | 0.44793463 | -0.743764676 |
| MIA | 1.578276368 | -0.904898621 | -0.165258584 | -1.422926267 | 2.038859556 |
| MKI67 | 0.476537448 | 0.06566236 | -0.501871622 | -0.145217868 | -0.166004063 |
| MLPH | -0.339972459 | -0.195228658 | 0.339304418 | -0.456149915 | 0.750758365 |
| MMP11 | -0.556037672 | 0.50675876 | -0.00625509 | 0.334199309 | -2.326985119 |
| MYBL2 | 0.389893452 | 0.20526358 | -0.843569931 | 0.46728199 | -0.601704754 |
| MYC | 0.178763812 | -1.046832832 | -0.090830821 | 0.015264397 | 1.029176203 |
| NAT1 | -0.936848946 | -0.089988492 | 2.922786792 | 0.470788042 | -0.363273764 |
| ORC6L | 0.216304797 | 0.204402449 | -0.352220667 | 0.11062765 | -0.255879493 |
| PGR | -0.429133389 | -0.279409916 | 0.445785003 | -0.448839844 | 0.126011482 |
| PHGDH | 0.63451887 | -0.186625862 | -0.398682234 | -1.030139318 | 0.660437753 |
| PTTG1 | 0.264131894 | 0.055809895 | -0.63446827 | 0.249725281 | -0.54978126 |
| RRM2 | 0.156204676 | 0.682724889 | -0.9507602 | 0.35066384 | -1.12105493 |
| SFRP1 | 0.987988459 | -1.048202667 | 0.131566364 | -1.720458262 | 2.436288668 |
| SLC39A6 | -1.051125052 | -0.695736457 | 2.061459075 | 1.653303022 | 0.116889694 |
| TMEM45B | -1.109458181 | 1.330636172 | 0.446242045 | 0.375688226 | 0.036208914 |
| TYMS | 0.449800897 | 0.052944897 | -0.644602075 | 0.492606521 | -0.726989454 |
| UBE2C | 0.218534147 | 0.061080598 | -0.519818399 | 0.292799306 | -0.408894685 |
| UBE2T | 0.389908898 | 0.284536813 | -0.539259391 | 0.738952133 | -0.952381005 |

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] htmlwidgets_1.5.4 randomForest_4.7-1 png_0.1-7
[4] heatmaply_1.3.0 viridis_0.6.2 viridisLite_0.4.0
[7] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4
[10] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
[13] tidyverse_1.3.1 plotly_4.10.0 gplots_3.1.3
[16] edgeR_3.36.0 limma_3.50.3 DESeq2_1.34.0
[19] SummarizedExperiment_1.24.0 MatrixGenerics_1.6.0 matrixStats_0.62.0
[22] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 IRanges_2.28.0
[25] S4Vectors_0.32.4 InteractiveComplexHeatmap_1.2.0 shinycssloaders_1.0.0
[28] dplyr_1.0.9 ComplexHeatmap_2.10.0 RColorBrewer_1.1-3
[31] DT_0.23 shinydashboard_0.7.2 shiny_1.7.1
[34] TCGAbiolinks_2.22.4 caret_6.0-92 lattice_0.20-45
[37] ggplot2_3.3.6 rmeta_3.0 xtable_1.8-4
[40] genefu_2.26.0 AIMS_1.26.0 Biobase_2.54.0
[43] BiocGenerics_0.40.0 e1071_1.7-9 iC10_1.5
[46] iC10TrainingData_1.3.1 impute_1.68.0 pamr_1.56.1
[49] cluster_2.1.3 biomaRt_2.50.3 survcomp_1.44.1
[52] prodlim_2019.11.13 survival_3.3-1

loaded via a namespace (and not attached):
[1] utf8_1.2.2 R.utils_2.11.0 tidyselect_1.1.2 RSQLite_2.2.14
[5] AnnotationDbi_1.56.2 TSP_1.2-0 BiocParallel_1.28.3 pROC_1.18.0
[9] munsell_0.5.0 codetools_0.2-18 future_1.25.0 withr_2.5.0
[13] colorspace_2.0-3 filelock_1.0.2 knitr_1.39 rstudioapi_0.13
[17] listenv_0.8.0 labeling_0.4.2 GenomeInfoDbData_1.2.7 farver_2.1.0
[21] bit64_4.0.5 downloader_0.4 parallelly_1.31.1 vctrs_0.4.1
[25] generics_0.1.2 ipred_0.9-12 xfun_0.31 BiocFileCache_2.2.1
[29] R6_2.5.1 doParallel_1.0.17 clue_0.3-60 seriation_1.3.5
[33] locfit_1.5-9.5 bitops_1.0-7 cachem_1.0.6 DelayedArray_0.20.0
[37] assertthat_0.2.1 promises_1.2.0.1 scales_1.2.0 nnet_7.3-17
[41] gtable_0.3.0 globals_0.15.0 timeDate_3043.102 rlang_1.0.2
[45] clisymbols_1.2.0 genefilter_1.76.0 systemfonts_1.0.4 GlobalOptions_0.1.2
[49] splines_4.1.2 lazyeval_0.2.2 ModelMetrics_1.2.2.2 broom_0.8.0
[53] yaml_2.3.5 modelr_0.1.8 BiocManager_1.30.17 reshape2_1.4.4
[57] crosstalk_1.2.0 backports_1.4.1 httpuv_1.6.5 rsconnect_0.8.25
[61] tools_4.1.2 lava_1.6.10 ellipsis_0.3.2 kableExtra_1.3.4
[65] jquerylib_0.1.4 proxy_0.4-26 Rcpp_1.0.8.3 plyr_1.8.7
[69] progress_1.2.2 zlibbioc_1.40.0 RCurl_1.98-1.6 prettyunits_1.1.1
[73] rpart_4.1.16 GetoptLong_1.0.5 fontawesome_0.2.2 haven_2.5.0
[77] fs_1.5.2 magrittr_2.0.3 data.table_1.14.2 circlize_0.4.15
[81] reprex_2.0.1 hms_1.1.1 TCGAbiolinksGUI.data_1.14.1 mime_0.12
[85] evaluate_0.15 XML_3.99-0.9 readxl_1.4.0 mclust_5.4.9
[89] gridExtra_2.3 shape_1.4.6 compiler_4.1.2 KernSmooth_2.23-20
[93] crayon_1.5.1 R.oo_1.24.0 htmltools_0.5.2 later_1.3.0
[97] tzdb_0.3.0 geneplotter_1.72.0 lubridate_1.8.0 DBI_1.1.2
[101] SuppDists_1.1-9.7 dbplyr_2.1.1 MASS_7.3-57 rappdirs_0.3.3
[105] Matrix_1.4-1 cli_3.3.0 R.methodsS3_1.8.1 parallel_4.1.2
[109] gower_1.0.0 pkgconfig_2.0.3 registry_0.5-1 recipes_0.2.0
[113] xml2_1.3.3 foreach_1.5.2 svglite_2.1.0 annotate_1.72.0
[117] bslib_0.3.1 hardhat_0.2.0 webshot_0.5.3 XVector_0.34.0
[121] rvest_1.0.2 digest_0.6.29 Biostrings_2.62.0 cellranger_1.1.0
[125] rmarkdown_2.14 dendextend_1.15.2 curl_4.3.2 gtools_3.9.2
[129] rjson_0.2.21 lifecycle_1.0.1 nlme_3.1-157 jsonlite_1.8.0
[133] survivalROC_1.0.3 fansi_1.0.3 pillar_1.7.0 KEGGREST_1.34.0
[137] fastmap_1.1.0 httr_1.4.3 glue_1.6.2 iterators_1.0.14
[141] bit_4.0.4 sass_0.4.1 class_7.3-20 stringi_1.7.6
[145] bootstrap_2019.6 blob_1.2.3 caTools_1.18.2 memoise_2.0.1
[149] future.apply_1.9.0

ChristopherEeles commented 2 years ago

Hi @calisirkubra,

The differences between those two centroid matrices look like rounding error to me. It's possible that the rounding is due to a change in how the values are printed and that the actual data is the same. You could check by testing the two matrices for equality in memory.
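
A minimal sketch of such a check, using only the first three genes as they appear in the two printouts in this thread. The object names `now` and `before` are placeholders; in practice you would compare the saved matrix against `pam50.robust$centroids` loaded from the package rather than retyped numbers.

```r
# First three rows, copied from the two printouts in this thread.
now <- matrix(
  c( 0.71833189, -0.48166568, 0.009981070, -0.19055133,  0.46572287,
     0.53737230,  0.26693161, -0.579245716,  0.09880418, -0.83693959,
    -0.57450687, -0.47607287, 0.758221161, -0.40545862,  0.31655297),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ACTR3B", "ANLN", "BAG1"),
                  c("Basal", "Her2", "LumA", "LumB", "Normal"))
)
before <- matrix(
  c( 0.718331891, -0.481665675, 0.00998107,  -0.190551328,  0.465722871,
     0.537372301,  0.266931609, -0.579245716,  0.098804179, -0.836939593,
    -0.574506867, -0.476072868, 0.758221161, -0.405458622,  0.316552973),
  nrow = 3, byrow = TRUE, dimnames = dimnames(now)
)

# Exact comparison: TRUE only if every stored value matches bit for bit.
exactly_same <- identical(now, before)

# Tolerant comparison: treats values as equal up to a small numeric tolerance.
nearly_same <- isTRUE(all.equal(now, before, tolerance = 1e-7))

# Largest absolute difference between any pair of entries.
max_abs_diff <- max(abs(now - before))

print(exactly_same)
print(nearly_same)
print(max_abs_diff)
```

Because these numbers were retyped from rounded printouts, the exact comparison is expected to fail here; the informative quantities are the tolerant comparison and the size of the largest difference.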

This issue talks a bit more about the options(digits) setting: https://stackoverflow.com/questions/4540649/retain-numerical-precision-in-an-r-data-frame.
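
As a rough illustration of how the display precision can change without the stored data changing, the sketch below formats two Basal-column values from the older printout at different `digits` settings. It uses only base R; whether this is what actually happened in the reporter's session is an assumption.

```r
# Two Basal-column values taken from the older printout in this thread.
basal <- c(ACTR3B = 0.718331891, EGFR = -0.030416849)

# format() chooses enough decimal places for the smallest-magnitude value
# to show `digits` significant digits, then uses that for the whole vector.
shown_7  <- trimws(format(basal, digits = 7))   # R's default precision
shown_10 <- trimws(format(basal, digits = 10))

print(shown_7)
print(shown_10)

# options(digits) changes what print() displays, not what is stored.
old <- options(digits = 10)
print(basal)
options(old)  # restore the previous setting

stored_unchanged <- identical(basal[["ACTR3B"]], 0.718331891)
print(stored_unchanged)
```

At the default of 7 significant digits, the formatted values should come out in the same 8-decimal form as the newer printout, which is consistent with a display difference rather than a data change.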

Either way, the difference, if it is a real one, is in the deep decimal places and is unlikely to affect the classification results.

Regarding the use of RNA-sequencing data, we generally do not recommend this. If you plan to do so, please use your best scientific judgment. See #22 for a discussion of some of the issues with using microarray-derived signatures on RNA-sequencing data.
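
To make that concrete, here is a toy nearest-centroid sketch in base R. It is not genefu's implementation: it assumes the class is assigned by the highest Spearman correlation with each centroid column, uses only six genes from the older printout, and the `sample_profile` values are invented for illustration.

```r
# Centroid values for six genes, copied from the older printout.
centroids <- matrix(
  c(-2.746513086, -1.513111253,  2.161411882,  1.605899914, -0.418282349,
    -0.80835398,   1.759844231,  0.608191264,  0.159651874, -0.870238456,
    -2.626946721,  0.022827151,  1.017457421,  0.360757795, -0.782812106,
     1.490451472, -0.947174192, -0.17495796,  -1.564854964,  1.111547864,
     0.506640416, -0.428261778,  0.215320068, -0.91160727,   1.785116895,
     0.476537448,  0.06566236,  -0.501871622, -0.145217868, -0.166004063),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("ESR1", "ERBB2", "FOXA1", "FOXC1", "KRT5", "MKI67"),
                  c("Basal", "Her2", "LumA", "LumB", "Normal"))
)

# Hypothetical, already-scaled expression profile for one sample.
sample_profile <- c(ESR1 = 2.0, ERBB2 = 0.5, FOXA1 = 1.1,
                    FOXC1 = -0.3, KRT5 = 0.1, MKI67 = -0.6)

# Assign the subtype whose centroid has the highest rank correlation.
nearest_centroid <- function(profile, centroids) {
  rho <- cor(profile, centroids, method = "spearman")
  colnames(centroids)[which.max(rho)]
}

call_full    <- nearest_centroid(sample_profile, centroids)
call_rounded <- nearest_centroid(sample_profile, round(centroids, 8))

print(c(full = call_full, rounded = call_rounded))
```

Rank correlation depends only on the ordering of the centroid values within each column, so rounding at the eighth decimal place would change a call only if it reordered or tied two values that are almost identical.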

Best, Christopher Eeles Software Developer Haibe-Kains Lab | PM-Research University Health Network