LundBladderCancerGroup / LundTaxonomy2023Classifier

GNU General Public License v2.0

LundTax2023Classifier

This package implements a Random Forest rule-based single-sample predictor that classifies gene expression samples into the 5 main Lund Taxonomy molecular subtypes, or 7 when the Uro subclasses are counted separately. The final classifier consists of two predictors applied sequentially: first, a sample is classified into one of the 5 main classes (Uro, GU, BaSq, Mes or ScNE); then, samples classified as Uro are subclassified into UroA, UroB or UroC by a second predictor. The package includes a sample dataset (Lund2017) for running the classifier.
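The two-step design can be illustrated with a toy sketch. The helper classify_two_stage and the score values below are hypothetical and only mimic the control flow (main class first, Uro subclass second); they are not the package's Random Forest predictors.

```r
# Hypothetical helper: picks the main class, then refines Uro samples only.
classify_two_stage <- function(main_scores, uro_scores) {
  # Step 1: main class = highest-scoring of the 5 main classes
  main_class <- names(main_scores)[which.max(main_scores)]
  # Step 2: only samples called Uro are passed to the subclass step
  if (main_class == "Uro") {
    sub_class <- names(uro_scores)[which.max(uro_scores)]
  } else {
    sub_class <- main_class
  }
  c(class_5 = main_class, class_7 = sub_class)
}

# Made-up scores for two samples
uro_like <- classify_two_stage(
  main_scores = c(Uro = 0.90, GU = 0.05, BaSq = 0.03, Mes = 0.01, ScNE = 0.01),
  uro_scores  = c(UroA = 0.70, UroB = 0.20, UroC = 0.10)
)
mes_like <- classify_two_stage(
  main_scores = c(Uro = 0.10, GU = 0.05, BaSq = 0.05, Mes = 0.75, ScNE = 0.05),
  uro_scores  = c(UroA = 0.40, UroB = 0.30, UroC = 0.30)
)
print(uro_like)
print(mes_like)
```

A non-Uro sample keeps the same label at both levels, which is why the 5-class and 7-class outputs differ only for Uro samples.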

Installation

You can install LundTax2023Classifier from GitHub with:

# install.packages("devtools")
devtools::install_github("LundBladderCancerGroup/LundTaxonomy2023Classifier")

Usage

Prediction

predict_LundTax2023(data, 
                    include_data = FALSE,
                    include_scores = TRUE,
                    gene_id = c("hgnc_symbol","ensembl_gene_id","entrezgene")[1],
                    ...)

Where data is a matrix, data frame or multiclassPairs_object of gene expression values with genes in rows and samples in columns. A single sample can also be classified, but it must still be in matrix format: one column, with gene identifiers as rownames. The default gene identifier is the HGNC symbol; Ensembl gene IDs and Entrez gene IDs are also accepted.

include_data is a logical value indicating if the gene expression values should be included in the results object.

include_scores is a logical value indicating if the prediction scores should be included in the results object.

gene_id is a character value specifying the type of gene identifier used in the data: "hgnc_symbol" (default), "ensembl_gene_id" or "entrezgene".

The predict function includes an imputation feature to handle missing genes in the data. This can be accessed by adding the impute = TRUE argument.
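A single sample must stay in matrix form, and subsetting one column in R drops the matrix structure by default. A minimal sketch with a made-up expression matrix (the gene symbols appear in the sample dataset's row names; the values and sample names are invented):

```r
# Toy matrix: genes in rows, samples in columns (values are invented)
expr <- matrix(
  c(4.2, 8.3, 6.7,
    5.1, 7.9, 6.2),
  nrow = 3,
  dimnames = list(c("A1CF", "A2M", "A2ML1"), c("sample_1", "sample_2"))
)

# Default subsetting returns a plain named vector and loses the sample name
as_vector <- expr[, "sample_1"]

# drop = FALSE keeps a one-column matrix with gene identifiers as rownames
one_sample <- expr[, "sample_1", drop = FALSE]

is.matrix(as_vector)   # FALSE
dim(one_sample)        # 3 1
```

With the bundled dataset, the equivalent call would presumably be predict_LundTax2023(Lund2017[, 1, drop = FALSE]).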

Example

library(LundTax2023Classifier)
results <- predict_LundTax2023(Lund2017)
str(results)
#> List of 3
#>  $ scores              : num [1:301, 1:8] 0.9924 0.0016 0.0682 0.993 0.9808 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>   .. ..$ : chr [1:8] "Uro" "UroA" "UroB" "UroC" ...
#>  $ predictions_7classes: Named chr [1:301] "UroA" "Mes" "GU" "UroA" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>  $ predictions_5classes: Named chr [1:301] "Uro" "Mes" "GU" "Uro" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...

# Include data in result
results_data <- predict_LundTax2023(Lund2017,
                               include_data = TRUE)
str(results_data)
#> List of 4
#>  $ data                : num [1:15697, 1:301] 4.25 8.35 6.69 7.26 3.74 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:15697] "A1CF" "A2M" "A2ML1" "A4GALT" ...
#>   .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>  $ scores              : num [1:301, 1:8] 0.9924 0.0016 0.0682 0.993 0.9808 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>   .. ..$ : chr [1:8] "Uro" "UroA" "UroB" "UroC" ...
#>  $ predictions_7classes: Named chr [1:301] "UroA" "Mes" "GU" "UroA" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>  $ predictions_5classes: Named chr [1:301] "Uro" "Mes" "GU" "Uro" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...

# Imputation
# Remove 100 genes from data
missing_genes <- sample(1:nrow(Lund2017),100)
Lund2017_missinggenes <- Lund2017[-missing_genes,]
results_imputation <- predict_LundTax2023(Lund2017_missinggenes,
                                          impute = TRUE)
#> These genes are not found in the data:
#> DSP JUP TMEM42 EXT1 WAS ACYP1 HEPACAM2
#> Gene names should as rownames and sample names as columns!
#> Check the genes in classifier object to see all the needed genes.
#> Check if '-' or ',' symbols in the gene names in your data. You may need to change it to '_' or '.'
#> Missed genes will be imputed to the closest class for each sample!
#> These genes have NAs:
#> DSP JUP TMEM42 EXT1 WAS ACYP1 HEPACAM2
#> These genes will be imputed to the closest class for each sample with NAs
#> These genes are not found in the data:
#> P4HA1 ARHGEF10L WAS AOC2 EXT1 DUSP16 PI4KB FCRL5 SLC2A3
#> Gene names should as rownames and sample names as columns!
#> Check the genes in classifier object to see all the needed genes.
#> Check if '-' or ',' symbols in the gene names in your data. You may need to change it to '_' or '.'
#> Missed genes will be imputed to the closest class for each sample!
#> These genes have NAs:
#> P4HA1 ARHGEF10L WAS AOC2 EXT1 DUSP16 PI4KB FCRL5 SLC2A3
#> These genes will be imputed to the closest class for each sample with NAs

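The classifier returns a list of up to 4 elements: data, the gene expression matrix (only when include_data = TRUE); scores, a numeric matrix of prediction scores with one row per sample (only when include_scores = TRUE); predictions_7classes, a named character vector with the 7-class prediction for each sample; and predictions_5classes, a named character vector with the 5-class prediction for each sample.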

Both data and scores can be included in or excluded from the final output with the include_data and include_scores parameters, respectively.
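Because the predictions are named character vectors, they can be summarized with base R. The sketch below builds a small mock results list shaped like the str() output shown earlier (sample names and labels are invented) rather than calling the classifier:

```r
# Mock results list mimicking the structure shown by str(results)
mock_results <- list(
  predictions_7classes = c(s1 = "UroA", s2 = "Mes", s3 = "GU", s4 = "UroA", s5 = "UroC"),
  predictions_5classes = c(s1 = "Uro",  s2 = "Mes", s3 = "GU", s4 = "Uro",  s5 = "Uro")
)

# Number of samples per subtype
counts_7 <- table(mock_results$predictions_7classes)
print(counts_7)

# Collapse the Uro subclasses back to Uro and check that both levels agree
collapsed <- sub("^Uro[ABC]$", "Uro", mock_results$predictions_7classes)
levels_agree <- all(collapsed == mock_results$predictions_5classes)

# One row per sample, convenient for export or merging with other tables
pred_df <- data.frame(
  sample  = names(mock_results$predictions_7classes),
  class_5 = unname(mock_results$predictions_5classes),
  class_7 = unname(mock_results$predictions_7classes),
  stringsAsFactors = FALSE
)
print(pred_df)
```

The same calls should work on a real results object, since they only rely on the two prediction vectors.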

Plotting

A plotting function is included to draw a heatmap showing genes, gene signatures and scores of interest. This function requires the ComplexHeatmap package.

plot_signatures(
  results_object,
  data = NULL,
  title = "",
  gene_id = c("hgnc_symbol","ensembl_gene_id","entrezgene")[1],
  annotation = c("5 classes", "7 classes")[2],
  plot_scores = TRUE,
  show_ann_legend = FALSE,
  show_hm_legend = FALSE,
  set_order = NULL,
  ann_height = 6,
  font.size = 8,
  norm = c("scale", NULL)[1]
)

Parameters:

Example

# Including data in results object
results <- predict_LundTax2023(Lund2017,
                              include_data = TRUE)
plot_signatures(results)
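To keep the heatmap as a file, the call can be wrapped in a graphics device. This is a sketch under two assumptions: that plot_signatures() draws on the active graphics device, and that both packages are installed (the guard skips the example otherwise). The output path is arbitrary.

```r
# Guard: only run when both packages are installed
have_pkgs <- requireNamespace("LundTax2023Classifier", quietly = TRUE) &&
  requireNamespace("ComplexHeatmap", quietly = TRUE)

if (have_pkgs) {
  library(LundTax2023Classifier)
  results <- predict_LundTax2023(Lund2017, include_data = TRUE)

  # Assumption: plot_signatures() draws on the currently open device
  out_file <- file.path(tempdir(), "lundtax_signatures.pdf")
  pdf(out_file, width = 10, height = 8)
  plot_signatures(results,
                  title = "Lund2017",
                  annotation = "5 classes",
                  show_ann_legend = TRUE)
  dev.off()
} else {
  message("Install LundTax2023Classifier and ComplexHeatmap to run this example.")
}
```

All arguments used here (title, annotation, show_ann_legend) come from the function signature above.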