LundTax2023Classifier

This package implements a Random Forest rule-based single-sample predictor that classifies gene expression samples into the 5 main Lund Taxonomy molecular subtypes, or 7 when the Uro subclasses are counted separately. The final classifier consists of two predictors applied sequentially: first, a sample is classified into one of the 5 main classes (Uro, GU, BaSq, Mes or ScNE); then, samples classified as Uro are subclassified into UroA, UroB or UroC by a second predictor. The package includes a sample dataset (Lund2017) for trying out the classifier.

Installation

You can install LundTax2023Classifier from GitHub with:

# install.packages("devtools")
devtools::install_github("LundBladderCancerGroup/LundTaxonomy2023Classifier")

Usage

Prediction

predict_LundTax2023(data, 
                    include_data = FALSE,
                    include_scores = TRUE,
                    gene_id = c("hgnc_symbol","ensembl_gene_id","entrezgene")[1],
                    ...)

data is a matrix, data frame or multiclassPairs_object of gene expression values with genes in rows and samples in columns. A single sample can be classified, but it must still be supplied in matrix format: one column, with gene identifiers as row names. The default gene identifier is the HGNC symbol; Ensembl gene IDs and Entrez gene IDs are also accepted.
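
As a minimal sketch of the single-sample case, the snippet below turns a named expression vector into a one-column matrix and shows how to keep the matrix shape when extracting one sample from a larger matrix. The gene names, values and sample labels are made up for illustration; real input needs the genes used by the classifier.

```r
# Hypothetical expression values for one sample, named by HGNC symbol
expr_vector <- c(KRT5 = 7.2, GATA3 = 9.1, TP63 = 6.4)

# One column per sample, gene identifiers as row names
single_sample <- matrix(
  expr_vector,
  ncol = 1,
  dimnames = list(names(expr_vector), "sample_1")
)

# When taking one sample out of a larger matrix, drop = FALSE keeps the
# matrix shape instead of collapsing the column to a plain vector
cohort <- cbind(single_sample, sample_2 = c(5.0, 8.3, 7.7))
first_sample <- cohort[, 1, drop = FALSE]
```

An object shaped like single_sample or first_sample, built from real expression data, is what the data argument expects for a single sample.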

include_data is a logical value indicating if the gene expression values should be included in the results object.

include_scores is a logical value indicating if the prediction scores should be included in the results object.

gene_id is a character value specifying the type of gene identifier used in the data:

hgnc_symbol for HGNC symbols (default)
ensembl_gene_id for Ensembl gene IDs
entrezgene for Entrez IDs
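
If it is unclear which identifier type a dataset uses, a rough check on the row names can help pick the gene_id value. The helper below, guess_gene_id, is hypothetical and not part of the package. It is only a heuristic and rests on two assumptions: human Ensembl gene IDs start with "ENSG", and Entrez IDs are purely numeric. Anything else is treated as an HGNC symbol.

```r
# Hypothetical helper (not part of LundTax2023Classifier): guess which
# gene_id value matches a vector of gene identifiers
guess_gene_id <- function(ids) {
  if (all(grepl("^ENSG[0-9]+", ids))) {
    # Human Ensembl gene IDs; version suffixes such as ".17" are not handled here
    "ensembl_gene_id"
  } else if (all(grepl("^[0-9]+$", ids))) {
    # Entrez IDs are plain integers
    "entrezgene"
  } else {
    # Fallback: assume HGNC symbols
    "hgnc_symbol"
  }
}

guess_gene_id(c("TP53", "BRCA1"))
guess_gene_id(c("ENSG00000141510", "ENSG00000012048"))
guess_gene_id(c("7157", "672"))
```

The returned string can then be passed as gene_id, for example predict_LundTax2023(my_data, gene_id = guess_gene_id(rownames(my_data))), where my_data is a placeholder for your own expression matrix.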

predict_LundTax2023 can impute genes that are missing from the data. Enable this by passing the impute = TRUE argument.

Example

library(LundTax2023Classifier)
results <- predict_LundTax2023(Lund2017)
str(results)
#> List of 3
#>  $ scores              : num [1:301, 1:8] 0.9924 0.0016 0.0682 0.993 0.9808 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>   .. ..$ : chr [1:8] "Uro" "UroA" "UroB" "UroC" ...
#>  $ predictions_7classes: Named chr [1:301] "UroA" "Mes" "GU" "UroA" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>  $ predictions_5classes: Named chr [1:301] "Uro" "Mes" "GU" "Uro" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...

# Include data in result
results_data <- predict_LundTax2023(Lund2017,
                               include_data = TRUE)
str(results_data)
#> List of 4
#>  $ data                : num [1:15697, 1:301] 4.25 8.35 6.69 7.26 3.74 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:15697] "A1CF" "A2M" "A2ML1" "A4GALT" ...
#>   .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>  $ scores              : num [1:301, 1:8] 0.9924 0.0016 0.0682 0.993 0.9808 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>   .. ..$ : chr [1:8] "Uro" "UroA" "UroB" "UroC" ...
#>  $ predictions_7classes: Named chr [1:301] "UroA" "Mes" "GU" "UroA" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#>  $ predictions_5classes: Named chr [1:301] "Uro" "Mes" "GU" "Uro" ...
#>   ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...

# Imputation
# Remove 100 randomly chosen genes from the data
# (the draw is random, so the genes reported as missing differ between runs)
missing_genes <- sample(seq_len(nrow(Lund2017)), 100)
Lund2017_missinggenes <- Lund2017[-missing_genes, ]
results_imputation <- predict_LundTax2023(Lund2017_missinggenes,
                                          impute = TRUE)
#> These genes are not found in the data:
#> DSP JUP TMEM42 EXT1 WAS ACYP1 HEPACAM2
#> Gene names should as rownames and sample names as columns!
#> Check the genes in classifier object to see all the needed genes.
#> Check if '-' or ',' symbols in the gene names in your data. You may need to change it to '_' or '.'
#> Missed genes will be imputed to the closest class for each sample!
#> These genes have NAs:
#> DSP JUP TMEM42 EXT1 WAS ACYP1 HEPACAM2
#> These genes will be imputed to the closest class for each sample with NAs
#> These genes are not found in the data:
#> P4HA1 ARHGEF10L WAS AOC2 EXT1 DUSP16 PI4KB FCRL5 SLC2A3
#> Gene names should as rownames and sample names as columns!
#> Check the genes in classifier object to see all the needed genes.
#> Check if '-' or ',' symbols in the gene names in your data. You may need to change it to '_' or '.'
#> Missed genes will be imputed to the closest class for each sample!
#> These genes have NAs:
#> P4HA1 ARHGEF10L WAS AOC2 EXT1 DUSP16 PI4KB FCRL5 SLC2A3
#> These genes will be imputed to the closest class for each sample with NAs

The classifier returns a list of up to 4 elements:

data: the original gene expression values.
scores: a matrix containing prediction scores for 8 classes (Uro, UroA, UroB, UroC, GU, BaSq, Mes and ScNE).
predictions_7classes: a named vector of 7-class labels (Uro samples are reported as UroA, UroB or UroC), with sample names as names.
predictions_5classes: a named vector of 5-class labels (the Uro subclasses are reported as Uro), with sample names as names.

Both data and scores can be included in or excluded from the final output with the include_data and include_scores parameters, respectively.
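
The sketch below shows one way to summarise such a results list with base R. To stay runnable without the package, it builds a small mock object (mock_results) that mirrors the structure printed by str(results) above; the sample names and score values are invented for illustration.

```r
# Mock results object with the same element names and shapes as the real
# output; all values are invented
classes <- c("Uro", "UroA", "UroB", "UroC", "GU", "BaSq", "Mes", "ScNE")
samples <- c("s1", "s2", "s3")
mock_results <- list(
  scores = matrix(
    c(0.99, 0.90, 0.06, 0.04, 0.00, 0.01, 0.00, 0.00,
      0.01, 0.40, 0.35, 0.25, 0.02, 0.03, 0.94, 0.00,
      0.05, 0.50, 0.30, 0.20, 0.92, 0.02, 0.01, 0.00),
    nrow = 3, byrow = TRUE, dimnames = list(samples, classes)
  ),
  predictions_7classes = c(s1 = "UroA", s2 = "Mes", s3 = "GU"),
  predictions_5classes = c(s1 = "Uro", s2 = "Mes", s3 = "GU")
)

# Number of samples per main subtype
subtype_counts <- table(mock_results$predictions_5classes)

# One row per sample: both label sets plus the Uro score,
# matched to the scores matrix by sample name
sample_ids <- names(mock_results$predictions_7classes)
summary_df <- data.frame(
  sample = sample_ids,
  subtype_5 = unname(mock_results$predictions_5classes),
  subtype_7 = unname(mock_results$predictions_7classes),
  uro_score = unname(mock_results$scores[sample_ids, "Uro"]),
  stringsAsFactors = FALSE
)
```

The same code should work on a real results object as long as include_scores = TRUE was used, since it relies only on the element names listed above.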

Plotting

A plotting function is included to draw a heatmap showing genes, gene signatures and scores of interest. This function requires the ComplexHeatmap package.

plot_signatures(
  results_object,
  data = NULL,
  title = "",
  gene_id = c("hgnc_symbol","ensembl_gene_id","entrezgene")[1],
  annotation = c("5 classes", "7 classes")[2],
  plot_scores = TRUE,
  show_ann_legend = FALSE,
  show_hm_legend = FALSE,
  set_order = NULL,
  ann_height = 6,
  font.size = 8,
  norm = c("scale", NULL)[1]
)

Parameters:

results_object is a list resulting from applying the predict_LundTax2023 function.
data is a matrix, data frame or multiclassPairs_object of gene expression values with gene identifiers in rows and samples in columns. It can be supplied when the results_object does not include the data; samples must be in the same order as in the results object.
gene_id is a character value specifying the type of gene identifier used in the data:
- hgnc_symbol for HGNC symbols (default)
- ensembl_gene_id for Ensembl gene IDs
- entrezgene for Entrez IDs
title is the title for the heatmap.
annotation is a character value indicating if 5-class ("5 classes") or 7-class ("7 classes") annotations should be plotted.
plot_scores is a logical value indicating if the prediction scores should be plotted.
show_ann_legend is a logical value indicating if the annotation legend should be shown.
show_hm_legend is a logical value indicating if the heatmap legend should be shown.
set_order optionally specifies a custom sample order for the heatmap; default is NULL.
ann_height is the annotation height in cm; default is 6.
font.size is the font size; default is 8.
norm indicates if data should be scaled/Z-normalized ("scale", the default). If NULL, data is plotted as is.

Example

# Including data in results object
results <- predict_LundTax2023(Lund2017,
                              include_data = TRUE)
plot_signatures(results)
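
When the results object does not contain the data, the expression matrix is passed separately and its samples must be in the same order as in the results. The sketch below checks that before plotting. samples_match is a hypothetical helper, not part of the package, and the toy objects are invented for illustration. The call to plot_signatures only runs if both LundTax2023Classifier and ComplexHeatmap are installed.

```r
# Hypothetical helper: TRUE when the columns of `data` match the sample
# names stored in the results object, in the same order
samples_match <- function(results_object, data) {
  identical(colnames(data), names(results_object$predictions_7classes))
}

# Toy objects to show the check (values are invented)
toy_data <- matrix(1:6, nrow = 2,
                   dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
toy_results <- list(predictions_7classes = c(s1 = "UroA", s2 = "Mes", s3 = "GU"))

samples_match(toy_results, toy_data)          # columns in the same order
samples_match(toy_results, toy_data[, 3:1])   # columns reversed

# The same pattern with the packaged example data
if (requireNamespace("LundTax2023Classifier", quietly = TRUE) &&
    requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  library(LundTax2023Classifier)
  results <- predict_LundTax2023(Lund2017)    # data not stored in results
  stopifnot(samples_match(results, Lund2017))
  plot_signatures(results, data = Lund2017, annotation = "5 classes")
}
```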
