This package implements a Random Forest rule-based single-sample predictor that classifies gene expression samples into the 5 (or 7, including subclasses) Lund Taxonomy molecular subtypes. The final classifier is composed of two separate predictors applied sequentially: a sample is first classified as one of the 5 main classes (Uro, GU, BaSq, Mes or ScNE), and samples classified as Uro are then subclassified into UroA, UroB or UroC by a second predictor. The package includes a sample dataset (Lund2017) to run the classifier.
You can install LundTax2023Classifier from GitHub with:
# install.packages("devtools")
devtools::install_github("LundBladderCancerGroup/LundTaxonomy2023Classifier")
predict_LundTax2023(data,
include_data = FALSE,
include_scores = TRUE,
gene_id = c("hgnc_symbol","ensembl_gene_id","entrezgene")[1],
...)
Where data
is a matrix, data frame or multiclassPairs_object of gene
expression values with genes in rows and samples in columns. A single
sample can be classified, but it must still be in matrix format: one
column, with gene identifiers as row names. The default gene identifier
is HGNC symbols, but Ensembl gene IDs or Entrez gene IDs can also be
used.
include_data
is a logical value indicating if the gene expression
values should be included in the results object.
include_scores
is a logical value indicating if the prediction scores
should be included in the results object.
gene_id
is a character value specifying the type of gene identifier used in
the data:
hgnc_symbol
for HGNC symbols (default)
ensembl_gene_id
for Ensembl gene IDs
entrezgene
for Entrez IDs
The predict function includes an imputation feature to handle missing
genes in the data. This can be accessed by adding the impute = TRUE
argument.
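Because a single sample must still be supplied as a one-column matrix, it helps to remember that selecting one column in R drops the matrix structure by default. The sketch below shows the `drop = FALSE` idiom on a small toy matrix (the values and sample names are made up for illustration), followed by a guarded call on the bundled Lund2017 data that assumes the package is installed.

```r
# Toy expression matrix: genes in rows, samples in columns (illustrative values)
expr <- matrix(
  c(4.25, 8.35, 6.69,
    5.10, 7.90, 6.02),
  nrow = 3,
  dimnames = list(c("A1CF", "A2M", "A2ML1"), c("sample_1", "sample_2"))
)

# Plain subsetting returns a named vector, which loses the sample name
as_vector <- expr[, "sample_1"]

# drop = FALSE keeps a one-column matrix with gene identifiers as rownames
single_sample <- expr[, "sample_1", drop = FALSE]
dim(single_sample)

# The same idiom on the bundled dataset (runs only if the package is installed)
if (requireNamespace("LundTax2023Classifier", quietly = TRUE)) {
  library(LundTax2023Classifier)
  one_sample <- Lund2017[, 1, drop = FALSE]
  result_one <- predict_LundTax2023(one_sample)
  result_one$predictions_7classes
}
```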
library(LundTax2023Classifier)
results <- predict_LundTax2023(Lund2017)
str(results)
#> List of 3
#> $ scores : num [1:301, 1:8] 0.9924 0.0016 0.0682 0.993 0.9808 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#> .. ..$ : chr [1:8] "Uro" "UroA" "UroB" "UroC" ...
#> $ predictions_7classes: Named chr [1:301] "UroA" "Mes" "GU" "UroA" ...
#> ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#> $ predictions_5classes: Named chr [1:301] "Uro" "Mes" "GU" "Uro" ...
#> ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
# Include data in result
results_data <- predict_LundTax2023(Lund2017,
include_data = TRUE)
str(results_data)
#> List of 4
#> $ data : num [1:15697, 1:301] 4.25 8.35 6.69 7.26 3.74 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:15697] "A1CF" "A2M" "A2ML1" "A4GALT" ...
#> .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#> $ scores : num [1:301, 1:8] 0.9924 0.0016 0.0682 0.993 0.9808 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#> .. ..$ : chr [1:8] "Uro" "UroA" "UroB" "UroC" ...
#> $ predictions_7classes: Named chr [1:301] "UroA" "Mes" "GU" "UroA" ...
#> ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
#> $ predictions_5classes: Named chr [1:301] "Uro" "Mes" "GU" "Uro" ...
#> ..- attr(*, "names")= chr [1:301] "p1404_1.CEL" "2.CEL" "3.CEL" "4.CEL" ...
# Imputation
# Remove 100 genes from data
missing_genes <- sample(1:nrow(Lund2017),100)
Lund2017_missinggenes <- Lund2017[-missing_genes,]
results_imputation <- predict_LundTax2023(Lund2017_missinggenes,
impute = TRUE)
#> These genes are not found in the data:
#> DSP JUP TMEM42 EXT1 WAS ACYP1 HEPACAM2
#> Gene names should as rownames and sample names as columns!
#> Check the genes in classifier object to see all the needed genes.
#> Check if '-' or ',' symbols in the gene names in your data. You may need to change it to '_' or '.'
#> Missed genes will be imputed to the closest class for each sample!
#> These genes have NAs:
#> DSP JUP TMEM42 EXT1 WAS ACYP1 HEPACAM2
#> These genes will be imputed to the closest class for each sample with NAs
#> These genes are not found in the data:
#> P4HA1 ARHGEF10L WAS AOC2 EXT1 DUSP16 PI4KB FCRL5 SLC2A3
#> Gene names should as rownames and sample names as columns!
#> Check the genes in classifier object to see all the needed genes.
#> Check if '-' or ',' symbols in the gene names in your data. You may need to change it to '_' or '.'
#> Missed genes will be imputed to the closest class for each sample!
#> These genes have NAs:
#> P4HA1 ARHGEF10L WAS AOC2 EXT1 DUSP16 PI4KB FCRL5 SLC2A3
#> These genes will be imputed to the closest class for each sample with NAs
The classifier returns a list of up to 4 elements:
data
original gene expression values.
scores
matrix containing prediction scores for 8 classes (Uro,
UroA, UroB, UroC, GU, BaSq, Mes and ScNE).
predictions_7classes
named vector, with sample names as names and
7-class subtype labels as values.
predictions_5classes
named vector, with sample names as names and
5-class subtype labels as values.
Both data and scores can be included in or excluded from the final output using the include_data and include_scores parameters, respectively.
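A short sketch of how the returned list might be inspected. To keep it runnable without the package, it builds a small mock object with the same element names as the real output; the sample names, scores and labels are invented. The collapsing step assumes, as the introduction describes, that the 7-class labels differ from the 5-class labels only in the Uro subclasses.

```r
# Mock results object mirroring the structure returned by predict_LundTax2023
# (all values are invented for illustration)
classes <- c("Uro", "UroA", "UroB", "UroC", "GU", "BaSq", "Mes", "ScNE")
samples <- c("s1", "s2", "s3")
results <- list(
  scores = matrix(
    c(0.90, 0.80, 0.10, 0.10, 0.05, 0.03, 0.01, 0.01,
      0.02, 0.30, 0.40, 0.30, 0.03, 0.05, 0.88, 0.02,
      0.10, 0.20, 0.50, 0.30, 0.85, 0.03, 0.01, 0.01),
    nrow = 3, byrow = TRUE, dimnames = list(samples, classes)
  ),
  predictions_7classes = c(s1 = "UroA", s2 = "Mes", s3 = "GU"),
  predictions_5classes = c(s1 = "Uro", s2 = "Mes", s3 = "GU")
)

# Number of samples per subtype
table(results$predictions_7classes)

# Scores for one sample, looked up by sample name
results$scores["s2", ]

# Collapsing the Uro subclasses should recover the 5-class labels
collapsed <- sub("^Uro[ABC]$", "Uro", results$predictions_7classes)
all(collapsed == results$predictions_5classes)
```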
A plotting function is included to draw a heatmap showing genes, gene signatures and scores of interest. This function requires the ComplexHeatmap package.
plot_signatures(
results_object,
data = NULL,
title = "",
gene_id = c("hgnc_symbol","ensembl_gene_id","entrezgene")[1],
annotation = c("5 classes", "7 classes")[2],
plot_scores = TRUE,
show_ann_legend = FALSE,
show_hm_legend = FALSE,
set_order = NULL,
ann_height = 6,
font.size = 8,
norm = c("scale", NULL)[1]
)
Parameters:
results_object
is a list resulting from applying the
predict_LundTax2023 function
data
is a matrix, data frame or multiclassPairs_object of gene
expression values with gene identifiers in rows and samples in
columns. Provide this when the results_object does not include
the data; samples must be in the same order as in the results
object.
gene_id
character value specifying the type of gene identifier used
in the data:
hgnc_symbol
for HGNC symbols (default)
ensembl_gene_id
for Ensembl gene IDs
entrezgene
for Entrez IDs
title
title for the heatmap.
annotation
is a character value indicating whether 5-class (“5 classes”) or
7-class (“7 classes”) annotations should be plotted.
plot_scores
is a logical value indicating if the prediction scores
should be plotted.
show_ann_legend
is a logical value indicating if the annotation
legend should be shown.
show_hm_legend
is a logical value indicating if the heatmap legend
should be shown.
set_order
controls the order of samples in the heatmap; the default is
NULL.
ann_height
annotation height in cm, default is 6.
font.size
font size, default is 8.
norm
indicates if data should be scaled/Z-normalized. If “NULL”,
data is plotted as is.
# Including data in results object
results <- predict_LundTax2023(Lund2017,
include_data = TRUE)
plot_signatures(results)
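When the results object was created without include_data = TRUE, the expression matrix can be passed separately through the data argument. A minimal sketch, assuming both LundTax2023Classifier and ComplexHeatmap are installed; the argument values are only examples drawn from the plot_signatures signature, and the heatmap title is arbitrary.

```r
# Run only when the required packages are available
can_plot <- requireNamespace("LundTax2023Classifier", quietly = TRUE) &&
  requireNamespace("ComplexHeatmap", quietly = TRUE)

if (can_plot) {
  library(LundTax2023Classifier)

  # Results without the expression data (include_data defaults to FALSE)
  results_nodata <- predict_LundTax2023(Lund2017)

  # Supply the data separately; samples are in the same order as in the results
  plot_signatures(results_nodata,
                  data = Lund2017,
                  title = "Lund2017",
                  annotation = "5 classes",
                  show_ann_legend = TRUE)
}
```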