Customize training data

pikapika505 commented 1 year ago

I was very pleased that Ikarus predicted malignant cells with 0.93 accuracy in hepatocellular carcinoma (HCC) scRNA-seq data. I am planning to use it to find malignant cells in other unannotated HCC scRNA-seq datasets. So, I thought I could simply add an annotated HCC dataset to find a new tumor gene set to make it more specific to HCC datasets.

Is it possible to do so?
If yes, how was the "major_hallmark_corrected" column derived?
Also, what are the major, tier_0 ... tier_3 columns in adata.obs?

Thank you, Yulia

melonheader commented 1 year ago

Hello @YuliaInn

Very glad to hear that our package was of help!

For your questions:

The short answer is yes.
The "major_hallmark_corrected" column is constructed from the Tumor/Normal classes of the input dataset (the major column) by adjusting the labels with scores from cancer hallmark gene sets.
The major column contains the Tumor/Normal class labels from the input dataset. The tier_0, tier_1, and so on columns follow the naming convention for hierarchical data annotation used by Ikarus. Namely, the high-order classification (Tumor/Normal) is stored in tier_0; lower-level classifications, such as tissue or cell type, are stored in tier_1 and tier_2, respectively. Essentially, the convention eases use and avoids chaotic column names. In most cases, one would only use tier_0 and tier_1, so the other columns can safely be filled with NaNs.
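To make the column layout concrete, here is a minimal sketch of an obs table that follows the convention described above. It uses a plain pandas DataFrame as a stand-in for adata.obs; the cell names and the "Liver" label are made up for illustration, and nothing here calls Ikarus itself.

```python
import numpy as np
import pandas as pd

# Hypothetical cells and annotations, for illustration only.
obs = pd.DataFrame(
    {
        "major": ["Tumor", "Normal", "Normal", "Tumor"],  # input Tumor/Normal labels
        "tier_1": ["Liver", "Liver", "Liver", "Liver"],   # lower-level label (tissue)
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# tier_0 holds the high-order Tumor/Normal classification.
obs["tier_0"] = obs["major"]

# Tiers that are not used are simply filled with NaN.
for col in ["tier_2", "tier_3"]:
    obs[col] = np.nan

# Put the columns in a predictable order.
obs = obs[["major", "tier_0", "tier_1", "tier_2", "tier_3"]]
print(obs)
```

With a real dataset, the same columns would be assigned on adata.obs before the object is passed on.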

Elaborating on your first question: to create your own model with Ikarus, you will need to prepare an anndata object for your dataset. Hallmark correction is advised but not necessary. Then generate new Tumor/Normal gene sets and train the classifier. You can get hints on the workflow in the tutorial's sections on creating gene lists, signatures, and training the model.
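As a rough illustration of what "generating Tumor/Normal gene sets" from an annotated dataset means, the sketch below ranks genes by the difference in mean log-expression between the two classes. This is a simplified stand-in, not the procedure Ikarus uses; the function name top_upregulated, the gene names, and the counts are all hypothetical. For the actual gene-list and training calls, follow the tutorial.

```python
import numpy as np

def top_upregulated(expr, labels, gene_names, target, top_x=2):
    """Return the top_x genes with the largest mean log1p-expression
    difference between `target` cells and all remaining cells."""
    labels = np.asarray(labels)
    logged = np.log1p(np.asarray(expr, dtype=float))
    in_target = labels == target
    diff = logged[in_target].mean(axis=0) - logged[~in_target].mean(axis=0)
    order = np.argsort(diff)[::-1][:top_x]  # indices of largest differences first
    return [gene_names[i] for i in order]

# Toy count matrix: rows are cells, columns are genes (values are made up).
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expr = np.array([
    [50, 1, 5, 0],   # Tumor cell
    [40, 0, 6, 1],   # Tumor cell
    [2, 30, 5, 0],   # Normal cell
    [1, 25, 4, 1],   # Normal cell
])
labels = ["Tumor", "Tumor", "Normal", "Normal"]

tumor_genes = top_upregulated(expr, labels, gene_names, "Tumor", top_x=1)
normal_genes = top_upregulated(expr, labels, gene_names, "Normal", top_x=1)
print(tumor_genes, normal_genes)
```

In practice the labels would come from the tier_0 (or hallmark-corrected) column of adata.obs and the matrix from adata.X.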

melonheader commented 9 months ago

I am closing the issue. Feel free to re-open it if anything else comes up.

BIMSBbioinfo / ikarus

Customize training data #19