BIMSBbioinfo / ikarus

Identifying tumor cells at the single-cell level using machine learning
MIT License

Problems about the feature selection of the model #16

Closed Jonyyqn closed 1 year ago

Jonyyqn commented 1 year ago

Hi, I think this is very cool research. I am trying to understand some details of the model and have run into a few problems. I would be grateful if you could answer the following:

1. Feature selection: I noticed that the resulting ikarus features consist of two gene sets: 162 tumor cell-specific genes and 1313 normal cell-specific genes. The article uses the intersection and cross-validation of multiple single-cell datasets for feature selection. Which single-cell datasets were intersected to obtain the final features? Were they the five datasets used for cross-validation, as described in the paper: "For cross validation, we have used the two lung cancer datasets from Laughney [27] and Lambrechts [28], a colorectal cancer from [29], neuroblastoma dataset from Kildisiute [30], and a head and neck cancer datasets from [31]."

2. Using the ikarus model: I loaded the pretrained ikarus model downloaded from GitHub and tested it on a public single-cell dataset. If fewer than 80% of the model's feature genes are present in the dataset, the model test fails with the error "input data contains NaN". I checked my h5ad file and confirmed that it contains no NaN values. When I prune the model's features so that more than 80% of them appear in the single-cell dataset, the test runs fine. Has anyone had a similar problem and tried to fix it? (Maybe this is an inherent problem with AUCell?)

dohmjan commented 1 year ago

Hi!

to 1): For the final gene sets you mentioned we used the Lee et al. colorectal cancer and the Laughney et al. lung cancer data set. Code-wise you can also follow the gene set creation in the extra section of the tutorial.

to 2): Yes, exactly, that behavior comes from AUCell. In those cases we did something similar to what you described. Take a look at the adapt_signature argument in the classifier class or the check_signatures_overlap function.
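For readers who want to check the overlap before running the model, here is a minimal sketch of the idea in plain Python. It is not the ikarus implementation: `signature_overlap` and `prune_signatures` are hypothetical helpers written for illustration, the gene names are toy values, and the 0.8 threshold is simply the value observed in the question above. With an AnnData object, the dataset gene names would typically come from `adata.var_names`.

```python
from typing import Dict, Iterable, List


def signature_overlap(
    signatures: Dict[str, List[str]], dataset_genes: Iterable[str]
) -> Dict[str, float]:
    """Fraction of each signature's genes that are present in the dataset."""
    present = set(dataset_genes)
    return {
        name: (sum(g in present for g in genes) / len(genes)) if genes else 0.0
        for name, genes in signatures.items()
    }


def prune_signatures(
    signatures: Dict[str, List[str]], dataset_genes: Iterable[str]
) -> Dict[str, List[str]]:
    """Keep only the signature genes that the dataset actually contains."""
    present = set(dataset_genes)
    return {
        name: [g for g in genes if g in present]
        for name, genes in signatures.items()
    }


# Toy data (hypothetical gene names, not the real ikarus signatures).
signatures = {
    "Tumor": ["G1", "G2", "G3", "G4", "G5"],
    "Normal": ["G6", "G7", "G8", "G9", "G10"],
}
dataset_genes = ["G1", "G2", "G3", "G6", "G7", "G8", "G9", "G10", "G11"]

overlap = signature_overlap(signatures, dataset_genes)

# Assumed threshold, taken from the 80% figure reported in the question.
MIN_OVERLAP = 0.8
low_overlap = [name for name, frac in overlap.items() if frac < MIN_OVERLAP]

# Pruned signatures contain only genes found in the dataset.
pruned = prune_signatures(signatures, dataset_genes)

print(overlap)
print("Signatures below threshold:", low_overlap)
print(pruned)
```

Whether pruning is appropriate depends on how many genes are lost; a signature reduced to a handful of genes may no longer score cells reliably.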

Jonyyqn commented 1 year ago

Thank you very much!

Jonyyqn commented 1 year ago

I would also like to ask whether you have tried different scoring methods, and whether the choice affects the performance of the ikarus model?

frenkiboy commented 1 year ago

Yes, it does. We found that different scoring methods perform substantially differently on datasets of various origins, e.g. single cell, bulk, spatial. Unfortunately, we didn't have enough time to carefully benchmark which method suits which type of dataset, or to figure out why this happens.

Jonyyqn commented 1 year ago

OK. I am also curious whether you have tried using gene expression directly, instead of gene set scores, as model input, and whether that changes the performance of ikarus?

frenkiboy commented 1 year ago

We did. We spent one year doing it on raw values. The problem is that the technical and biological confounders between datasets cause a shift in the domain of the learner; that's why we switched to using invariant features (gene set scores). Nowadays the allelic imbalance caused by CNVs seems to be the most accurate classifier, though it is a bit slow to run.
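To illustrate why a gene set score can be more stable than raw values, here is a small sketch. It is an assumption-laden toy, not AUCell and not the scoring used in ikarus: `rank_score` is a hypothetical helper that averages the within-cell ranks of the signature genes, ties are not handled, and the "batch effect" is modeled as a simple scale and offset, which is far milder than real confounders.

```python
from typing import List, Sequence


def rank_score(expression: Sequence[float], signature_idx: List[int]) -> float:
    """Mean normalized rank (0 to 1) of the signature genes within one cell."""
    n = len(expression)
    # Gene indices ordered from lowest to highest expression.
    order = sorted(range(n), key=lambda i: expression[i])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r
    total = sum(rank[i] for i in signature_idx)
    return total / (len(signature_idx) * (n - 1))


def mean_expression(expression: Sequence[float], signature_idx: List[int]) -> float:
    """Plain average of the raw values of the signature genes."""
    return sum(expression[i] for i in signature_idx) / len(signature_idx)


# One toy cell with six genes; genes 1 and 3 form the hypothetical signature.
cell = [0.0, 5.0, 1.0, 8.0, 2.0, 3.0]
signature_idx = [1, 3]

# Toy technical shift: the same cell measured with a different scale and offset.
shifted = [2.5 * x + 1.0 for x in cell]

raw_original = mean_expression(cell, signature_idx)
raw_shifted = mean_expression(shifted, signature_idx)
score_original = rank_score(cell, signature_idx)
score_shifted = rank_score(shifted, signature_idx)

print("raw mean:  ", raw_original, "->", raw_shifted)
print("rank score:", score_original, "->", score_shifted)
```

The raw mean changes with the shift while the rank-based score does not, because any strictly increasing transformation preserves the ordering of genes within a cell. Real dataset differences are not purely monotonic, so this shows only the intuition behind preferring rank-based features.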

Jonyyqn commented 1 year ago

According to your description, I guess training on raw values may lead to more overfitting problems, resulting in poor performance on independent validation sets (batch effect). I agree with you that a classifier based on CNV signatures is the best classifier out there, because this classifier tries to distinguish tumor cells from normal cells at the inferred genetic level. But it may be limited to tumor types where the genetic alterations are more pronounced (or more directly, genetic changes in gene expression are more pronounced).