BIMSBbioinfo / ikarus

Identifying tumor cells at the single-cell level using machine learning
MIT License

I ran into some problems when reproducing results of your article. #15

Closed Long-Wenxin closed 1 year ago

Long-Wenxin commented 2 years ago

Hello,

After reading your article “Identifying tumor cells at the single-cell level using machine learning”, I became very interested in the Ikarus method because of its excellent performance in classifying Tumor and Normal cells. However, when trying to reproduce the results, I ran into the problems described below:

  1. When selecting the gene signature, I used the create_all function in “gene_list.py” uploaded on GitHub, but I don’t know how to set label_upregs_list and label_downregs_list. I tried every setting I could think of, but in each case the overlap between the signature genes and the dataset was smaller than 80%, so AUCell couldn’t return scores.

  2. Your article states that “a user-defined number of top genes is selected (for our analyses, we used the top 300 genes)”, but with top_x set to 300 I can’t get a gene signature of more than 1000 genes. I then tried setting top_x to 1500, but AUCell still couldn’t return scores.

Screenshots of the code and its output are attached below.

[Screenshots attached: 2022-11-19, 7:22:37 PM, 7:23:29 PM, 7:23:22 PM]

melonheader commented 2 years ago

Hello @Long-Wenxin,

It appears to me that you are setting label_upregs/downregs incorrectly. We suggest using the "major_hallmark_corrected" annotation.

For more detailed information, see the gene list creation example at the very bottom of the tutorial.
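
For orientation, a minimal sketch of the kind of call meant here (only label_upregs_list, label_downregs_list, and top_x are named in this thread; the remaining arguments and the exact create_all signature should be taken from the tutorial):

from ikarus import gene_list

# Hypothetical call pattern; check the tutorial for the exact arguments.
# "Tumor"/"Normal" are classes of the "major_hallmark_corrected" annotation.
gene_lists = gene_list.create_all(
    label_upregs_list=["Tumor"],
    label_downregs_list=["Normal"],
    top_x=300,  # number of top genes per signature, as used in the paper
    # ... dataset-specific arguments as shown in the tutorial ...
)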

Long-Wenxin commented 2 years ago

Thank you for your reply. I have solved that problem and created the same gene list as yours.

However, there are problems with two datasets, "livnat" and "ma": their overlap with the gene signature is less than 80%, yet your article shows that both datasets could be trained and tested on.

I wonder how you solved this problem.

dohmjan commented 2 years ago

Hi, we also noticed that for a few datasets. The 80% restriction comes from AUCell (last time I checked, it was not straightforward to change it manually). We implemented a small helper function to circumvent that threshold (see check_signatures_overlap). It can be enabled by setting adapt_signatures to True in the Ikarus class definition. Note: only do this with caution, and preferably check the overlap beforehand (as you did).
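
If you want to check the overlap beforehand, a manual check along these lines works without touching ikarus internals (a sketch; signatures_path and adata are assumed to be your .gmt file and AnnData object):

# Fraction of each signature's genes that is present in the dataset.
# Standard GMT layout: name <tab> description <tab> gene1 <tab> gene2 ...
dataset_genes = set(adata.var_names)
with open(signatures_path) as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        name, genes = fields[0], set(fields[2:])
        overlap = len(genes & dataset_genes) / len(genes)
        print(f"{name}: {overlap:.1%} of signature genes found in the dataset")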

Long-Wenxin commented 2 years ago

Okay, I understand now. Thank you so much for your explanation!

erevkov commented 1 year ago

Sorry to reopen, but I have a related issue, so I thought I would post it here rather than opening a new thread. I am trying to test ikarus on a dataset benchmarked in the original paper (Lambrechts, lung carcinoma). However, I am getting very low prediction accuracy and I was wondering what might be the reason for it.

My setup: I first downloaded the expression and annotation data from the E-MTAB-6149 ArrayExpress repository, normalized the counts using scanpy, and log2(x+1)-transformed the normalized matrix. I then installed ikarus following the provided instructions and downloaded the core model and the signatures from this repository. Just to clarify: I specifically wanted to use a pre-trained version of the model.
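
For reference, the preprocessing described above corresponds roughly to this (a sketch; the target_sum value is just an example and not necessarily what the authors used):

import scanpy as sc

# total-count normalize, then log2(x + 1) transform the normalized matrix
sc.pp.normalize_total(adata, target_sum=1e4)  # target_sum is an assumption
sc.pp.log1p(adata, base=2)                    # log2(x + 1); the base argument needs a recent scanpy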

This is how I then ran ikarus:

# model setup
signatures_path = "/home/ubuntu/sc_mal_dev/data/misc/ikarus_supp/signatures.gmt"
model_path = "/home/ubuntu/sc_mal_dev/data/misc/ikarus_supp/core_model.joblib"
out_dir = "/home/ubuntu/sc_mal_dev/data/misc/ikarus_supp/"

model = classifier.Ikarus(signatures_gmt=signatures_path, out_dir=out_dir, adapt_signatures=True)
model.load_core_model(model_path)

### load the gene expression data into adata, adata is log2(norm_counts+1)-transformed ###

# predictions
preds = model.predict(adata, 'E-MTAB-6149', save=False)

However, when comparing with the provided annotations, I end up with a very low balanced accuracy (around 0.5) and F1 score (<0.1), whereas the balanced accuracy reported in the paper is much higher. Do you see anything wrong with the setup? I understand that there could be a myriad of issues, but I wonder if there is something obvious I am missing.
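
For completeness, this is roughly how the metrics were computed (a sketch; it assumes preds is an array-like of "Tumor"/"Normal" calls returned by model.predict, and the obs column holding the published annotation, here called "annotation", is an assumption):

from sklearn.metrics import balanced_accuracy_score, f1_score

# map the published cell-type annotation to Tumor/Normal labels
y_true = adata.obs["annotation"].apply(
    lambda x: "Tumor" if "cancer" in str(x).lower() else "Normal"
)
print("balanced accuracy:", balanced_accuracy_score(y_true, preds))
print("f1 score:", f1_score(y_true, preds, pos_label="Tumor"))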

dohmjan commented 1 year ago

Hi! Looks good, I think, though it's hard to tell. It might be due to the label names and/or obs names in the adata object/dataset. Could you compare your adata object with this one (~480 MB)? In adata.obs there should be a "raw" column, which should correspond to your labels, and a "tier_0" column containing just Tumor/Normal, which we used for training and testing.
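
Something like this can be used to inspect the object once downloaded (a sketch, assuming it is an .h5ad file; the file name below is a placeholder):

import anndata

ref = anndata.read_h5ad("lambrechts_reference.h5ad")  # placeholder file name
print(ref.obs["raw"].value_counts())     # original cluster labels
print(ref.obs["tier_0"].value_counts())  # Tumor/Normal labels used for training/testing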

erevkov commented 1 year ago

Thank you!

I can confirm that with your adata object I can reproduce your results. However, I get very different results if I use expression values obtained elsewhere, which I think points to differences in preprocessing. Can I ask how you obtained the final gene expression values for this dataset (Lambrechts)? Did you process the raw data yourself?

dohmjan commented 1 year ago

Hi, sorry for not getting back to you earlier! For the Lambrechts dataset we used the "all cells" .loom file from their resource page.

For example, something like the following can be used for loading the loom file. The content should then be the same as in the anndata object I mentioned in my comment above (a bit of restructuring of column and row names may be needed).

import anndata
import scanpy as sc
import pandas as pd
import numpy as np

path_to_loom_all_cells = '... .loom'
adata = anndata.read_loom(path_to_loom_all_cells, validate=False)

# Their labels:
print(adata.obs["ClusterName"].unique())
# > array(['CD4+ T cells ', 'CD8+ T cells ', 'mast cells ',
#    'Langerhans cells ', 'lower quality endothelial cell',
#    'macrophages', 'cancer cells pt 4', ..., 'cancer cells pt 5', ..., 'cancer cells pt 1',
#    ..., 'cancer cells pt 3', ..., 'cancer cells pt 2', ...
#    ],
#    dtype=object)

# From the labels in "ClusterName" we considered "cancer cells pt 1,2,3,4,5" to be cancerous
# and the rest to be normal.

# in-place preprocessing: total-count normalize each cell to 10,000 counts, then log1p-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
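
Following that rule, a Tumor/Normal column can be derived like this (a sketch; the column name "tier_0" simply mirrors the reference adata object):

# all five "cancer cells pt ..." clusters -> Tumor, everything else -> Normal
adata.obs["tier_0"] = np.where(
    adata.obs["ClusterName"].str.startswith("cancer cells"), "Tumor", "Normal"
)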

erevkov commented 1 year ago

This helped! I can confirm that after following this last set of instructions I was able to reproduce your results and get a high balanced accuracy with the pre-trained model. I guess that indicates the issue was with the version of the input data I was using. Anyway, thank you again!