BostonGene / Kassandra

Bostongene cell deconvolution algorithm from RNAseq

Full model training dataset not working #3

Open alvarezprado opened 2 years ago

alvarezprado commented 2 years ago

Hello,

I am currently trying to use Kassandra locally and have followed your tutorial (Python notebook) to train the full model, using "all_models_expr.tsv" (this is actually a comma-separated file, by the way) and "laboratory_data_expressions.tsv" (and their corresponding annotation files) as provided on your webpage.
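
Since the file extension does not match the delimiter here, it may help to look at the first line of each file before choosing `sep` for `pd.read_csv`. A minimal sketch using a simple count-based heuristic; `guess_separator` is a hypothetical helper (not part of Kassandra or pandas), and the header lines are made up:

```python
def guess_separator(first_line, candidates=(",", "\t", ";")):
    """Return the candidate separator that occurs most often in the header line."""
    counts = {sep: first_line.count(sep) for sep in candidates}
    return max(counts, key=counts.get)


# Made-up header lines standing in for the first line of each file
comma_header = "Gene,sample_1,sample_2,sample_3"
tab_header = "Gene\tsample_1\tsample_2\tsample_3"

print(repr(guess_separator(comma_header)))  # ','
print(repr(guess_separator(tab_header)))    # '\t'
```

In practice the first line could be read with `open(path).readline()` and the result passed to `pd.read_csv(path, sep=...)`.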

However, when I execute the steps corresponding to the "Pseudobulk generation" (only 30 data points, as an initial test), I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-c5da6225664a> in <module>
      5               tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
      6               num_av=3, num_points=30)
----> 7 expr, values = mixer.generate('Immune_general')
      8 values

/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate(self, modeled_cell, genes, random_seed)
    122 
    123         if not genes:
--> 124             genes = self.cell_types[modeled_cell].genes
    125 
    126         mixed_cells_expr = pd.DataFrame(np.zeros((len(genes), self.num_points)),

/media/data/Linux/angel/Kassandra-main/core/cell_types.py in __getitem__(self, item)
    111 
    112     def __getitem__(self, item):
--> 113         return self._types_dict[item]
    114 
    115     def __getattr__(self, item):

KeyError: 'Immune_general'

Do you know what might be the problem?

Here is the code I executed to get to that point:

import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import Image
from core.mixer import Mixer
from core.cell_types import CellTypes
from core.model import DeconvolutionModel
from core.plotting import print_cell_matras, cells_p, print_all_cells_in_one
from core.utils import *

cancer_sample_annot = pd.read_csv('data/all_models_annot.tsv', sep=',', index_col=0)
cancer_expr = pd.read_csv('data/all_models_expr.tsv', sep=',', index_col=0)
cells_sample_annot = pd.read_csv('data/laboratory_data_annotation.tsv', sep='\t', index_col=0)
cells_expr = pd.read_csv('data/laboratory_data_expressions.tsv', sep='\t', index_col=0)

# Pseudobulk generation
cell_types = CellTypes.load('configs/full_blood_model.yaml')
mixer = Mixer(cell_types=cell_types,
              cells_expr=cells_expr, cells_annot=cells_sample_annot,
              tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
              num_av=3, num_points=30)
expr, values = mixer.generate('Immune_general')

Thank you!

Edit: training the model with the toyset provided in the example works fine, so I assume this is related to the input files used to train the full model.

alvarezprado commented 2 years ago

I realized that the cell type names in "full_blood_model.yaml" differ from those in the yaml file used in the tutorial: there is no 'Immune_general' category, but there is a 'General_cells' one. I'll try with that one and post here if it works.
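
A mismatch like this can be caught before calling `mixer.generate` by checking the requested name against the names defined in the config. A minimal sketch, assuming the config's cell type names are available as a plain list; `require_cell_type` is a hypothetical helper, and the list below only contains the two names mentioned in this thread:

```python
import difflib


def require_cell_type(name, available):
    """Return `name` if it is a known cell type, else raise KeyError with suggestions."""
    available = list(available)
    if name in available:
        return name
    close = difflib.get_close_matches(name, available, n=3, cutoff=0.3)
    raise KeyError(
        f"{name!r} is not defined; available: {sorted(available)}; closest: {close}"
    )


# Hypothetical stand-in for the cell type names parsed from a config file
config_names = ["General_cells", "Lymphoid_cells"]

print(require_cell_type("General_cells", config_names))
try:
    require_cell_type("Immune_general", config_names)
except KeyError as err:
    print("Lookup failed:", err)
```

This only moves the failure earlier and makes the message more useful; it does not change what the config contains.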

alvarezprado commented 2 years ago

Running the code with "General_cells" produces the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/lib/python3/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Dataset'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-d3180a1511fd> in <module>
      5               tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
      6               num_av=3, num_points=30)
----> 7 expr, values = mixer.generate('General_cells')
      8 values

/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate(self, modeled_cell, genes, random_seed)
    131 
    132         average_cells = {**self.generate_pure_cell_expressions(genes, 1, cells_to_mix),
--> 133                          **self.generate_pure_cell_expressions(genes, self.num_av, [modeled_cell])}
    134         mixed_cells_values = self.dirichlet_mixing(self.num_points, cells_to_mix)
    135 

/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate_pure_cell_expressions(self, genes, num_av, cells_to_mix)
    179             for i in range(num_av):
    180                 if self.rebalance_param is not None:
--> 181                     cells_index = pd.Index(self.rebalance_samples_by_type(self.cells_annot.loc[cells_selection.index],
    182                                                                           k=self.rebalance_param))
    183                 else:

/media/data/Linux/angel/Kassandra-main/core/mixer.py in rebalance_samples_by_type(annot, k)
    257         :return: list of samples
    258         """
--> 259         type_counter = annot['Dataset'].value_counts()
    260 
    261         func = lambda x: x**(1 - k)

/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2904             if self.columns.nlevels > 1:
   2905                 return self._getitem_multilevel(key)
-> 2906             indexer = self.columns.get_loc(key)
   2907             if is_integer(indexer):
   2908                 indexer = [indexer]

/usr/lib/python3/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 'Dataset'

The same happens when I try other, more specific categories, like "Lymphoid_cells". Any hints? Am I using the right input files?

Thanks!
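
The traceback above shows `rebalance_samples_by_type` reading `annot['Dataset']`, so the annotation table passed as `cells_annot` needs a column with that name. A minimal sketch of a check that could be run before building the `Mixer`; `missing_columns` is a hypothetical helper, and treating 'Cell_type' as required is an assumption based on the workaround code later in this thread:

```python
import pandas as pd


def missing_columns(annot, required=("Cell_type", "Dataset")):
    """Return the required column names that `annot` does not have."""
    return [col for col in required if col not in annot.columns]


# Toy annotation table without a 'Dataset' column (values are made up)
annot = pd.DataFrame(
    {"Cell_type": ["B_cells", "T_cells"]},
    index=["sample_1", "sample_2"],
)

print(missing_columns(annot))                        # ['Dataset']
print(missing_columns(annot.assign(Dataset="lab")))  # []
```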

Jojo-LIU-LIU commented 2 years ago

I have the same error

jsangalang commented 1 year ago

Have you guys resolved this somehow?

jsangalang commented 1 year ago

I may have a quick solution to this issue (for those who want to use it).

In core/mixer.py, go to the function "change_subtype_proportions" (i.e., def change_subtype_proportions(self, cell: str, cells_index: pd.Index) -> pd.Index:) and change

specified_subtypes = set(self.proportions.dropna().index).intersection(cell_subtypes)

to

specified_subtypes = list(set(self.proportions.dropna().index).intersection(cell_subtypes))

Save, and restart the kernel.
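
One plausible reason this conversion helps, assuming the result is later used to index a pandas object: as far as I know, recent pandas versions no longer accept a set as an indexer (deprecated in 1.5, an error from 2.0), while a list is fine. A minimal sketch with made-up proportions and subtype names, not taken from the Kassandra code:

```python
import pandas as pd

# Made-up proportions; the NaN entry mimics a subtype with no specified value
proportions = pd.Series(
    {"CD4_T_cells": 0.6, "CD8_T_cells": 0.4, "NK_cells": float("nan")}
)
cell_subtypes = ["CD4_T_cells", "CD8_T_cells", "Tregs"]

# Intersect as a set, then turn the result into a list before indexing.
# sorted() is used here instead of list() only to make the order reproducible.
specified_subtypes = sorted(
    set(proportions.dropna().index).intersection(cell_subtypes)
)
selected = proportions.loc[specified_subtypes]

print(specified_subtypes)  # ['CD4_T_cells', 'CD8_T_cells']
print(selected.tolist())   # [0.6, 0.4]
```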

jsangalang commented 1 year ago

Whoops, my last response was a fix for the model training example. I now have the same "Dataset" KeyError. I think it is an issue with how the files are formatted, or with the code not being able to read the indices provided. The KeyError on "Dataset" appears because the lookup returns an empty matrix, which means the code cannot locate the 'Dataset' column. I'm still investigating the code.

jsangalang commented 1 year ago

I believe the input data files are not formatted properly to be used directly in the model. To work around this, I'm reformatting the input to separate the cancer cells from the normal cells (both annotation and expressions), and to add the laboratory dataset. In the paper, they have 9041 cells in the training data + 15 from the plasma/non-plasma dataset + 348 from their own dataset = 9404 cells in total, which corresponds to the datasets provided.

Here is some simple code to separate the cancer and normal cells for use in model training, following the Jupyter notebook they provide.

import pandas as pd
from core.mixer import Mixer
from core.cell_types import CellTypes

# Import training data
dataset_all_anno = pd.read_csv('training_data/all_models_annot.tsv', sep=',', index_col=0)
dataset_all_expr = pd.read_csv('training_data/all_models_expr.tsv', sep=',', index_col=0)

# Laboratory dataset
lab_dataset_anno = pd.read_csv('training_data/laboratory_data_annotation.tsv', sep='\t', index_col = 0)
lab_dataset_expr = pd.read_csv('training_data/laboratory_data_expressions.tsv', sep='\t', index_col=0)

########################

# Separate into cancer and normal cells

# Cancer
cancer_cells_rows_anno = dataset_all_anno[~dataset_all_anno['Tumor_model_annot'].isna()]
cancer_cells_rows_anno = cancer_cells_rows_anno[cancer_cells_rows_anno['Tumor_model_annot'] == 'cancer_cells']
# .copy() avoids pandas' SettingWithCopyWarning when columns are renamed and added below
cancer_cells_rows_anno = cancer_cells_rows_anno[['Tumor_model_annot', 'Dataset']].copy()
cancer_cells_rows_anno = cancer_cells_rows_anno.rename(columns={'Tumor_model_annot': 'Cell_type'})
cancer_cells_rows_anno['Sample'] = list(cancer_cells_rows_anno.index)
cancer_cells_rows_expr = dataset_all_expr[cancer_cells_rows_anno.index]

# Normal (every annotated sample that is not a cancer cell)
normal_cells_rows_anno = dataset_all_anno[~dataset_all_anno['Tumor_model_annot'].isna()]
normal_cells_rows_anno = normal_cells_rows_anno[normal_cells_rows_anno['Tumor_model_annot'] != 'cancer_cells']
normal_cells_rows_anno = normal_cells_rows_anno[['Tumor_model_annot', 'Dataset']].copy()
normal_cells_rows_anno = normal_cells_rows_anno.rename(columns={'Tumor_model_annot': 'Cell_type'})
normal_cells_rows_anno['Sample'] = list(normal_cells_rows_anno.index)
normal_cells_rows_expr = dataset_all_expr[normal_cells_rows_anno.index]

# CHECK: Count how many cells are not NaN in the tumor model. If n = 8146, then it matches the supplementary tables in Table S1
dataset_all_anno['Tumor_model_annot'].count()

########################

# Then run the code as in the Jupyter notebook.

# Pseudobulk generation (for artificial datasets)
cell_types = CellTypes.load('configs/cell_types.yaml')
mixer = Mixer(cell_types=cell_types,
              cells_expr=normal_cells_rows_expr, 
              cells_annot=normal_cells_rows_anno,
              tumor_expr=cancer_cells_rows_expr, 
              tumor_annot=cancer_cells_rows_anno,
              num_av = 3, 
              num_points = 30) 

# etc. etc ...

jsangalang commented 1 year ago

I believe there may be some missing data. It seems there are no "Dendritic_cells" nor "Granulocytes" in the training dataset they provided. Therefore, there are errors when trying to train the model since no samples are of these cell types.

jsangalang commented 1 year ago

I tried removing these cell types from the "cell_types.yaml" config, and it worked. However, the results may differ because of this; we will continue to experiment. Another option is to add the samples from the sample data set to the model training data set.
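
To see which configured cell types have no samples before training, the names in the config can be compared with the values in the annotation table. A minimal sketch with made-up data; `cell_types_without_samples` is a hypothetical helper, and the 'Cell_type' column name is an assumption taken from the code earlier in this thread. Parent categories that group other types may legitimately have no samples of their own, so the result needs a manual look:

```python
import pandas as pd


def cell_types_without_samples(config_types, annot, column="Cell_type"):
    """Return configured cell types that never occur in the annotation column."""
    present = set(annot[column].dropna().unique())
    return sorted(set(config_types) - present)


# Made-up config names and annotation rows for illustration
config_types = ["B_cells", "T_cells", "Dendritic_cells", "Granulocytes"]
annot = pd.DataFrame(
    {"Cell_type": ["B_cells", "T_cells", "T_cells", None]},
    index=["s1", "s2", "s3", "s4"],
)

print(cell_types_without_samples(config_types, annot))
# ['Dendritic_cells', 'Granulocytes']
```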

yuandyuand13 commented 1 year ago

Hello @jsangalang, I ran into a problem downloading the data. When I clicked the link for laboratory_data_expressions.tsv, it went to the web page "https://science.bostongene.com/undefined/download/laboratory_data_expressions.tsv", and nothing was returned. I don't know how to solve this. I would appreciate it if you could provide me with the following data: all_models_annot.tsv, all_models_expr.tsv, laboratory_data_annotation.tsv, and laboratory_data_expressions.tsv. Thanks!

jsangalang commented 1 year ago

Hi @yuandyuand13 ! They just mentioned that it should be working again! :)

shpakb commented 1 year ago

Hello,

The data that was uploaded to our website is in a slightly different format compared to the one given in the example. It needs a bit of preprocessing before being suitable for model training. I have just updated the "Full model.ipynb" with a few steps illustrating how to prepare this data for training the model on the complete dataset.

Best, Boris