Full model training dataset not working

alvarezprado commented 2 years ago

Hello,

I am currently trying to use Kassandra locally and have followed your tutorial (Python notebook) to train the full model, using "all_models_expr.tsv" (which is actually a CSV file, by the way) and "laboratory_data_expressions.tsv" (and their corresponding annotation files) as provided on your webpage.

However, when I execute the "Pseudobulk generation" steps (with only 30 data points, as an initial test), I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-c5da6225664a> in <module>
      5               tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
      6               num_av=3, num_points=30)
----> 7 expr, values = mixer.generate('Immune_general')
      8 values

/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate(self, modeled_cell, genes, random_seed)
    122 
    123         if not genes:
--> 124             genes = self.cell_types[modeled_cell].genes
    125 
    126         mixed_cells_expr = pd.DataFrame(np.zeros((len(genes), self.num_points)),

/media/data/Linux/angel/Kassandra-main/core/cell_types.py in __getitem__(self, item)
    111 
    112     def __getitem__(self, item):
--> 113         return self._types_dict[item]
    114 
    115     def __getattr__(self, item):

KeyError: 'Immune_general'

Do you know what might be the problem?

Here is the code I executed to get to that point:

import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import Image
from core.mixer import Mixer
from core.cell_types import CellTypes
from core.model import DeconvolutionModel
from core.plotting import print_cell_matras, cells_p, print_all_cells_in_one
from core.utils import *

# The all_models_* files are read as comma-separated despite their .tsv extension
cancer_sample_annot = pd.read_csv('data/all_models_annot.tsv', sep=',', index_col=0)
cancer_expr = pd.read_csv('data/all_models_expr.tsv', sep=',', index_col=0)
cells_sample_annot = pd.read_csv('data/laboratory_data_annotation.tsv', sep='\t', index_col=0)
cells_expr = pd.read_csv('data/laboratory_data_expressions.tsv', sep='\t', index_col=0)

# Pseudobulk generation
cell_types = CellTypes.load('configs/full_blood_model.yaml')
mixer = Mixer(cell_types=cell_types,
              cells_expr=cells_expr, cells_annot=cells_sample_annot,
              tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
              num_av=3, num_points=30)
expr, values = mixer.generate('Immune_general')

Thank you!

Edit: training the model with the toy dataset provided in the example works fine, so I assume the problem is related to the input files used to train the full model.
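
Since one of the ".tsv" files turned out to be comma-separated, one quick sanity check on the input files is to guess each file's separator from its header line before calling pd.read_csv. A minimal sketch, assuming plain-text files with a header row; the helper names guess_separator and separator_of are made up for illustration and are not part of Kassandra, and the check only counts candidate characters, so it is a heuristic:

```python
def guess_separator(header_line, candidates=(",", "\t")):
    """Return the candidate that occurs most often in the header line (heuristic)."""
    return max(candidates, key=header_line.count)


def separator_of(path):
    """Read only the first line of a text file and guess its separator."""
    with open(path, encoding="utf-8") as handle:
        return guess_separator(handle.readline())


# Illustrative header lines, not taken from the real files:
print(repr(guess_separator("Sample,Cell_type,Dataset")))    # ','
print(repr(guess_separator("Sample\tCell_type\tDataset")))  # '\t'
```

The result of separator_of(...) could then be passed as sep= to pd.read_csv for files such as 'data/laboratory_data_annotation.tsv'.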

alvarezprado commented 2 years ago

I realized that "full_blood_model.yaml" defines different cell types from the yaml file used in the tutorial: it has no 'Immune_general' category, but it does have a 'General_cells' category. I'll try that one and post here if it works.
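
A mismatch like this surfaces as a bare KeyError, so a small guard that lists the names actually defined can make it easier to spot. The sketch below is hypothetical (require_cell_type is not a Kassandra function) and works on any mapping or collection of names. Judging by the traceback, the loaded names live in cell_types._types_dict; that is a private attribute, so relying on it is an assumption.

```python
def require_cell_type(name, available):
    """Return `name` if it is defined, else raise a KeyError listing the options."""
    names = sorted(available)
    if name not in names:
        raise KeyError(f"{name!r} is not defined; available: {names}")
    return name


# Stand-in names for illustration; with Kassandra one might pass
# cell_types._types_dict (seen in the traceback, private attribute).
demo_names = {"General_cells": None, "Lymphoid_cells": None}
print(require_cell_type("General_cells", demo_names))  # General_cells
```

Calling require_cell_type('Immune_general', demo_names) would raise a KeyError whose message includes the available names.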

alvarezprado commented 2 years ago

Running the code with "General_cells" produces the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/lib/python3/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Dataset'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-d3180a1511fd> in <module>
      5               tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
      6               num_av=3, num_points=30)
----> 7 expr, values = mixer.generate('General_cells')
      8 values

/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate(self, modeled_cell, genes, random_seed)
    131 
    132         average_cells = {**self.generate_pure_cell_expressions(genes, 1, cells_to_mix),
--> 133                          **self.generate_pure_cell_expressions(genes, self.num_av, [modeled_cell])}
    134         mixed_cells_values = self.dirichlet_mixing(self.num_points, cells_to_mix)
    135 

/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate_pure_cell_expressions(self, genes, num_av, cells_to_mix)
    179             for i in range(num_av):
    180                 if self.rebalance_param is not None:
--> 181                     cells_index = pd.Index(self.rebalance_samples_by_type(self.cells_annot.loc[cells_selection.index],
    182                                                                           k=self.rebalance_param))
    183                 else:

/media/data/Linux/angel/Kassandra-main/core/mixer.py in rebalance_samples_by_type(annot, k)
    257         :return: list of samples
    258         """
--> 259         type_counter = annot['Dataset'].value_counts()
    260 
    261         func = lambda x: x**(1 - k)

/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2904             if self.columns.nlevels > 1:
   2905                 return self._getitem_multilevel(key)
-> 2906             indexer = self.columns.get_loc(key)
   2907             if is_integer(indexer):
   2908                 indexer = [indexer]

/usr/lib/python3/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 'Dataset'

The same happens with other, more specific categories such as "Lymphoid_cells". Any hints? Am I using the right input files?
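
The traceback shows rebalance_samples_by_type reading annot['Dataset'], so the cell annotation table apparently needs a 'Dataset' column. A minimal sketch for checking this before running the mixer; missing_columns is a hypothetical helper, and the example column names are made up rather than taken from the real annotation file:

```python
def missing_columns(columns, required=("Dataset",)):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Made-up column names for illustration:
print(missing_columns(["Cell_type", "Sample"]))             # ['Dataset']
print(missing_columns(["Cell_type", "Sample", "Dataset"]))  # []
```

With the tables loaded above, this would be missing_columns(cells_sample_annot.columns). If 'Dataset' is reported missing and the table has only a single column, the file may have been read with the wrong separator.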

Thanks!