Open alvarezprado opened 2 years ago
I realized that the annotation in the "full_blood_model.yaml" file differs from the yaml file used in the tutorial: it has no 'Immune_general' category, only a 'General_cells' one. I'll try with that one and post here if it works.
Running the code with "General_cells" produces the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/lib/python3/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'Dataset'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-6-d3180a1511fd> in <module>
5 tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
6 num_av=3, num_points=30)
----> 7 expr, values = mixer.generate('General_cells')
8 values
/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate(self, modeled_cell, genes, random_seed)
131
132 average_cells = {**self.generate_pure_cell_expressions(genes, 1, cells_to_mix),
--> 133 **self.generate_pure_cell_expressions(genes, self.num_av, [modeled_cell])}
134 mixed_cells_values = self.dirichlet_mixing(self.num_points, cells_to_mix)
135
/media/data/Linux/angel/Kassandra-main/core/mixer.py in generate_pure_cell_expressions(self, genes, num_av, cells_to_mix)
179 for i in range(num_av):
180 if self.rebalance_param is not None:
--> 181 cells_index = pd.Index(self.rebalance_samples_by_type(self.cells_annot.loc[cells_selection.index],
182 k=self.rebalance_param))
183 else:
/media/data/Linux/angel/Kassandra-main/core/mixer.py in rebalance_samples_by_type(annot, k)
257 :return: list of samples
258 """
--> 259 type_counter = annot['Dataset'].value_counts()
260
261 func = lambda x: x**(1 - k)
/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2904 if self.columns.nlevels > 1:
2905 return self._getitem_multilevel(key)
-> 2906 indexer = self.columns.get_loc(key)
2907 if is_integer(indexer):
2908 indexer = [indexer]
/usr/lib/python3/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 'Dataset'
And the same happens trying other more specific categories, like "Lymphoid_cells". Any hints? Am I using the right input files?
Thanks!
I have the same error
Have you guys resolved this somehow?
I may have a quick solution to this issue (for those who want to use it).
In core/mixer.py, scroll to the function "change_subtype_proportions" (i.e., def change_subtype_proportions(self, cell: str, cells_index: pd.Index) -> pd.Index:), and edit:
specified_subtypes = set(self.proportions.dropna().index).intersection(cell_subtypes)
into
specified_subtypes = list(set(self.proportions.dropna().index).intersection(cell_subtypes))
Save, and restart the kernel.
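For context, here is a minimal sketch of why the list() conversion might matter. The assumption (not confirmed anywhere in this thread) is that specified_subtypes is later used to index a pandas object, and recent pandas versions reject a set as an indexer. The proportions series and the subtype names below are made up for illustration and are not taken from Kassandra.

```python
import pandas as pd

# Hypothetical proportions table; names and values are illustrative only.
proportions = pd.Series(
    {"T_cells": 0.5, "B_cells": 0.3, "NK_cells": float("nan")}
)
cell_subtypes = ["T_cells", "B_cells", "Monocytes"]

# Set intersection, as in the original line of core/mixer.py.
as_set = set(proportions.dropna().index).intersection(cell_subtypes)

# Converting to a list gives pandas an ordered, list-like indexer.
# sorted() is used here (instead of list()) only to make the order deterministic.
specified_subtypes = sorted(as_set)
selected = proportions.loc[specified_subtypes]
print(selected)
```

Whether the order of the subtypes matters downstream in the mixer is not something this sketch checks.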
Whoops, my last response was an edit for the model training example. I now have the same "Dataset" KeyError. I think it is an issue with the formatting of the files, or with the code not being able to read the indices provided. The KeyError 'Dataset' appears because the annotation lookup returns an empty matrix, so the 'Dataset' column cannot be located. I'm still investigating the code.
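To make the failure mode concrete, here is a small sketch of a pre-flight check on the annotation table. It assumes, based only on the traceback above, that the mixer reads a 'Dataset' column from the annotation; the helper missing_columns, the 'Cell_type' requirement, and the toy table are hypothetical and not part of Kassandra.

```python
import pandas as pd


def missing_columns(annot: pd.DataFrame, required=("Cell_type", "Dataset")) -> list:
    """Return the required column names that are absent from `annot`."""
    return [col for col in required if col not in annot.columns]


# Toy annotation without a 'Dataset' column (illustrative values only).
annot = pd.DataFrame({"Cell_type": ["T_cells", "B_cells"]}, index=["s1", "s2"])

absent = missing_columns(annot)
if absent:
    print(f"Annotation is missing columns: {absent}")

# Selecting a missing column is what raises KeyError: 'Dataset' in the traceback.
try:
    annot["Dataset"].value_counts()
    raised = False
except KeyError:
    raised = True
```

Running such a check before constructing the Mixer would surface the problem earlier than the pandas traceback does.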
I believe the input data files are not formatted properly to be used directly in the model. To overcome this, I'm formatting the input to separate the cancer cells from the normal cells (both annotation and expressions), and to add the laboratory dataset. In the paper, they have 9041 cells in the training data + 15 from the plasma/non-plasma dataset + 348 from their own dataset = 9404 cells in total, which corresponds to the datasets provided.
Here's some simple code to separate the cancer and normal cells for use in model training, following the Jupyter notebook they provide.
import pandas as pd

# Import training data (despite the .tsv extension, these two files are comma-separated)
dataset_all_anno = pd.read_csv('training_data/all_models_annot.tsv', sep=',', index_col=0)
dataset_all_expr = pd.read_csv('training_data/all_models_expr.tsv', sep=',', index_col=0)
# Laboratory dataset
lab_dataset_anno = pd.read_csv('training_data/laboratory_data_annotation.tsv', sep='\t', index_col = 0)
lab_dataset_expr = pd.read_csv('training_data/laboratory_data_expressions.tsv', sep='\t', index_col=0)
########################
# Separate into cancer and normal cells
# Cancer
cancer_cells_rows_anno = dataset_all_anno[~dataset_all_anno['Tumor_model_annot'].isna()]
cancer_cells_rows_anno = cancer_cells_rows_anno[cancer_cells_rows_anno['Tumor_model_annot']=='cancer_cells']
cancer_cells_rows_anno = cancer_cells_rows_anno[['Tumor_model_annot', 'Dataset']]
cancer_cells_rows_anno.rename(columns = {'Tumor_model_annot':'Cell_type'}, inplace = True)
cancer_cells_rows_anno['Sample'] = list(cancer_cells_rows_anno.index)
cancer_cells_rows_expr = dataset_all_expr[cancer_cells_rows_anno.index]
# Normal
normal_cells_rows_anno = dataset_all_anno[~dataset_all_anno['Tumor_model_annot'].isna()]
normal_cells_rows_anno = normal_cells_rows_anno[normal_cells_rows_anno['Tumor_model_annot']!='cancer_cells']
normal_cells_rows_anno = normal_cells_rows_anno[['Tumor_model_annot', 'Dataset']]
normal_cells_rows_anno.rename(columns = {'Tumor_model_annot':'Cell_type'}, inplace = True)
normal_cells_rows_anno['Sample'] = list(normal_cells_rows_anno.index)
normal_cells_rows_expr = dataset_all_expr[normal_cells_rows_anno.index]
# CHECK: Count how many cells are not NaN in the tumor model. If n = 8146, then it matches the supplementary tables in Table S1
dataset_all_anno['Tumor_model_annot'].count()
########################
# Then run the code as in the Jupyter notebook.
# Pseudobulk generation (for artificial datasets)
cell_types = CellTypes.load('configs/cell_types.yaml')
mixer = Mixer(cell_types=cell_types,
              cells_expr=normal_cells_rows_expr,
              cells_annot=normal_cells_rows_anno,
              tumor_expr=cancer_cells_rows_expr,
              tumor_annot=cancer_cells_rows_anno,
              num_av=3,
              num_points=30)
# etc. etc ...
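The cancer/normal separation above repeats the same steps twice, so it can be folded into one helper. The following is only a sketch under the same assumptions as the code above (samples are rows of the annotation and columns of the expression table, and the columns are named 'Tumor_model_annot' and 'Dataset'); split_by_tumor_annot and the toy tables are hypothetical.

```python
import pandas as pd


def split_by_tumor_annot(annot: pd.DataFrame, expr: pd.DataFrame,
                         cancer_label: str = "cancer_cells"):
    """Split annotation/expression tables into (cancer, normal) pairs.

    Assumes samples are rows of `annot` and columns of `expr`.
    """
    # Keep only samples that carry a tumor-model annotation.
    labelled = annot.dropna(subset=["Tumor_model_annot"])
    is_cancer = labelled["Tumor_model_annot"] == cancer_label

    def build(rows: pd.DataFrame):
        out = rows[["Tumor_model_annot", "Dataset"]].rename(
            columns={"Tumor_model_annot": "Cell_type"})
        out["Sample"] = out.index
        # Select the matching expression columns for these samples.
        return out, expr[out.index]

    return build(labelled[is_cancer]), build(labelled[~is_cancer])


# Toy data, for illustration only.
annot = pd.DataFrame(
    {"Tumor_model_annot": ["cancer_cells", "T_cells", None],
     "Dataset": ["d1", "d2", "d3"]},
    index=["s1", "s2", "s3"])
expr = pd.DataFrame(
    {"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]},
    index=["GENE_A", "GENE_B"])

(cancer_annot, cancer_expr), (normal_annot, normal_expr) = split_by_tumor_annot(annot, expr)
```

With the real files, the call would presumably be split_by_tumor_annot(dataset_all_anno, dataset_all_expr), and the four results would go to the Mixer as tumor_annot, tumor_expr, cells_annot and cells_expr.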
I believe there may be some missing data. It seems there are neither "Dendritic_cells" nor "Granulocytes" samples in the training dataset they provided, so training the model raises errors because no samples belong to these cell types.
I tried removing these cell types from the "cell_types.yaml" config, and it worked. However, the results may differ because of this, so we will continue to experiment. Another option is to add the samples from the sample data set to the model training data set.
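A quick way to spot such gaps before training is to compare the cell type names the config expects with the ones present in the annotation. The sketch below uses a plain Python list in place of the parsed "cell_types.yaml" (how the config is actually loaded is not shown here), and cell_types_without_samples is a hypothetical helper. It only compares names directly; whether parent categories in the config need their own samples is not something this thread establishes.

```python
import pandas as pd


def cell_types_without_samples(expected, annot: pd.DataFrame,
                               column: str = "Cell_type") -> list:
    """List the expected cell types that have no sample in `annot`."""
    present = set(annot[column].dropna().unique())
    return [name for name in expected if name not in present]


# Illustrative inputs; the expected names would come from the config file.
expected = ["T_cells", "B_cells", "Dendritic_cells", "Granulocytes"]
annot = pd.DataFrame({"Cell_type": ["T_cells", "B_cells", "T_cells"]})

missing = cell_types_without_samples(expected, annot)
print(missing)
```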
Hello @jsangalang, I ran into a big problem downloading the data. When I clicked the link for laboratory_data_expressions.tsv, it went to the web page "https://science.bostongene.com/undefined/download/laboratory_data_expressions.tsv" and nothing was returned. I don't know how to solve this. I would appreciate it if you could provide me with the following data: all_models_annot.tsv, all_models_expr.tsv, laboratory_data_annotation.tsv, and laboratory_data_expressions.tsv. Thanks!
Hi @yuandyuand13 ! They just mentioned that it should be working again! :)
Hello,
The data that was uploaded to our website is in a slightly different format compared to the one given in the example. It needs a bit of preprocessing before being suitable for model training. I have just updated the "Full model.ipynb" with a few steps illustrating how to prepare this data for training the model on the complete dataset.
Best, Boris
Hello,
I am currently trying to use Kassandra locally and have followed your tutorial (Python notebook) to train the full model, using "all_models_expr.tsv" (this is actually a csv file, by the way) and "laboratory_data_expressions.tsv" (and their corresponding annotation files) as provided on your webpage.
However, when I execute the steps corresponding to the "Pseudobulk generation" (only 30 data points, as an initial test), I get the following error:
Do you know what might be the problem?
Here is the code I executed to get to that point:
Thank you!
Edit: training the model with the toyset provided in the example works fine, so I assume this is related to the input files used to train the full model.
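Since one of the ".tsv" files turned out to be comma-separated, a small guard when loading can avoid silently reading the whole table into a single column. This is only a sketch: guess_separator and read_table are hypothetical helpers, and the in-memory files stand in for the real downloads.

```python
import io

import pandas as pd


def guess_separator(first_line: str) -> str:
    """Guess the delimiter of a header line by counting tabs and commas."""
    return "\t" if first_line.count("\t") > first_line.count(",") else ","


def read_table(handle) -> pd.DataFrame:
    """Read a delimited table from an open text handle, guessing the separator."""
    header = handle.readline()
    handle.seek(0)  # rewind so pandas sees the header line too
    return pd.read_csv(handle, sep=guess_separator(header), index_col=0)


# In-memory stand-ins for a comma-separated and a tab-separated file.
comma_file = io.StringIO("gene,s1,s2\nGENE_A,1.0,2.0\n")
tab_file = io.StringIO("gene\ts1\ts2\nGENE_A\t1.0\t2.0\n")

comma_df = read_table(comma_file)
tab_df = read_table(tab_file)
```

For a file on disk, the same helper can be called as read_table(open(path)).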