mixer.generate and model.fit error

Yooooopick commented 1 year ago

Hello, thank you for your hard work on Kassandra. It's a nice and useful tool for estimating cell fractions from bulk RNA-seq data. After running git clone https://github.com/BostonGene/Kassandra/ and then the "Model Training.ipynb" notebook with the example data in the "/data" directory, I get the following error:

expr, values = mixer.generate('General_cells')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/Kassandra/core/mixer.py", line 133, in generate
    **self.generate_pure_cell_expressions(genes, self.num_av, [modeled_cell])}
  File "/root/Kassandra/core/mixer.py", line 189, in generate_pure_cell_expressions
    cells_index = self.change_subtype_proportions(cell=cell,
  File "/root/Kassandra/core/mixer.py", line 288, in change_subtype_proportions
    subtype_proportions = {cell: dict(self.proportions.loc[specified_subtypes])}
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 1091, in __getitem__
    check_dict_or_set_indexers(key)
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 2618, in check_dict_or_set_indexers
    raise TypeError(
TypeError: Passing a set as an indexer is not supported. Use a list instead.

and then,

>>> model.fit(mixer)
============== L1 models ==============
Generating mixes for B_cells model
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/Kassandra/core/model.py", line 78, in fit
    expr, values = mixer.generate(cell, genes=self.cell_types[cell].genes, random_seed=i+1)
  File "/root/Kassandra/core/mixer.py", line 132, in generate
    average_cells = {**self.generate_pure_cell_expressions(genes, 1, cells_to_mix),
  File "/root/Kassandra/core/mixer.py", line 189, in generate_pure_cell_expressions
    cells_index = self.change_subtype_proportions(cell=cell,
  File "/root/Kassandra/core/mixer.py", line 288, in change_subtype_proportions
    subtype_proportions = {cell: dict(self.proportions.loc[specified_subtypes])}
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 1091, in __getitem__
    check_dict_or_set_indexers(key)
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 2618, in check_dict_or_set_indexers
    raise TypeError(
TypeError: Passing a set as an indexer is not supported. Use a list instead.

Do you know what the problem might be? Thank you!

shpakb commented 1 year ago

Hi @Yooooopick,

Some cell type terms are missing from the "Cell_type" column of the cell annotation data frame. Here is some code to check which ones you are missing:

missing_cts = [x for x in cell_types.get_all_subtypes('General_cells') if not x in cells_annot['Cell_type'].unique()]
missing_cts

There shouldn't be any problems if you run "Model Training.ipynb" as is. I just checked it.
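To illustrate what this check reports, here is a minimal, library-free sketch. The `PARENT` mapping and the `all_subtypes` helper are hypothetical stand-ins for the hierarchy in the YAML config and for `cell_types.get_all_subtypes`; they are not Kassandra's actual implementation. It assumes the annotation only contains leaf-level labels, which is one plausible way for parent types to show up as missing.

```python
# Toy hierarchy (hypothetical): child -> parent, loosely modelled on the thread.
PARENT = {
    "Immune_general": "General_cells",
    "Monocytic_cells": "Immune_general",
    "Monocytes": "Monocytic_cells",
    "Macrophages": "Monocytic_cells",
    "B_cells": "Immune_general",
}


def all_subtypes(root, parent=PARENT):
    """Return every descendant of `root` in breadth-first order."""
    found, frontier = [], [root]
    while frontier:
        current = frontier.pop(0)
        children = [c for c, p in parent.items() if p == current]
        found.extend(children)
        frontier.extend(children)
    return found


# Labels that actually occur in the (toy) annotation: leaf types only.
annotated = {"Monocytes", "Macrophages", "B_cells"}

# Same idea as the check above: hierarchy terms with no annotated samples.
missing = [ct for ct in all_subtypes("General_cells") if ct not in annotated]
print(missing)
```

In this toy setup the parent-level terms are the ones reported, because no sample carries them as a label.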

Yooooopick commented 1 year ago

Thank you for your kind reply. I ran the code and got the following result:

['Immune_general', 'Monocytic_cells']

I think 'Immune_general' and 'Monocytic_cells' are upper-level annotations (parents of types such as monocytes and macrophages), so they cannot actually appear in the training data. To rule this out as the cause, I edited the "/config/cell_types.yaml" file, removed the 'Immune_general' and 'Monocytic_cells' entries, and changed the parent_type to "General_cells", even though the cell proportions and related settings will no longer be accurate. The same error appeared again.

In fact, I ran the "Model Training.ipynb" notebook on the example data in the "/data" directory right after cloning the repository, loading the data as shown below, and the error is still there.

cancer_sample_annot = pd.read_csv('data/cancer_samples_annot.tsv.tar.gz', sep='\t', index_col=0)
cancer_expr = pd.read_csv('data/cancer_expr.tsv.tar.gz', sep='\t', index_col=0)
cells_sample_annot = pd.read_csv('data/cells_samples_annot.tsv.tar.gz', sep='\t', index_col=0)
cells_expr = pd.read_csv('data/cells_expr.tsv.tar.gz', sep='\t', index_col=0)

I will appreciate your recommended solution.

shpakb commented 1 year ago

Here is some code to patch annotation for missing cell types:

# adding missing cell types
cell_types = CellTypes.load('configs/full_blood_model.yaml')
missing_cts = [x for x in cell_types.get_all_subtypes('General_cells') if x not in cells_annot['Cell_type'].unique()]

for ct in missing_cts:
    subtypes = cell_types.get_direct_subtypes(ct)
    # copy the subtype rows so the assignments below do not modify cells_annot / cells_expr
    annot = cells_annot.loc[cells_annot['Cell_type'].isin(subtypes)].copy()
    expr = cells_expr[annot.index].copy()
    # relabel the copies with the parent type and give them unique sample names
    annot['Cell_type'] = ct
    annot.index = annot.index + f'_{ct}'
    annot['Dataset'] = annot.index
    expr.columns = expr.columns + f'_{ct}'
    cells_expr = pd.concat([cells_expr, expr], axis=1)
    cells_annot = pd.concat([cells_annot, annot])

This duplicates the annotation and expression data of all direct subtypes of "Monocytic_cells" (Monocytes, Macrophages) and "Immune_general" (T, B, NK, mono, etc.) under the parent label. You can then proceed with training using the original config.
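As a small illustration of what this duplication does to the two tables, here is a sketch on toy data. The sample names, gene names, and values are made up, and only one parent type is handled; it assumes the same column layout ("Cell_type", "Dataset") as in the thread.

```python
import pandas as pd

# Toy stand-ins (hypothetical data) for cells_annot and cells_expr.
cells_annot = pd.DataFrame(
    {"Cell_type": ["Monocytes", "Macrophages", "B_cells"],
     "Dataset": ["ds1", "ds2", "ds3"]},
    index=["s1", "s2", "s3"],
)
cells_expr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["GENE_A", "GENE_B"],
    columns=["s1", "s2", "s3"],
)

parent, children = "Monocytic_cells", ["Monocytes", "Macrophages"]

# Copy the child rows/columns so the originals are not modified in place.
annot = cells_annot.loc[cells_annot["Cell_type"].isin(children)].copy()
expr = cells_expr[annot.index].copy()

# Relabel the copies with the parent type and give them unique sample names.
annot["Cell_type"] = parent
annot.index = annot.index + f"_{parent}"
annot["Dataset"] = annot.index
expr.columns = expr.columns + f"_{parent}"

cells_annot = pd.concat([cells_annot, annot])
cells_expr = pd.concat([cells_expr, expr], axis=1)

print(cells_annot)
print(cells_expr.shape)
```

The child samples remain in place under their own labels; the parent type simply gains renamed copies of them, so the sample count grows by the number of duplicated columns.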

jsangalang commented 1 year ago

Hello, I still believe there is an error in the training dataset provided on the website. I tried the additional patch you included, but the training dataset still contains no samples of the "Dendritic_cells" cell type listed in cell_types.yaml. Commenting out Dendritic_cells in cell_types.yaml made it work. Please address this issue in your dataset annotation.

model_column = 'Tumor_model_annot'
samples = data_annot.loc[data_annot['Tumor_model_annot'] == 'cancer_cells'].index
cancer_expr = data_expr[samples]
cancer_annot = data_annot.loc[samples]
cancer_annot['Tumor_type'] = cancer_annot['Dataset']
cancer_annot = cancer_annot[['Tumor_type', 'Dataset']]

samples = data_annot.loc[~data_annot[model_column].isna()].index
cells_expr = data_expr[samples]

cells_annot = data_annot.loc[samples]
cells_annot = cells_annot[[model_column, 'Dataset']]
cells_annot.columns = ['Cell_type', 'Dataset']
cells_annot = pd.concat([lab_annot, cells_annot])
cells_annot.loc[cells_annot['Dataset'].isna(), 'Dataset'] = cells_annot.loc[cells_annot['Dataset'].isna()].index
cells_expr = pd.concat([lab_expr, cells_expr], axis=1)

# to make sure that there is no repeated samples
samples = sorted(list(set(cells_annot.index).intersection(set(cells_expr.columns))))
cells_expr = cells_expr[samples]
cells_annot = cells_annot.loc[samples]

print(cells_expr.shape, cells_annot.shape)
print(cancer_expr.shape, cancer_annot.shape)

##############################

# Load cell types model

cell_types = CellTypes.load('configs/cell_types.yaml')
missing_cts = [x for x in cell_types.get_all_subtypes('General_cells') if not x in cells_annot['Cell_type'].unique()]
missing_cts

for ct in missing_cts:
    subtypes = cell_types.get_direct_subtypes(ct)
    # copy the subtype rows so the assignments below do not modify cells_annot / cells_expr
    annot = cells_annot.loc[cells_annot['Cell_type'].isin(subtypes)].copy()
    expr = cells_expr[annot.index].copy()
    annot['Cell_type'] = ct
    annot.index = annot.index + f'_{ct}'
    annot['Dataset'] = annot.index
    expr.columns = expr.columns + f'_{ct}'
    cells_expr = pd.concat([cells_expr, expr], axis=1)
    cells_annot = pd.concat([cells_annot, annot])

# to make sure that there is no repeated samples
samples = sorted(list(set(cells_annot.index).intersection(set(cells_expr.columns))))
cells_expr = cells_expr[samples]
cells_annot = cells_annot.loc[samples]
print(cells_expr.shape, cells_annot.shape)

Liuyw1217 commented 1 day ago

I encountered the same problem, and I still get an error after applying the patch.

# data
cancer_sample_annot = pd.read_csv('data/cancer_samples_annot.tsv.tar.gz', sep='\t', index_col=0)
cancer_expr = pd.read_csv('data/cancer_expr.tsv.tar.gz', sep='\t', index_col=0)
cells_sample_annot = pd.read_csv('data/cells_samples_annot.tsv.tar.gz', sep='\t', index_col=0)
cells_expr = pd.read_csv('data/cells_expr.tsv.tar.gz', sep='\t', index_col=0)

# adding missing cell types
cell_types = CellTypes.load('configs/cell_types.yaml')
missing_cts = [x for x in cell_types.get_all_subtypes('General_cells') if not x in cells_sample_annot['Cell_type'].unique()]

for ct in missing_cts:
    subtypes = cell_types.get_direct_subtypes(ct)
    annot = cells_sample_annot.loc[cells_sample_annot['Cell_type'].isin(subtypes)]
    annot.index
    expr = cells_expr[annot.index]
    annot['Cell_type'] = ct
    annot.index = annot.index + f'_{ct}'
    annot['Dataset'] = annot.index
    expr.columns = expr.columns + f'_{ct}'
    cells_expr = pd.concat([cells_expr, expr], axis=1)
    cells_sample_annot = pd.concat([cells_sample_annot, annot])

# to make sure that there is no repeated samples
samples = sorted(list(set(cells_sample_annot.index).intersection(set(cells_expr.columns))))
cells_expr = cells_expr[samples]
cells_sample_annot = cells_sample_annot.loc[samples]
print(cells_expr.shape, cells_sample_annot.shape)

mixer = Mixer(cell_types=cell_types,
              cells_expr=cells_expr, cells_annot=cells_sample_annot,
              tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
              num_av=3, num_points=30)
expr, values = mixer.generate('Dendritic_cells')

#############################################################################
TypeError                                 Traceback (most recent call last)
Cell In[47], line 5
      1 mixer = Mixer(cell_types=cell_types,
      2               cells_expr=cells_expr, cells_annot=cells_sample_annot,
      3               tumor_expr=cancer_expr, tumor_annot=cancer_sample_annot,
      4               num_av=3, num_points=30)
----> 5 expr, values = mixer.generate('Dendritic_cells')
      6 values

File ~/project/project6-PCOS/data/Kassandra_model_training/Kassandra/core/mixer.py:132, in Mixer.generate(self, modeled_cell, genes, random_seed)
    126 mixed_cells_expr = pd.DataFrame(np.zeros((len(genes), self.num_points)),
    127                                 index=genes,
    128                                 columns=range(self.num_points), dtype=float)
    130 cells_to_mix = self.get_cells_to_mix(modeled_cell)
--> 132 average_cells = {**self.generate_pure_cell_expressions(genes, 1, cells_to_mix),
    133                  **self.generate_pure_cell_expressions(genes, self.num_av, [modeled_cell])}
    134 mixed_cells_values = self.dirichlet_mixing(self.num_points, cells_to_mix)
    136 for cell in mixed_cells_values.index:

File ~/project/project6-PCOS/data/Kassandra_model_training/Kassandra/core/mixer.py:189, in Mixer.generate_pure_cell_expressions(self, genes, num_av, cells_to_mix)
    187 specified_subtypes = set(self.proportions.dropna().index).intersection(cell_subtypes)
    188 if len(specified_subtypes) > 1:
--> 189     cells_index = self.change_subtype_proportions(cell=cell,
    190                                                   cells_index=cells_index)
...
   2783 raise TypeError(
   2784     "Passing a dict as an indexer is not supported. Use a list instead."
   2785 )

TypeError: Passing a set as an indexer is not supported. Use a list instead.
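For reference, the TypeError itself can be reproduced outside Kassandra. The tracebacks in this thread show `specified_subtypes` being built as a set and then passed to `self.proportions.loc[...]`; pandas 2.0 and later reject a set as an indexer, while older releases accepted it. The sketch below uses a made-up `proportions` Series as a stand-in for `Mixer.proportions`. Whether converting the set to a list inside core/mixer.py (or installing an older pandas) resolves the issue for the full pipeline is an assumption that has not been verified here.

```python
import pandas as pd

# Hypothetical stand-in for Mixer.proportions (values are made up).
proportions = pd.Series({"Monocytes": 0.7, "Macrophages": 0.3, "B_cells": 0.5})

# A set, as built by set(...).intersection(...) in the traceback.
specified_subtypes = {"Monocytes", "Macrophages"}

try:
    proportions.loc[specified_subtypes]
    set_outcome = "accepted (older pandas)"
except TypeError as err:
    set_outcome = f"rejected: {err}"
print(pd.__version__, set_outcome)

# A list works on either version; sorting keeps the order deterministic.
subset = proportions.loc[sorted(specified_subtypes)]
print(dict(subset))
```

Applied to the line shown in the traceback, the analogous change would be indexing with `list(specified_subtypes)` instead of the set itself.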
