SydneyBioX / BIDCell

Biologically-informed deep learning for cell segmentation of subcellular spatial transcriptomics data

KeyError related to UnassignedCodeword #26

Closed sbudoff closed 3 hours ago

sbudoff commented 4 hours ago

Hi,

Thank you for this segmentation tool.

I am encountering an issue where transcripts are not filtered appropriately. It first manifests in the pre-annotation phase when I run BIDCell on 10x Xenium data.

Specifically, after make_cell_gene_mat completes successfully, the following error is printed:

Number of cells 23323
Number of splits for multiprocessing: 20
Done
Number of cells: 23323
Traceback (most recent call last):
  File "/home/sam/BIDCell/example_s2r2.py", line 7, in <module>
    model.run_pipeline()
  File "/home/sam/BIDCell/bidcell/BIDCellModel.py", line 40, in run_pipeline
    self.preprocess()
  File "/home/sam/BIDCell/bidcell/BIDCellModel.py", line 65, in preprocess
    preannotate(self.config)
  File "/home/sam/BIDCell/bidcell/processing/preannotate.py", line 98, in preannotate
    df_ref = df_ref_orig[genes_cells + ct_columns]
  File "/home/sam/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 4108, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/sam/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/sam/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['UnassignedCodeword_0003', 'UnassignedCodeword_0005', 'UnassignedCodeword_0007', 'UnassignedCodeword_0010', 'UnassignedCodeword_0011', 'UnassignedCodeword_0017', 'UnassignedCodeword_0018', 'UnassignedCodeword_0021', 'UnassignedCodeword_0022', 'UnassignedCodeword_0026', 'UnassignedCodeword_0032', 'UnassignedCodeword_0038', 'UnassignedCodeword_0041', 'UnassignedCodeword_0043', 'UnassignedCodeword_0044', 'UnassignedCodeword_0048', 'UnassignedCodeword_0050', 'UnassignedCodeword_0052', 'UnassignedCodeword_0059', 'UnassignedCodeword_0063', 'UnassignedCodeword_0064', 'UnassignedCodeword_0066', 'UnassignedCodeword_0067', 'UnassignedCodeword_0068', 'UnassignedCodeword_0070', 'UnassignedCodeword_0071', 'UnassignedCodeword_0080', 'UnassignedCodeword_0082', 'UnassignedCodeword_0085', 'UnassignedCodeword_0086', 'UnassignedCodeword_0089', 'UnassignedCodeword_0095', 'UnassignedCodeword_0096', 'UnassignedCodeword_0097', 'UnassignedCodeword_0100',
'UnassignedCodeword_0103', 'UnassignedCodeword_0106', 'UnassignedCodeword_0109', 'UnassignedCodeword_0112', 'UnassignedCodeword_0116', 'UnassignedCodeword_0118', 'UnassignedCodeword_0122', 'UnassignedCodeword_0125', 'UnassignedCodeword_0128', 'UnassignedCodeword_0129', 'UnassignedCodeword_0138', 'UnassignedCodeword_0139', 'UnassignedCodeword_0144', 'UnassignedCodeword_0150', 'UnassignedCodeword_0152', 'UnassignedCodeword_0153', 'UnassignedCodeword_0155', 'UnassignedCodeword_0161', 'UnassignedCodeword_0163', 'UnassignedCodeword_0167', 'UnassignedCodeword_0168', 'UnassignedCodeword_0174', 'UnassignedCodeword_0185', 'UnassignedCodeword_0191', 'UnassignedCodeword_0192', 'UnassignedCodeword_0194', 'UnassignedCodeword_0195', 'UnassignedCodeword_0196', 'UnassignedCodeword_0198', 'UnassignedCodeword_0205', 'UnassignedCodeword_0217', 'UnassignedCodeword_0218', 'UnassignedCodeword_0233', 'UnassignedCodeword_0235', 'UnassignedCodeword_0241', 'UnassignedCodeword_0245', 'UnassignedCodeword_0246', 'UnassignedCodeword_0247', 'UnassignedCodeword_0253', 'UnassignedCodeword_0258', 'UnassignedCodeword_0261', 'UnassignedCodeword_0270', 'UnassignedCodeword_0273', 'UnassignedCodeword_0275', 'UnassignedCodeword_0279', 'UnassignedCodeword_0400', 'UnassignedCodeword_0401', 'UnassignedCodeword_0402', 'UnassignedCodeword_0403', 'UnassignedCodeword_0404', 'UnassignedCodeword_0405', 'UnassignedCodeword_0406', 'UnassignedCodeword_0407', 'UnassignedCodeword_0408', 'UnassignedCodeword_0409', 'UnassignedCodeword_0410', 'UnassignedCodeword_0411', 'UnassignedCodeword_0412', 'UnassignedCodeword_0413', 'UnassignedCodeword_0414', 'UnassignedCodeword_0415', 'UnassignedCodeword_0416', 'UnassignedCodeword_0417', 'UnassignedCodeword_0418', 'UnassignedCodeword_0419', 'UnassignedCodeword_0420', 'UnassignedCodeword_0421', 'UnassignedCodeword_0422', 'UnassignedCodeword_0423', 'UnassignedCodeword_0424', 'UnassignedCodeword_0425', 'UnassignedCodeword_0426', 'UnassignedCodeword_0427', 'UnassignedCodeword_0428', 
'UnassignedCodeword_0429', 'UnassignedCodeword_0430', 'UnassignedCodeword_0431', 'UnassignedCodeword_0432', 'UnassignedCodeword_0433', 'UnassignedCodeword_0434', 'UnassignedCodeword_0435', 'UnassignedCodeword_0436', 'UnassignedCodeword_0437', 'UnassignedCodeword_0438', 'UnassignedCodeword_0439', 'UnassignedCodeword_0440', 'UnassignedCodeword_0441', 'UnassignedCodeword_0442', 'UnassignedCodeword_0443', 'UnassignedCodeword_0444', 'UnassignedCodeword_0445', 'UnassignedCodeword_0446', 'UnassignedCodeword_0447', 'UnassignedCodeword_0448', 'UnassignedCodeword_0449', 'UnassignedCodeword_0450', 'UnassignedCodeword_0451', 'UnassignedCodeword_0452', 'UnassignedCodeword_0453', 'UnassignedCodeword_0454', 'UnassignedCodeword_0455', 'UnassignedCodeword_0456', 'UnassignedCodeword_0457', 'UnassignedCodeword_0458', 'UnassignedCodeword_0459', 'UnassignedCodeword_0460', 'UnassignedCodeword_0461', 'UnassignedCodeword_0462', 'UnassignedCodeword_0463', 'UnassignedCodeword_0464', 'UnassignedCodeword_0465', 'UnassignedCodeword_0466', 'UnassignedCodeword_0467', 'UnassignedCodeword_0468', 'UnassignedCodeword_0469', 'UnassignedCodeword_0470', 'UnassignedCodeword_0471', 'UnassignedCodeword_0472', 'UnassignedCodeword_0473', 'UnassignedCodeword_0474', 'UnassignedCodeword_0475', 'UnassignedCodeword_0476', 'UnassignedCodeword_0477', 'UnassignedCodeword_0478', 'UnassignedCodeword_0479', 'UnassignedCodeword_0480', 'UnassignedCodeword_0481', 'UnassignedCodeword_0482', 'UnassignedCodeword_0483', 'UnassignedCodeword_0484', 'UnassignedCodeword_0485', 'UnassignedCodeword_0486', 'UnassignedCodeword_0487', 'UnassignedCodeword_0488', 'UnassignedCodeword_0489', 'UnassignedCodeword_0490', 'UnassignedCodeword_0491', 'UnassignedCodeword_0492', 'UnassignedCodeword_0493', 'UnassignedCodeword_0494', 'UnassignedCodeword_0495', 'UnassignedCodeword_0496', 'UnassignedCodeword_0497', 'UnassignedCodeword_0498', 'UnassignedCodeword_0499'] not in index"
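
For readers hitting the same KeyError, here is a minimal sketch of how it arises and how to list the offending names before the selection fails. The small tables are made-up stand-ins, `missing_columns` is a hypothetical helper, and the contents of `ct_columns` are an assumption. Only the variable names `df_ref_orig`, `genes_cells`, and `ct_columns` come from the traceback.

```python
import pandas as pd

# Made-up stand-in for the reference table (genes plus cell-type columns).
df_ref_orig = pd.DataFrame(
    {
        "GeneA": [0.1, 0.4],
        "GeneB": [0.3, 0.2],
        "ct_idx": [0, 1],
        "cell_type": ["T", "B"],  # assumed column name, for illustration only
    }
)

# Made-up gene list taken from the expression matrix; one name is not in the reference.
genes_cells = ["GeneA", "GeneB", "UnassignedCodeword_0003"]
ct_columns = ["ct_idx", "cell_type"]


def missing_columns(df, wanted):
    """Return the requested column names that are absent from df, in order."""
    present = set(df.columns)
    return [name for name in wanted if name not in present]


missing = missing_columns(df_ref_orig, genes_cells + ct_columns)
print("Missing from reference:", missing)

# Selecting columns with a list that contains absent labels raises KeyError.
try:
    df_ref_orig[genes_cells + ct_columns]
except KeyError as err:
    print("KeyError:", err)
```

Running a check like this before the selection shows whether the mismatch is limited to UnassignedCodeword entries or whether real genes are missing from the reference as well.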

I am able to force preannotate.py to complete by adding a regex filter right after the dataframe is loaded and overwriting the expression file with the result:

# Cell expressions - order of gene names (columns) will be in same order as all_gene_names.txt
df_cells = pd.read_csv(os.path.join(expr_dir, config.files.fp_expr), index_col=0)

df_cells = df_cells.filter(regex=r'^(?!UnassignedCodeword_).*$')
df_cells.to_csv(os.path.join(expr_dir, config.files.fp_expr))
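
As a self-contained illustration of that workaround, the sketch below drops the UnassignedCodeword columns from a small made-up expression table in two equivalent ways. `drop_unassigned` is a hypothetical helper, not part of BIDCell, and the gene names are placeholders.

```python
import pandas as pd


def drop_unassigned(df, prefix="UnassignedCodeword_"):
    """Return df without the columns whose names start with prefix."""
    keep = [col for col in df.columns if not str(col).startswith(prefix)]
    return df.loc[:, keep]


# Made-up cell-by-gene matrix, indexed by cell ID.
df_cells = pd.DataFrame(
    {"GeneA": [3, 0], "UnassignedCodeword_0003": [1, 0], "GeneB": [2, 5]},
    index=[1, 2],
)

# Option 1: explicit prefix check.
filtered = drop_unassigned(df_cells)

# Option 2: the negative-lookahead regex from the post, applied to column labels.
filtered_regex = df_cells.filter(regex=r"^(?!UnassignedCodeword_).*$")

print(list(filtered.columns))
print(list(filtered_regex.columns))
```

Both options keep the remaining columns in their original order, which matters if later steps rely on the column order matching a saved gene list.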

This does not resolve the underlying issue: later, during training, the script breaks again, which I assume is due to the same UnassignedCodewords. I previously confirmed that your example_small.py ran without issue on my computer, so I am using a modified version of your params_smallexample.yaml. Specifically, I changed all file entries to point to my data and references, adjusted the elongated cells to match my reference sets, and added "UnassignedCodeword" to the list of transcripts_to_filter.

Given this, how do you suggest I proceed?

Thank you, Sam

sbudoff commented 3 hours ago

Update: the bug in train.py was actually unrelated to the above issue, which I resolved by rerunning each step of the pipeline individually with a rewritten YAML file. Instead, the training bug came from line 297 of dataset_input.py. I can force it to run by subtracting one from the ct_nucleus index, but I am worried this may have adverse effects later. Relevantly, my dataset contains 134 unique cell types, and the observation that led me to this hack was that ct_nucleus evaluated to index 134 just before train.py stalled. Please let me know what you think I should do to avoid this hack in general.

ct_nucleus = int(self.nuclei_types_idx[self.nuclei_types_ids.index(c_id)])-1

sbudoff commented 3 hours ago

Final update: this hack worked because I had incorrectly indexed the ct_idx column from 1, since I made that CSV in R!
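
With 134 cell types, valid zero-based indices run from 0 to 133, so an index of 134 points to a one-based column. A minimal sketch of fixing this at the source rather than patching dataset_input.py follows. It assumes the indices are integers that should be contiguous, and `to_zero_based` is a hypothetical helper, not part of BIDCell.

```python
import pandas as pd


def to_zero_based(ct_idx):
    """Shift integer labels so the smallest is 0 and check they are contiguous."""
    values = pd.Series(ct_idx).astype(int)
    shifted = values - values.min()
    n_types = shifted.nunique()
    # For n_types contiguous labels starting at 0, the largest must be n_types - 1.
    if shifted.max() != n_types - 1:
        raise ValueError("cell-type indices are not contiguous")
    return shifted


# Made-up ct_idx column as exported from R, where indexing starts at 1.
r_style = pd.Series([1, 2, 3, 3, 1])
fixed = to_zero_based(r_style)
print(fixed.tolist())  # [0, 1, 2, 2, 0]
```

Applying a conversion like this once, when writing the reference CSV, avoids the `-1` edit in the training code.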