JCVenterInstitute / NSForest

A machine learning method for the discovery of the minimum marker gene combinations for cell type identification from single-cell RNA sequencing
MIT License

KeyError 0 when running on AnnData object created from Seurat #4

Closed achamess closed 3 months ago

achamess commented 3 years ago

I converted a Seurat object to AnnData using SeuratDisk. It seemed to work.

These are the features of the resulting object:

AnnData object with n_obs × n_vars = 17912 × 3000
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'status', 'shared_assignment', 'assignment', 'axis', 'log10GenesPerUMI', 'mitoRatio', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.0.4', 'SCT_snn_res.0.6', 'SCT_snn_res.0.8', 'SCT_snn_res.1', 'SCT_snn_res.1.4', 'SCT_snn_res.2', 'seurat_clusters', 'S.Score', 'G2M.Score', 'Phase'
    var: 'features'
    uns: 'neighbors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances'

The cluster assignment I want to use is 'SCT_snn_res.0.8'. I changed this to dtype category and then changed the function call to NS_Forest to reflect that I want to use this column.

When I run adata_markers = NS_Forest(adata)

It starts up, but I get this error:

22
0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: '0'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-7-6cd156db933d> in <module>
----> 1 adata_markers = NS_Forest(adata)

<ipython-input-4-5528acf1c438> in NS_Forest(adata, clusterLabelcolumnHeader, rfTrees, Median_Expression_Level, Genes_to_testing, betaValue)
    202 
    203         #Rerank according to expression level and binary score
--> 204         Positive_RankedList_Complete = negativeOut(RankedList, column, medianValues, Median_Expression_Level)
    205         print(Positive_RankedList_Complete)
    206 

<ipython-input-4-5528acf1c438> in negativeOut(x, column, medianValues, Median_Expression_Level)
     48         Positive_RankedList_Complete = []
     49         for i in x:
---> 50             if medianValues.loc[column, i] > Median_Expression_Level:
     51                 print(i)
     52                 print(medianValues.loc[column, i])

/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    887                     # AttributeError for IntervalTree get_value
    888                     return self.obj._get_value(*key, takeable=self._takeable)
--> 889             return self._getitem_tuple(key)
    890         else:
    891             # we by definition only have the 0th axis

/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1058     def _getitem_tuple(self, tup: Tuple):
   1059         with suppress(IndexingError):
-> 1060             return self._getitem_lowerdim(tup)
   1061 
   1062         # no multi-index, so validate all of the indexers

/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
    805                 # We don't need to check for tuples here because those are
    806                 #  caught by the _is_nested_tuple_indexer check above.
--> 807                 section = self._getitem_axis(key, axis=i)
    808 
    809                 # We should never have a scalar section here, because

/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1122         # fall thru to straight lookup
   1123         self._validate_key(key, axis)
-> 1124         return self._get_label(key, axis=axis)
   1125 
   1126     def _get_slice_axis(self, slice_obj: slice, axis: int):

/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py in _get_label(self, label, axis)
   1071     def _get_label(self, label, axis: int):
   1072         # GH#5667 this will fail if the label is not present in the axis.
-> 1073         return self.obj.xs(label, axis=axis)
   1074 
   1075     def _handle_lowerdim_multi_index_axis0(self, tup: Tuple):

/opt/conda/lib/python3.8/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
   3737                 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3738         else:
-> 3739             loc = index.get_loc(key)
   3740 
   3741             if isinstance(loc, np.ndarray):

/opt/conda/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: '0'

BAevermann commented 3 years ago

Thanks for the feedback. In previous versions I appended "cluster_" to the number to avoid these Python type problems, but I thought the current version was OK with numeric cluster labels. What is the dtype of the category you are currently using?

In a test analysis that I have up, the cluster assignment vector (louvain in this case) is:

Name: louvain, Length: 1846, dtype: category
Categories (6, object): ['0', '1', '2', '3', '4', '5']
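
For reference, a minimal sketch of the "cluster_" prefixing mentioned above, assuming the cluster column starts out numeric. The `obs` table and cell names here are made up for illustration, and nothing in this thread confirms that this resolves the error in NS_Forest:

```python
import pandas as pd

# Hypothetical stand-in for adata.obs with numeric cluster assignments.
obs = pd.DataFrame(
    {"SCT_snn_res.0.8": [15, 15, 8, 23]},
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

col = "SCT_snn_res.0.8"

# Prefix each label so it is unambiguously a string, then store it as a
# categorical column.
obs[col] = ("cluster_" + obs[col].astype(str)).astype("category")

# The categories are now strings such as 'cluster_15' rather than numbers.
labels = list(obs[col].cat.categories)
print(obs[col].dtype, labels)
```

With labels like these, an intermediate table that is written out and read back cannot silently turn the cluster names into integers.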

achamess commented 3 years ago

Thanks for the response, and sorry for my late reply. Initially the cluster label was numeric, but I forced it to a category and I still get the error:

D1_GTGGAGATCTGCTTAT    15
D1_GCACGGTCACTCAGAT    15
D1_TATACCTGTCTTACTT    15
D1_TATATCCAGAGCATCG    15
D1_CGTGATAAGGTATCTC    15
                       ..
V2_CTTTCGGAGCTCGACC    18
V2_AGAGCCCAGGAAGTAG     8
V2_AACGAAACAATAAGGT    23
V2_AAGTACCTCGCATTAG     8
V2_GTTCGCTGTTGGGACA     8
Name: L1_Round4, Length: 17906, dtype: category
Categories (26, object): ['0', '1', '2', '3', ..., '22', '23', '24', '25']

Gene233 commented 2 years ago

I get the same error with my data, and I don't know what is wrong with it.
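
One reading of the traceback: it passes through Int64Engine._check_type before raising KeyError: '0', which suggests that the string label '0' is being looked up in an integer index. The sketch below reproduces that kind of mismatch with plain pandas. The `median_values` table is a hypothetical stand-in, not the actual NS_Forest internals, and the index conversion at the end is an assumption about one way to align the types:

```python
import pandas as pd

# Hypothetical table: rows are clusters (integer index), columns are genes.
median_values = pd.DataFrame(
    {"GeneA": [0.0, 2.5], "GeneB": [1.2, 0.0]},
    index=[0, 1],
)

cluster = "0"  # string label, as in a categorical with object categories

# Looking up a string label in an integer index raises KeyError.
try:
    median_values.loc[cluster, "GeneA"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Converting the index to strings makes the label types agree.
median_values.index = median_values.index.astype(str)
value = median_values.loc[cluster, "GeneA"]
print(lookup_failed, value)
```

Checking the dtype of both the cluster column and the index of any per-cluster summary table is a quick way to see whether this is what is happening.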

yunzhang813 commented 3 months ago

Thanks for the ticket. The code was refactored in v4.0.