dpeerlab / Palantir

Single cell trajectory detection
https://palantir.readthedocs.io
GNU General Public License v2.0

KeyError with new v1.3.1: output from determine_multiscale_space #124

Closed zktuong closed 9 months ago

zktuong commented 9 months ago

Hi,

I'm trying to run a simple chunk like so:

import pandas as pd
import palantir

pca_projections = pd.DataFrame(pb_adata.obsm["X_pca"], index=pb_adata.obs_names)
dm_res = palantir.utils.run_diffusion_maps(pca_projections, n_components=5)
ms_data = palantir.utils.determine_multiscale_space(dm_res)

pr_res = palantir.core.run_palantir(
    ms_data,
    pb_adata.obs_names[rootcell],
    num_waypoints=500,
    terminal_states=terminal_states.index,
)

but it's triggering an error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/uqztuong/Library/CloudStorage/OneDrive-TheUniversityofQueensland/Documents/GitHub/dandelion/docs/notebooks/8-pseudobulk-trajectory.ipynb Cell 20 line 1
     14 dm_res = palantir.utils.run_diffusion_maps(pca_projections, n_components=5)
     15 ms_data = palantir.utils.determine_multiscale_space(dm_res)
---> 17 pr_res = palantir.core.run_palantir(
     18     ms_data,
     19     pb_adata.obs_names[rootcell],
     20     num_waypoints=500,
     21     terminal_states=terminal_states.index,
     22 )
     24 pr_res.branch_probs.columns = terminal_states[pr_res.branch_probs.columns]

File ~/Library/CloudStorage/OneDrive-TheUniversityofQueensland/Documents/GitHub/Palantir/src/palantir/core.py:129, in run_palantir(data, early_cell, terminal_states, knn, num_waypoints, n_jobs, scale_components, use_early_cell_as_start, max_iterations, eigvec_key, pseudo_time_key, entropy_key, fate_prob_key, save_as_df, waypoints_key, seed)
    125 # ################################################
    126 # Determine the boundary cell closest to user defined early cell
    127 dm_boundaries = pd.Index(set(data_df.idxmax()).union(data_df.idxmin()))
    128 dists = pairwise_distances(
--> 129     data_df.loc[dm_boundaries, :], data_df.loc[early_cell, :].values.reshape(1, -1)
    130 )
    131 start_cell = pd.Series(np.ravel(dists), index=dm_boundaries).idxmin()
    132 if use_early_cell_as_start:

File /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/pandas/core/indexing.py:1067, in _LocationIndexer.__getitem__(self, key)
   1065     if self._is_scalar_access(key):
   1066         return self.obj._get_value(*key, takeable=self._takeable)
-> 1067     return self._getitem_tuple(key)
   1068 else:
   1069     # we by definition only have the 0th axis
   1070     axis = self.axis or 0

File /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/pandas/core/indexing.py:1247, in _LocIndexer._getitem_tuple(self, tup)
   1245 with suppress(IndexingError):
   1246     tup = self._expand_ellipsis(tup)
-> 1247     return self._getitem_lowerdim(tup)
   1249 # no multi-index, so validate all of the indexers
   1250 tup = self._validate_tuple_indexer(tup)

File /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/pandas/core/indexing.py:967, in _LocationIndexer._getitem_lowerdim(self, tup)
    963 for i, key in enumerate(tup):
    964     if is_label_like(key):
    965         # We don't need to check for tuples here because those are
    966         #  caught by the _is_nested_tuple_indexer check above.
--> 967         section = self._getitem_axis(key, axis=i)
    969         # We should never have a scalar section here, because
    970         #  _getitem_lowerdim is only called after a check for
    971         #  is_scalar_access, which that would be.
    972         if section.ndim == self.ndim:
    973             # we're in the middle of slicing through a MultiIndex
    974             # revise the key wrt to `section` by inserting an _NS

File /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/pandas/core/indexing.py:1312, in _LocIndexer._getitem_axis(self, key, axis)
   1310 # fall thru to straight lookup
   1311 self._validate_key(key, axis)
-> 1312 return self._get_label(key, axis=axis)

File /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/pandas/core/indexing.py:1260, in _LocIndexer._get_label(self, label, axis)
   1258 def _get_label(self, label, axis: int):
   1259     # GH#5567 this will fail if the label is not present in the axis.
-> 1260     return self.obj.xs(label, axis=axis)

File /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/pandas/core/generic.py:4056, in NDFrame.xs(self, key, axis, level, drop_level)
   4054             new_index = index[loc]
   4055 else:
-> 4056     loc = index.get_loc(key)
   4058     if isinstance(loc, np.ndarray):
   4059         if loc.dtype == np.bool_:

File /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/pandas/core/indexes/range.py:395, in RangeIndex.get_loc(self, key, method, tolerance)
    393             raise KeyError(key) from err
    394     self._check_indexing_error(key)
--> 395     raise KeyError(key)
    396 return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: '710'

My pb_adata.obs_names are ['0', '1', '2', ... '1360'].

To work around this, I had to do:

pca_projections = pd.DataFrame(pb_adata.obsm["X_pca"], index=pb_adata.obs_names)
dm_res = palantir.utils.run_diffusion_maps(pca_projections, n_components=5)
ms_data = palantir.utils.determine_multiscale_space(dm_res)
ms_data.index = ms_data.index.astype(str)

pr_res = palantir.core.run_palantir(
    ms_data,
    pb_adata.obs_names[rootcell],
    num_waypoints=500,
    terminal_states=terminal_states.index,
)

This isn't an issue in v1.3.0 but is occurring for me in v1.3.1.

I see that you've made some changes to run_diffusion_maps recently; can you think of why this is happening?

katosh commented 9 months ago

Hi @zktuong,

Thank you for your detailed report. I've reviewed the changes between v1.3.0 and v1.3.1 in palantir.core (Comparison Link) and, surprisingly, found no modifications that affect your case. The failing call data_df.loc[early_cell, :] is also unchanged from v1.3.0 (Commit Reference).

Likewise, palantir.utils.determine_multiscale_space appears unchanged for your use case.

For debugging, could you check whether the index data type gets converted somewhere along the way, perhaps to a categorical type? This could be caused by different versions of Scanpy or pandas.
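For example, something like this (a rough check reusing pca_projections and dm_res from your snippet):

# compare the index type at each step; a mismatch here would explain the KeyError
print(type(pca_projections.index), pca_projections.index.dtype)
print(type(dm_res["EigenVectors"].index), dm_res["EigenVectors"].index.dtype)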

As an immediate remedy, Palantir v1.3.1 accepts AnnData objects directly, so you don't need to create the DataFrames manually. Here's an example; adjust parameters as needed.
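Roughly like this (a minimal sketch that mirrors your parameters; it assumes rootcell and terminal_states are defined as in your snippet and that the functions operate on the AnnData in place under the default keys):

import palantir

# run the whole pipeline directly on the AnnData object
palantir.utils.run_diffusion_maps(pb_adata, n_components=5)
palantir.utils.determine_multiscale_space(pb_adata)
pr_res = palantir.core.run_palantir(
    pb_adata,
    pb_adata.obs_names[rootcell],
    num_waypoints=500,
    terminal_states=terminal_states.index,
)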

Looking forward to your insights and whether this mitigates your issue.

zktuong commented 9 months ago

Hi there, thanks for the prompt response!

Your solution works!

However, for completeness: the source of the issue is that dm_res["EigenVectors"].index returns a RangeIndex instead of an Index in v1.3.1.

I tested it on v1.2.0, 1.3.0, and 1.3.1 over here: https://github.com/zktuong/troubleshooting_palantir/ (look at cell 8 in the three notebooks).

I suppose this is taken care of within anndata, but if a user doesn't want to use anndata and just wants to use a pandas DataFrame, it will cause this issue.
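To illustrate (a standalone pandas example, not Palantir code), looking up a string label on a RangeIndex raises exactly this kind of KeyError, while it works once the index holds strings:

import pandas as pd

df = pd.DataFrame({"x": [0.1, 0.2, 0.3]})      # default RangeIndex: 0, 1, 2
df_str = df.set_axis(["0", "1", "2"], axis=0)  # string labels, like obs_names

df_str.loc["1", :]  # works
df.loc["1", :]      # raises KeyError: '1'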

zktuong commented 9 months ago

So, looking at the code, it would be here:

https://github.com/dpeerlab/Palantir/blob/580ac87b12383c8356b644ec9673ba983408afc1/src/palantir/utils.py#L397-L398

I'm not sure how to adjust it without breaking anything, though...
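Maybe something along these lines (a hypothetical sketch; the names are illustrative, not the actual variables in utils.py), i.e. keeping the input DataFrame's index when the eigenvectors get wrapped:

import pandas as pd

def wrap_eigendecomposition(V, D, data_index):
    # keep the cell labels from the input data instead of the default RangeIndex
    eig_vecs = pd.DataFrame(V, index=data_index)
    eig_vals = pd.Series(D)
    return {"EigenVectors": eig_vecs, "EigenValues": eig_vals}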

katosh commented 9 months ago

Thanks for the insightful analysis! A refactor inadvertently dropped the lines that set the index; this is fixed in this commit. To test the hotfix, run:

pip install 'git+https://github.com/settylab/Palantir'

katosh commented 9 months ago

Please feel free to report any feedback, or reopen the issue if it persists despite the patch!

zktuong commented 9 months ago

Thanks for the swift update! It works now!