LouisFaure / scFates

a scalable python suite for tree inference and advanced pseudotime analysis from scRNAseq data.
https://scfates.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
50 stars 2 forks source link

Error when `scf.tl.dendrogram()` #30

Closed Fantasque68 closed 7 months ago

Fantasque68 commented 8 months ago

Thanks to @LouisFaure for this great pseudotime analysis package. It has helped me a lot.

However, in recent days, I encountered an error I haven't met before, when I performing scf.tl.dendrogram() on my own dataset. In the past few month, this function could perform normally.

The error message shows ValueError: no such vertex, plus the start node I selected. Such as ValueError: no such vertex: '10'".

What's that mean? And how can I solve that?

By the way, all other functions before perform scf.tl.dendrogram() normally in my code.

Thanks!

The anndata object

AnnData object with n_obs × n_vars = 5536 × 21556 obs: 'type', 'name', 'L1', 'L2', 'L3', 'L4', 'anno', 't', 'seg', 'edge', 't_sd', 'milestones' uns: 'L4_colors', '_scvi_manager_uuid', '_scvi_uuid', 'log1p', 'neighbors', 'umap', 'anno_colors', 'draw_graph', 'graph', 'ppt', 'pseudotime_list', 'milestones_colors', 'seg_colors', 'type_colors' obsm: 'X_scVI', 'X_umap', '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs', 'X_umap2d', 'X_draw_graph_fa', 'X_R' layers: 'counts', 'scVI_normalized' obsp: 'connectivities', 'distances'

The codes are pasted here.

# explor sigma
sig=scf.tl.explore_sigma(NK2,
                         Nodes=50,
                         nsteps=10,
                         use_rep="X_umap",
                         seed=42,plot=True)

# draw ForceAtlas2 embedding using 2 first PCs as initial positions
NK2.obsm["X_umap2d"]=NK2.obsm["X_umap"][:,:2]
sc.tl.draw_graph(NK2,init_pos='X_umap2d')

scf.tl.tree(NK2,
            method="ppt",
            Nodes=100,
            use_rep="X_umap",
            device="cpu",
            seed=42,
            ppt_lambda=500,
            ppt_sigma=1,
            ppt_nsteps=200)

fig, ax = plt.subplots(1, 1, figsize = (10, 10))
scf.pl.graph(NK2, ax = ax,
             alpha_nodes=0.3,
             basis = "umap")

# clean branch
scf.tl.cleanup(NK2,
               leaves = [96])
fig, ax = plt.subplots(1,1, figsize = (10, 10))
scf.pl.graph(NK2, ax = ax,
             alpha_nodes=0.3,
             basis = "umap")

# Start node selection
scf.tl.root(NK2, 10)
scf.tl.pseudotime(NK2,n_map=100,n_jobs=20,seed=42)
scf.pl.trajectory(NK2)
sc.pl.draw_graph(NK2,
                 color=["seg","milestones", "anno"])

scf.tl.rename_milestones(NK2,
                         new={"67": "proNK2",
                              "66": "NK2-1",
                              "15": "NK2-0"})
# change the color of the root milestone for better visualisations
NK2.uns["milestones_colors"][2]="#17bece"

fig, ax = plt.subplots(1,1, figsize = (5, 5))
sc.pl.draw_graph(NK2,color=["milestones"],
                 ax = ax,
                 size = 20)

# Dendrogram
scf.tl.dendrogram(NK2)
scf.pl.dendrogram(NK2,color="seg")

And the error message.

{
    "name": "ValueError",
    "message": "no such vertex: '10'",
    "stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 15
      9 sc.pl.draw_graph(NK2,color=[\"milestones\"],
     10                  ax = ax,
     11                  size = 20)
     13 # Dendrogram
     14 # .......
---> 15 scf.tl.dendrogram(NK2)
     16 scf.pl.dendrogram(NK2,color=\"seg\")
     18 fig, ax = plt.subplots(1, 1, figsize = (7, 7))

File ~/anaconda3/envs/scrnaseq/lib/python3.8/site-packages/scFates/tools/dendrogram.py:67, in dendrogram(adata, crowdedness, n_jobs)
     65 img = igraph.Graph(directed=True)
     66 img.add_vertices(vals.astype(str))
---> 67 img.add_edges(edges)
     69 img.vs[\"label\"] = keys
     71 pos = hierarchy_pos(img.to_networkx(), vert_gap=0.1)

File ~/anaconda3/envs/scrnaseq/lib/python3.8/site-packages/igraph/basic.py:42, in _add_edges(graph, es, attributes)
     32 \"\"\"Adds some edges to the graph.
     33 
     34 @param es: the list of edges to be added. Every edge is represented
   (...)
     39   edges.
     40 \"\"\"
     41 eid = graph.ecount()
---> 42 res = GraphBase.add_edges(graph, es)
     43 n = graph.ecount() - eid
     44 if (attributes is not None) and (n > 0):

ValueError: no such vertex: '10'"
}
LouisFaure commented 8 months ago

This is intriguing, the tree seems to be missing node number "10", are you also seeing this on the tutorials? igraph recently got a new version, but my own tests passed without issue. If the tutorials runs well, then send me a reduced version of your dataset (with one gene only, the important elements are in uns, obs and obsm) so I can have a quick look at it

Fantasque68 commented 8 months ago

Your timely reply is greatly appreciated.

  1. Firstly, I have run the tutorials again, all codes run smoothly without any error. scf.tl.dendrogram() also performed normally. Besides, the same workflow worked well for another dataset of mine serveral days ago.

  2. My igraph version is 0.10.8, and I haven't updated it since its installation in last year.

  3. I have mailed the onedrive link to you.

Thanks!

LouisFaure commented 8 months ago

It looks like running scf.tl.pseudotime with multiple mappings lead to a loss of the progenitor segment (using a single mapping recovers it). This is likely because that segment is quite short, with the few cells composing it being too close from the two other segments. This leads them to be assigned to these in many mappings.

This indicates an instability of the learned tree, I would recommend to explore other more stable topologies.

In the future, I might implement an exception that would prevent the scf.tl.pseudotime if it leads to an affected tree structure.

Fantasque68 commented 8 months ago

Thanks for your help, but I don't quite understand what "scf.tl.pseudotime with multiple mappings" means, and how to using a single mapping?

  • Here is the tree before cleaning. image
  • Here is the tree after cleaning. I chose branch == 96 to cut. image
LouisFaure commented 8 months ago

Pseudotime is calculated as assigning to each cell to its closest principal point/node. As described in the methods, when with multiple mappings, pseudotime assignment is performed n times by selecting the closest node randomly using the values from R soft assignment matrix as probabilities.

To summarize over all mapping, the mean pseudotime is calculated: This can lead to pseudotime values being higher or lower to the assigned segment. This happens especially when ppt_sigma is high, as this lead to higher R node soft assignment to a broader region around a given cells. In that case cells from progenitor branch would get often assigned to the two other branches due to proximity.

Running over one mapping is as simple as setting the parameter n_map=1, but the fact that this is the solution shows that the tree lacks robustness in the first place.

Fantasque68 commented 7 months ago

Pseudotime is calculated as assigning to each cell to its closest principal point/node. As described in the methods, when with multiple mappings, pseudotime assignment is performed n times by selecting the closest node randomly using the values from R soft assignment matrix as probabilities.

To summarize over all mapping, the mean pseudotime is calculated: This can lead to pseudotime values being higher or lower to the assigned segment. This happens especially when ppt_sigma is high, as this lead to higher R node soft assignment to a broader region around a given cells. In that case cells from progenitor branch would get often assigned to the two other branches due to proximity.

Running over one mapping is as simple as setting the parameter n_map=1, but the fact that this is the solution shows that the tree lacks robustness in the first place.

Thanks for your patience and time, I figured it out now. The issue will be closed.