ShobiStassen / VIA

trajectory inference
https://pyvia.readthedocs.io/en/latest/
MIT License
78 stars, 20 forks

subPARC graph weight calculation #14

Closed GreenGilad closed 2 years ago

GreenGilad commented 2 years ago

Hi again,

I am going through the code of run_subPARC and notice that:

Is there a specific reason why we would want these values to be different? (It changes the weight values quite a bit.)

ShobiStassen commented 2 years ago

Hi, the graph made in the make_csrmatrix_noselfloop() function gives the CSR matrix (csr_array_locallypruned), where we use distance + 0.01. This graph is used for clustering only and not for the random-walk functions. The clusters derived from this stage are then used as the labels in the vertex graph constructed later (from ig_full_graph).

We create a second graph for the random-walk (pseudotime) steps, built with the distance + 0.05 factor. This is the full graph (csr_full_graph in CSR form, ig_full_graph in igraph form), and it is used for the random-walk-related functions. The vertex clustergraph is made from ig_full_graph, with the vertex labels given by the clusters found when clustering the csr_array_locallypruned graph. As such, there are two pruning parameters that the user can set: one for the clustergraph that is used for the random-walk functions, and one for the graph that is input to the Leiden clustering method. It is also fine to do your own clustering and use those clusters to make the vertex clustergraph. I tested this for the paper with k-means clustering to show that one doesn't have to use the PARC or Leiden clustering framework for VIA, but I have not yet coded this in for the end user (although one can simply pass in a variable that overwrites the Leiden cluster labels with your own cluster labels, as these are mainly used when setting up the vertex clustergraph used for the random walks and the trajectory).
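To make the relationship between the cell-level graph and the vertex clustergraph concrete, here is a minimal sketch of collapsing a weighted cell graph into a cluster graph using an arbitrary label vector. This is not VIA's code: `contract_to_cluster_graph` is a hypothetical helper, the toy edges and labels are made up, and summing the weights of parallel edges is an assumption about how they might be combined.

```python
from collections import defaultdict

def contract_to_cluster_graph(edges, labels):
    """Collapse a cell-level weighted graph into a cluster-level graph.

    edges  : iterable of (i, j, weight) for an undirected cell graph
    labels : labels[i] is the cluster label of cell i (from any clustering)
    Returns {(a, b): summed weight} with a < b; intra-cluster edges are dropped.
    """
    cluster_edges = defaultdict(float)
    for i, j, w in edges:
        a, b = labels[i], labels[j]
        if a == b:
            continue  # edges inside one cluster do not link cluster vertices
        key = (a, b) if a < b else (b, a)
        cluster_edges[key] += w  # assumption: parallel edges are summed
    return dict(cluster_edges)

# Toy cell graph; the weights stand in for the random-walk graph's weights.
edges = [(0, 1, 5.0), (1, 2, 2.0), (2, 3, 4.0), (0, 2, 1.0), (3, 4, 3.0)]

# Labels may come from Leiden, k-means, or any other clustering.
labels = [0, 0, 1, 1, 2]

cluster_graph = contract_to_cluster_graph(edges, labels)
print(cluster_graph)
```

The point of the sketch is that the labels and the edge weights are independent inputs: swapping in a different label vector changes the cluster graph without touching the underlying cell graph.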

In the case of the vertex cluster graph used for the random walks, if the edge weights are very high (as happens when the distance between two nodes is very small), they disproportionately dominate the random-walk simulations and prevent certain allowable paths from occurring within the span of the 1000 or so MCMC simulations that are run. If there are some very weak edges, which may be either spuriously weak (meaning they should actually be stronger) or truly weak (in which case we want them to remain weaker than the strong edges), the simulations will never reach the nodes at the ends of those edges because the higher-weight edges are too dominant. We therefore truncate the allowable range of values so that the difference between strong and weak edges does not overly distort the potential pathways that can be found in a fairly "small" number of simulations. One might be able to lower this value a bit more, but across the various datasets we tested, the small additive value of 0.05 for the random-walk graphs worked reasonably well.

This graph is saved and then reused for the fine-grained iteration of VIA (when is_coarse == False), should you decide to feed the first coarse run of VIA into a second, finer-grained one. The rationale for running a coarse and then a fine-grained VIA is that the terminal states detected in the coarse run are usually adequate, whereas a very fine-grained run can give better resolution of the pseudotime and pathway probabilities but can also suggest too many terminal states.
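The effect of the additive offset on the random walks can be illustrated with a small sketch. It assumes the edge weight has the form 1 / (distance + offset), which is consistent with the description above (small distances give large weights) but is not copied from the VIA source; the function names and the toy distances are hypothetical.

```python
import random

def edge_weight(distance, offset):
    # Assumed form: inverse distance with an additive offset, so the
    # largest possible weight is 1 / offset.
    return 1.0 / (distance + offset)

def step_probabilities(distances, offset):
    """Transition probabilities out of one node, proportional to edge weight."""
    weights = [edge_weight(d, offset) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

def count_weak_steps(distances, offset, n_sims=1000, seed=0):
    """Count how often a one-step walk picks the last (weakest) edge."""
    rng = random.Random(seed)
    probs = step_probabilities(distances, offset)
    picks = rng.choices(range(len(distances)), weights=probs, k=n_sims)
    return sum(1 for p in picks if p == len(distances) - 1)

# One node with a very close neighbour and a distant one.
distances = [0.001, 1.0]
for offset in (0.01, 0.05):
    p_weak = step_probabilities(distances, offset)[-1]
    hits = count_weak_steps(distances, offset)
    print(f"offset={offset}: max weight={1 / offset:.0f}, "
          f"P(weak edge)={p_weak:.4f}, weak steps in 1000 sims={hits}")
```

Under this assumed form, the larger offset caps the strongest possible weight at 20 instead of 100, and the probability of stepping along the weak edge in this toy case rises from about 1.1% to about 4.6%. With only around 1000 simulations, that difference decides whether paths through weak edges are sampled often enough to show up at all.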