Error running VIA on generic disconnected datasets

kpandey008 commented 2 years ago

Hi,

I am trying to run VIA on a generic disconnected dataset using the following code as outlined in the README file:

v0 = via.VIA(
    preprocessed_data.obsm['X_pca'],
    preprocessed_data.obs['gt_clusters'],
    jac_std_global=0.15,
    dist_std_local=1,
    knn=30,
    too_big_factor=0.3,
    root_user=start_ids,
    dataset='',
    random_seed=0,
    resolution_parameter=resolution,
    preserve_disconnected=True
)
v0.run_VIA()

However I am getting the following stacktrace:

input data has shape 1966 (samples) x 10 (features)
time is Sun Nov  7 07:49:15 2021
commencing global pruning
Share of edges kept after Global Pruning 50.81 %
number of components in the original full graph 2
for downstream visualization purposes we are also constructing a low knn-graph 
size neighbor array in low-KNN in pca-space for visualization (1966, 4)
commencing community detection
time is Sun Nov  7 07:49:16 2021
12  clusters before handling small/big
There are 0 clusters that are too big
 : global cluster graph pruning level 0.15
number of components before pruning 2
number of connected componnents after reconnecting  2
percentage links trimmed from local pruning relative to start 0.0
percentage links trimmed from global pruning relative to start 37.5
there are  2 components in the graph
root user [877, 1717]
start computing lazy-teleporting Expected Hitting Times
/usr/local/lib/python3.7/dist-packages/scipy/sparse/_index.py:82: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_intXint(row, col, x.flat[0])
ended all multiprocesses, will retrieve and reshape
closeness  shortlist [0, 4, 6]
betweeness shortlist [0, 4]
out degree shortlist [0, 3, 5, 6]
terminal clus in this component [0, 6]
final terminal clus [0, 9]
From root 4  to Terminal state 0 is found 500  times.
From root 4  to Terminal state 6 is found 500  times.
/usr/local/lib/python3.7/dist-packages/pyVIA/core.py:2901: RuntimeWarning: invalid value encountered in true_divide
  bp_array = bp_array / bp_array.sum(axis=1)[:, None]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-175-01a58e604d64> in <module>()
      1 v0 = via.VIA(preprocessed_data.obsm['X_pca'], preprocessed_data.obs['gt_clusters'], jac_std_global=0.15, dist_std_local=1, knn=30,
      2              too_big_factor=0.3, root_user=start_ids, dataset='', random_seed=0, resolution_parameter=resolution)  # *.4 root=1,
----> 3 v0.run_VIA()

2 frames
/usr/local/lib/python3.7/dist-packages/pyVIA/core.py in run_VIA(self)
   3362         # Query dataset, k - number of closest elements (returns 2 numpy arrays)
   3363 
-> 3364         self.run_subPARC()
   3365         run_time = time.time() - time_start_total
   3366         print('time elapsed {:.1f} seconds'.format(run_time))

/usr/local/lib/python3.7/dist-packages/pyVIA/core.py in run_subPARC(self)
   2707                                                                                                         sc_labels_subi,
   2708                                                                                                         root_generic,
-> 2709                                                                                                         sc_truelabels_subi)
   2710 
   2711             self.root.append(root_i)

/usr/local/lib/python3.7/dist-packages/pyVIA/core.py in find_root_bcell(self, graph_dense, PARC_labels_leiden, root_user, true_labels)
   1987 
   1988             graph_node_label.append(str(majority_truth) + 'c' + str(cluster_i))
-> 1989         root = PARC_labels_leiden[root_user]
   1990         return graph_node_label, majority_truth_labels, deg_list, root
   1991 

IndexError: list index out of range

FWIW, the same code works fine for multifurcating datasets so I'm not sure if I'm missing something. Anyhelp would be appreciated!

ShobiStassen commented 2 years ago

Hi, Let me have a look in a couple of days, but it seems there is some issue with the root index (I'm guessing this is your own data and not the sample data, right?) My first guess would be that the index for the second root is is not aligned with the index numbering it has in the in the second component. Can you try by just setting the root to [1,1]? Then we would know if that is where the problem stems from? In the example 1b for disconnected toy, each root is designated to a particular component and this means labelling the true_label similar to how the Toy data labels are formatted for the disconnected toy data and then letting the data parameter be specified as "Toy4"- but before we go down that path (as I need to have a closer look myself) lets see what happens when you arbitrarily set the root to [1,1].

ShobiStassen commented 2 years ago

hi, i think the issue is now fixed and you can provide the indices of the cells corresponding to the roots of each component. you can clone the latest version on github or pip install pyVIA (version 0.1.19). also, if you want you can test it out based on the code in test_pyVIA.py by modifying the function:

´´´ def run_generic_discon(foldername ="/home/shobi/Trajectory/Datasets/Toy4/"):

df_counts = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000.csv",
                        delimiter=",")
df_ids = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000_ids_with_truetime.csv", delimiter=",")
df_ids['cell_id_num'] = [int(s[1::]) for s in df_ids['cell_id']]
df_counts = df_counts.drop('Unnamed: 0', 1)
df_ids = df_ids.sort_values(by=['cell_id_num'])
df_ids = df_ids.reset_index(drop=True)
true_label = df_ids['group_id']
true_label =['a' for i in true_label] #testing dummy true_label and overwriting the real true_labels
true_time = df_ids['true_time']
adata_counts = sc.AnnData(df_counts, obs=df_ids)
sc.tl.pca(adata_counts, svd_solver='arpack', n_comps=100)
via.via_wrapper_disconnected(adata_counts, true_label, embedding=adata_counts.obsm['X_pca'][:, 0:2], root=[23, 902],
                         preserve_disconnected=True, knn=10, ncomps=30, cluster_graph_pruning_std=1, random_seed=41)

´´´

kpandey008 commented 2 years ago

@ShobiStassen The issue is resolved with the new release. Thank you for looking into it! Closing this issue

ShobiStassen / VIA

Error running VIA on generic disconnected datasets #9