ShobiStassen / VIA

trajectory inference
https://pyvia.readthedocs.io/en/latest/
MIT License
78 stars 20 forks source link

run_VIA() error index 125 is out of bounds for axis 0 with size 2 #8

Closed tianlt closed 2 years ago

tianlt commented 2 years ago

Hi,

The run_VIA() function returns the IndexError: index 125 is out of bounds for axis 0 with size 2. I do twice run_VIA() according to the instruction. The first time is fine but the second time such error is reported. The code is run for the second time is;

tsi_list = via.get_loc_terminal_states(v_1, adata.obsm['X_pca'][:, 0:2]) v_2 = via.VIA(adata.obsm['X_pca'][:, 0:2], true_label_1, jac_std_global=0.15, dist_std_local=1, knn=5, too_big_factor=0.05, super_cluster_labels=v_1.labels, super_node_degree_list=v_1.node_degree_list, super_terminal_cells=tsi_list, root_user=root, is_coarse=False, full_neighbor_array=v_1.full_neighbor_array, full_distance_array=v_1.full_distance_array, ig_full_graph=v_1.ig_full_graph, csr_array_locally_pruned=v_1.csr_array_locally_pruned, x_lazy=0.95, alpha_teleport=0.99, preserve_disconnected=True, super_terminal_clusters=v_1.terminal_clusters, random_seed=1, name='v_2')

v_2.run_VIA()

The output including the error is; input data has shape 1000 (samples) x 2 (features) time is Sun Nov 7 14:06:54 2021 commencing global pruning Share of edges kept after Global Pruning 48.47 % commencing community detection time is Sun Nov 7 14:06:54 2021 397 clusters before handling small/big There are 0 clusters that are too big humanCD34 : global cluster graph pruning level 0.15 number of components before pruning 2 number of connected componnents after reconnecting 2 percentage links trimmed from local pruning relative to start 0.0 percentage links trimmed from global pruning relative to start 48.7 there are 2 components in the graph root user [0] start computing lazy-teleporting Expected Hitting Times ended all multiprocesses, will retrieve and reshape super_terminal_clusters [129, 130, 3, 5, 6, 134, 8, 10, 11, 136, 13, 14, 139, 16, 17, 18, 20, 21, 24, 28, 31, 32, 161, 35, 38, 40, 42, 43, 45, 54, 56, 58, 60, 62, 64, 68, 69, 71, 72, 73, 76, 78, 84, 88, 90, 93, 99, 102, 104, 106, 109, 110, 115, 117, 118, 119, 120, 126] terminal clus in this component [128, 129, 3, 5, 6, 133, 8, 10, 11, 135, 13, 14, 138, 16, 17, 18, 20, 21, 24, 28, 31, 32, 160, 35, 38, 40, 42, 43, 45, 54, 56, 58, 59, 61, 63, 67, 68, 70, 71, 72, 75, 77, 83, 87, 89, 92, 98, 101, 103, 105, 108, 109, 114, 116, 117, 118, 119, 125] final terminal clus [129, 130, 3, 5, 6, 134, 8, 10, 11, 136, 13, 14, 139, 16, 17, 18, 20, 21, 24, 28, 31, 32, 161, 35, 38, 40, 42, 43, 45, 54, 56, 58, 60, 62, 64, 68, 69, 71, 72, 73, 76, 78, 84, 88, 90, 93, 99, 102, 104, 106, 109, 110, 115, 117, 118, 119, 120, 126] From root 26 to Terminal state 128 is found 207 times. From root 26 to Terminal state 129 is found 5 times. From root 26 to Terminal state 3 is found 5 times. From root 26 to Terminal state 5 is found 5 times. From root 26 to Terminal state 6 is found 5 times. From root 26 to Terminal state 133 is found 5 times. From root 26 to Terminal state 8 is found 5 times. From root 26 to Terminal state 10 is found 5 times. From root 26 to Terminal state 11 is found 5 times. From root 26 to Terminal state 135 is found 5 times. From root 26 to Terminal state 13 is found 5 times. From root 26 to Terminal state 14 is found 5 times. From root 26 to Terminal state 138 is found 5 times. From root 26 to Terminal state 16 is found 5 times. From root 26 to Terminal state 17 is found 5 times. From root 26 to Terminal state 18 is found 5 times. From root 26 to Terminal state 20 is found 5 times. From root 26 to Terminal state 21 is found 5 times. From root 26 to Terminal state 24 is found 5 times. From root 26 to Terminal state 28 is found 5 times. From root 26 to Terminal state 31 is found 5 times. From root 26 to Terminal state 32 is found 19 times. From root 26 to Terminal state 160 is found 5 times. From root 26 to Terminal state 35 is found 5 times. From root 26 to Terminal state 38 is found 5 times. From root 26 to Terminal state 40 is found 5 times. From root 26 to Terminal state 42 is found 5 times. From root 26 to Terminal state 43 is found 5 times. From root 26 to Terminal state 45 is found 5 times. From root 26 to Terminal state 54 is found 5 times. From root 26 to Terminal state 56 is found 5 times. From root 26 to Terminal state 58 is found 5 times. From root 26 to Terminal state 59 is found 5 times. From root 26 to Terminal state 61 is found 5 times. From root 26 to Terminal state 63 is found 5 times. From root 26 to Terminal state 67 is found 5 times. From root 26 to Terminal state 68 is found 5 times. From root 26 to Terminal state 70 is found 5 times. From root 26 to Terminal state 71 is found 5 times. From root 26 to Terminal state 72 is found 5 times. From root 26 to Terminal state 75 is found 5 times. From root 26 to Terminal state 77 is found 5 times. From root 26 to Terminal state 83 is found 5 times. From root 26 to Terminal state 87 is found 5 times. From root 26 to Terminal state 89 is found 205 times. From root 26 to Terminal state 92 is found 5 times. From root 26 to Terminal state 98 is found 5 times. From root 26 to Terminal state 101 is found 5 times. From root 26 to Terminal state 103 is found 5 times. From root 26 to Terminal state 105 is found 5 times. From root 26 to Terminal state 108 is found 77 times. From root 26 to Terminal state 109 is found 5 times. From root 26 to Terminal state 114 is found 9 times. From root 26 to Terminal state 116 is found 5 times. From root 26 to Terminal state 117 is found 5 times. From root 26 to Terminal state 118 is found 5 times. From root 26 to Terminal state 119 is found 5 times. From root 26 to Terminal state 125 is found 5 times. start computing lazy-teleporting Expected Hitting Times ended all multiprocesses, will retrieve and reshape super_terminal_clusters [129, 130, 3, 5, 6, 134, 8, 10, 11, 136, 13, 14, 139, 16, 17, 18, 20, 21, 24, 28, 31, 32, 161, 35, 38, 40, 42, 43, 45, 54, 56, 58, 60, 62, 64, 68, 69, 71, 72, 73, 76, 78, 84, 88, 90, 93, 99, 102, 104, 106, 109, 110, 115, 117, 118, 119, 120, 126] no sub cluster has majority made of super-cluster 129

IndexError Traceback (most recent call last) /tmp/ipykernel_60176/2984432610.py in 8 super_terminal_clusters=v_1.terminal_clusters, random_seed=1, name='v_2') 9 ---> 10 v_2.run_VIA()

~/viaenv/lib/python3.7/site-packages/pyVIA/core.py in run_VIA(self) 3389 # Query dataset, k - number of closest elements (returns 2 numpy arrays) 3390 -> 3391 self.run_subPARC() 3392 run_time = time.time() - time_start_total 3393 print('time elapsed {:.1f} seconds'.format(run_time))

~/viaenv/lib/python3.7/site-packages/pyVIA/core.py in run_subPARC(self) 2854 sub_terminal_clustemp.append(most_likely_sub_terminal) 2855 -> 2856 if (markov_hitting_times_ai[loc_j_in_sub_ai] > np.percentile( 2857 np.asarray(markov_hitting_times_ai), self.pseudotime_threshold_TS)): # 30 2858

IndexError: index 125 is out of bounds for axis 0 with size 2

Thanks in advance Tian

ShobiStassen commented 2 years ago

hi,

i think the issue here is that in the second iteration you are getting two disconnected components and I'm not sure if this is also the case in the first iteration. If the first pass (v_1) also yields two disconnected components, then I would suggest only running a single pass as it is not so easy to carry forward the terminal states from the first iteration onto the second iteration. If you mainly just want more granularity then you can lower the too_big_factor parameter already in v_1 (set it to 0.2 or 0.15, instead of 0.3) and see if you get a more fine-grained trajectory (or even multiple disconnected trajectories). You might also consider using more than just the first two PCs for your input as VIA can handle higher number of input dimensions.

tianlt commented 2 years ago

Thank you for reply and explanation. I tuned both the too_big_factor and the number of PCs as input and only use the first pass. It runs successfully on one dataset. But on another dataset i got another error; IndexError: color index must be non-negative.

The full error and output I got are;

/home/tialan/viaenv/lib/python3.7/site-packages/anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index. warnings.warn("Transforming to str index.", ImplicitModificationWarning) input data has shape 1000 (samples) x 5 (features) time is Tue Nov 9 14:40:04 2021 commencing global pruning Share of edges kept after Global Pruning 46.72 % number of components in the original full graph 2 for downstream visualization purposes we are also constructing a low knn-graph size neighbor array in low-KNN in pca-space for visualization (1000, 5) commencing community detection time is Tue Nov 9 14:40:04 2021 285 clusters before handling small/big There are 0 clusters that are too big humanCD34 : global cluster graph pruning level 0.15 number of components before pruning 2 number of connected componnents after reconnecting 2 percentage links trimmed from local pruning relative to start 0.0 percentage links trimmed from global pruning relative to start 56.9 there are 2 components in the graph root user [0] start computing lazy-teleporting Expected Hitting Times /home/tialan/viaenv/lib/python3.7/site-packages/scipy/sparse/_index.py:82: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. self._set_intXint(row, col, x.flat[0]) ended all multiprocesses, will retrieve and reshape closeness shortlist [0, 1, 2, 3, 7, 8, 9, 10, 12, 13, 14, 16, 18, 19, 21, 22, 23, 24, 27, 28, 30, 31, 33, 34, 35, 39, 40, 41, 46, 47, 52, 54, 56, 57, 63, 65, 73] betweeness shortlist [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 17, 20, 21, 22, 23, 28, 29, 31, 33, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 53, 55, 56, 57, 60, 61, 63, 64, 66, 67, 68, 70, 73] out degree shortlist [2, 3, 4, 7, 8, 9, 11, 12, 13, 15, 16, 17, 18, 21, 22, 23, 24, 26, 27, 28, 29, 31, 33, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 48, 49, 53, 55, 57, 61, 66, 67, 70, 71, 73, 74] TS 1 had 2 or more neighboring terminal states, namely [10, 41, 57] and so we removed, 1 TS 7 had 2 or more neighboring terminal states, namely [18, 22] and so we removed, 18 TS 9 had 2 or more neighboring terminal states, namely [3, 13] and so we removed, 3 TS 13 had 2 or more neighboring terminal states, namely [9, 31] and so we removed, 9 TS 21 had 2 or more neighboring terminal states, namely [23, 28] and so we removed, 21 TS 22 had 2 or more neighboring terminal states, namely [7, 28] and so we removed, 7 TS 35 had 2 or more neighboring terminal states, namely [8, 23] and so we removed, 35 terminal clus in this component [2, 4, 8, 10, 13, 16, 17, 22, 23, 24, 28, 29, 31, 37, 38, 39, 41, 42, 43, 44, 47, 48, 49, 53, 55, 57, 66, 67, 70, 73] final terminal clus [2, 4, 8, 10, 13, 16, 17, 22, 23, 24, 28, 29, 31, 37, 38, 39, 41, 42, 43, 44, 47, 49, 50, 54, 56, 58, 67, 68, 71, 74] From root 14 to Terminal state 2 is found 5 times. From root 14 to Terminal state 4 is found 5 times. From root 14 to Terminal state 8 is found 489 times. From root 14 to Terminal state 10 is found 5 times. From root 14 to Terminal state 13 is found 5 times. From root 14 to Terminal state 16 is found 485 times. From root 14 to Terminal state 17 is found 7 times. From root 14 to Terminal state 22 is found 486 times. From root 14 to Terminal state 23 is found 488 times. From root 14 to Terminal state 24 is found 5 times. From root 14 to Terminal state 28 is found 493 times. From root 14 to Terminal state 29 is found 5 times. From root 14 to Terminal state 31 is found 5 times. From root 14 to Terminal state 37 is found 5 times. From root 14 to Terminal state 38 is found 6 times. From root 14 to Terminal state 39 is found 5 times. From root 14 to Terminal state 41 is found 5 times. From root 14 to Terminal state 42 is found 5 times. From root 14 to Terminal state 43 is found 5 times. From root 14 to Terminal state 44 is found 5 times. From root 14 to Terminal state 47 is found 5 times. From root 14 to Terminal state 48 is found 5 times. From root 14 to Terminal state 49 is found 5 times. From root 14 to Terminal state 53 is found 6 times. From root 14 to Terminal state 55 is found 5 times. From root 14 to Terminal state 57 is found 5 times. From root 14 to Terminal state 66 is found 5 times. From root 14 to Terminal state 67 is found 5 times. From root 14 to Terminal state 70 is found 5 times. From root 14 to Terminal state 73 is found 5 times. terminal clusters [2, 4, 8, 10, 13, 16, 17, 22, 23, 24, 28, 29, 31, 37, 38, 39, 41, 42, 43, 44, 47, 49, 50, 54, 56, 58, 67, 68, 71, 74] project onto single cell start single cell projections of pseudotime and lineage likelihood number of components before pruning 2 number of connected componnents after reconnecting 2 percentage links trimmed from local pruning relative to start 17.9 percentage links trimmed from global pruning relative to start 48.1 /home/tialan/viaenv/lib/python3.7/site-packages/pyVIA/core.py:2926: RuntimeWarning: invalid value encountered in true_divide bp_array = bp_array / bp_array.sum(axis=1)[:, None] /home/tialan/viaenv/lib/python3.7/site-packages/pyVIA/core.py:3037: RuntimeWarning: divide by zero encountered in long_scalars scaled_hitting_times = scaled_hitting_times (1000 / np.max(scaled_hitting_times)) /home/tialan/viaenv/lib/python3.7/site-packages/pyVIA/core.py:3037: RuntimeWarning: invalid value encountered in multiply scaled_hitting_times = scaled_hitting_times (1000 / np.max(scaled_hitting_times))

IndexError Traceback (most recent call last) /tmp/ipykernel_29627/4051984019.py in 38 return 39 ---> 40 runvia_simulated_data('bc_9') 41

/tmp/ipykernel_29627/4051984019.py in runvia_simulated_data(name) 27 is_coarse=True, preserve_disconnected=True, name=name) 28 ---> 29 v_1.run_VIA() 30 31 #

~/viaenv/lib/python3.7/site-packages/pyVIA/core.py in run_VIA(self) 3413 # Query dataset, k - number of closest elements (returns 2 numpy arrays) 3414 -> 3415 self.run_subPARC() 3416 run_time = time.time() - time_start_total 3417 print('time elapsed {:.1f} seconds'.format(run_time))

~/viaenv/lib/python3.7/site-packages/pyVIA/core.py in run_subPARC(self) 3051 3052 for i in scaled_hitting_times: -> 3053 all_colors.append(pal.get(int(i))[0:3]) 3054 3055 locallytrimmed_g.vs['hitting_times'] = scaled_hitting_times

~/viaenv/lib/python3.7/site-packages/igraph/drawing/colors.py in get(self, v) 87 if isinstance(v, int): 88 if v < 0: ---> 89 raise IndexError("color index must be non-negative") 90 if v >= self._length: 91 raise IndexError("color index too large")

IndexError: color index must be non-negative

ShobiStassen commented 2 years ago

have you tried leaving the too_big_factor as 0.3 for this second dataset? or changing the graph pruning (jac_std_global=0.15, to a higher value of e.g. 1.0) to make sure that you are not over fragmenting the graph and getting a component with just a single cluster, which might be causing some division by zero? i havent encountered this before, so am trying to guess what might be causing this behaviour

tianlt commented 2 years ago

Yes and I tried to set the jac_std_global=1.0 but get another error: ValueError: dimension mismatch

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_53985/4111017426.py in <module>
     39     return
     40 
---> 41 runvia_simulated_data('bc_9')
     42 #runvia_simulated_data('bt_5')
     43 #runvia_simulated_data('l_0')

/tmp/ipykernel_53985/4111017426.py in runvia_simulated_data(name)
     28              is_coarse=True, preserve_disconnected=True, name=name) 
     29 
---> 30     v_1.run_VIA()
     31 
     32     #

~/viaenv/lib/python3.7/site-packages/pyVIA/core.py in run_VIA(self)
   3413         # Query dataset, k - number of closest elements (returns 2 numpy arrays)
   3414 
-> 3415         self.run_subPARC()
   3416         run_time = time.time() - time_start_total
   3417         print('time elapsed {:.1f} seconds'.format(run_time))

~/viaenv/lib/python3.7/site-packages/pyVIA/core.py in run_subPARC(self)
   2951         print(colored('project onto single cell', 'red'))
   2952         #self.project_branch_probability_sc(bp_array, df_graph['pt'].values) #testing to see what the effect of not doing MCMC is
-> 2953         self.project_branch_probability_sc(bp_array, df_graph['markov_pt'].values)
   2954         self.dict_terminal_super_sub_pairs = dict_terminal_super_sub_pairs
   2955         hitting_times = self.markov_hitting_times

~/viaenv/lib/python3.7/site-packages/pyVIA/core.py in project_branch_probability_sc(self, bp_array_clus, pt)
   1640         weight_array = csr_matrix((np.array(weight_list), (np.array(row_list), np.array(col_list))),
   1641                                   shape=(n_cells, n_clus))
-> 1642         bp_array_sc = weight_array.dot(bp_array_clus)
   1643         bp_array_sc = bp_array_sc * 1. / np.max(bp_array_sc, axis=0)  # divide cell by max value in that column
   1644         # print('column max:',np.max(bp_array_sc, axis=0))

~/viaenv/lib/python3.7/site-packages/scipy/sparse/base.py in dot(self, other)
    357 
    358         """
--> 359         return self * other
    360 
    361     def power(self, n, dtype=None):

~/viaenv/lib/python3.7/site-packages/scipy/sparse/base.py in __mul__(self, other)
    514 
    515             if other.shape[0] != self.shape[1]:
--> 516                 raise ValueError('dimension mismatch')
    517 
    518             result = self._mul_multivector(np.asarray(other))

I also attached my input code here;

def runvia_simulated_data(name):
    data_1 = pd.read_csv('/home/tialan/Data/cell_trajectory/code_new/' + str(name) + '.csv')
    data_1 = data_1.T
    data_1.columns = data_1.iloc[0]
    data_1 = data_1[1:]
    data_1 = pd.read_csv(StringIO(data_1.to_csv(index=False)))
    #true_label_1 = data_1.iloc[:,0]
    true_label_1 = np.concatenate((np.repeat(a = 0, repeats = 500), np.repeat(a = 1, repeats = 500)),axis=None)
    true_label_1 = pd.Series(true_label_1)

    adata = sc.AnnData(data_1)
    sc.tl.pca(adata, svd_solver='arpack', n_comps=10)
    #embedding = umap.UMAP(random_state=42, n_neighbors=15, init='random').fit_transform(data_1.values)
    root = [0]
    #adata.obsm['X_pca'] = data_1.values

    v_1 = via.VIA(adata.obsm['X_pca'][:, 0:5], true_label_1, jac_std_global=1.0, dist_std_local=1, knn=5,
             too_big_factor=0.3, root_user=root, random_seed=1,
             is_coarse=True, preserve_disconnected=True, name=name) 

    v_1.run_VIA()

    #
    super_clus_ds_PCA_loc = via.sc_loc_ofsuperCluster_PCAspace(v_1, v_1, np.arange(0, len(v_1.labels))) #location in PCA space of terminal state
    via.draw_trajectory_gams(adata.obsm['X_pca'][:, 0:5], super_clus_ds_PCA_loc, v_1.labels, v_1.labels, v_1.edgelist_maxout,
                             v_1.x_lazy, v_1.alpha_teleport, v_1.single_cell_pt_markov, true_label_1, knn=v_1.knn,
                             final_super_terminal=v_1.revised_super_terminal_clusters,
                             sub_terminal_clusters=v_1.terminal_clusters,
                             title_str='pseudotime', ncomp=5, arrow_width_scale_factor=30, name=name)
    return

runvia_simulated_data('bc_9')
ShobiStassen commented 2 years ago

you are welcome to share the synthetic data in a link as I cannot quite tell why this is happening and think it might have to do with how the underlying data is structured which is causing some outlier effects. your knn is quite low, have you tried increasing to knn = 10 or knn 15. also try using higher number of PCs. im trying to see if there is some strange behvaiour occuring at certain params

ShobiStassen commented 2 years ago

what does the umap look like?

tianlt commented 2 years ago

Thank you for the advice. It worked after I increased the pc numbers to 10 and the knn to 15.