ShobiStassen / VIA

trajectory inference
https://pyvia.readthedocs.io/en/latest/
MIT License

Zero-size array to reduction operation #11

Closed: GreenGilad closed this issue 2 years ago

GreenGilad commented 2 years ago

Hi again,

I'm trying to use VIA to find trajectories in a small 2D (PHATE) embedding. The data itself is not single cell but is derived from a single cell dataset. Based on the analysis I suspect several connected trajectories as well as a couple of disconnected ones. Since I do not have any real true_label to give these data points I am passing [0,.....,0].

import numpy as np
import pyVIA.core as via

X = np.loadtxt("input_data.csv", skiprows=1, delimiter=",")
v0 = via.VIA(X, [0]*X.shape[0], knn=5, root_user=[1], is_coarse=True, preserve_disconnected=False)
v0.run_VIA()

The above code outputs the following logs:

input data has shape 400 (samples) x 2 (features)
class <class 'numpy.ndarray'>
time is Mon Jan 17 11:04:47 2022
commencing global pruning
Share of edges kept after Global Pruning 46.58 %
number of components in the original full graph 5
for downstream visualization purposes we are also constructing a low knn-graph
size neighbor array in low-KNN in pca-space for visualization (400, 5)
commencing community detection
time is Mon Jan 17 11:04:47 2022
163 clusters before handling small/big
There are 0 clusters that are too big
humanCD34 : global cluster graph pruning level 0.15
number of components before pruning 5

And then fails with the error

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19348/1204227179.py in <module>
      2
      3 v0 = via.VIA(X, [0]*X.shape[0], knn=5, root_user=[1], is_coarse=True, preserve_disconnected=False)
----> 4 v0.run_VIA()
      5
      6 # tsi_list = via.get_loc_terminal_states(v0, X_in)

~\anaconda3\envs\ViaEnv\lib\site-packages\pyVIA\core.py in run_VIA(self)
   3508         # Query dataset, k - number of closest elements (returns 2 numpy arrays)
   3509
-> 3510         self.run_subPARC()
   3511         run_time = time.time() - time_start_total
   3512         print('time elapsed {:.1f} seconds'.format(run_time))

~\anaconda3\envs\ViaEnv\lib\site-packages\pyVIA\core.py in run_subPARC(self)
   2681             global_pruning_std=global_pruning_std,
   2682             preserve_disconnected=self.preserve_disconnected,
-> 2683             preserve_disconnected_after_pruning=self.preserve_disconnected_after_pruning)
   2684         self.connected_comp_labels = comp_labels
   2685

~\anaconda3\envs\ViaEnv\lib\site-packages\pyVIA\core.py in pruning_clustergraph(adjacency_matrix, global_pruning_std, max_outgoing, preserve_disconnected, preserve_disconnected_after_pruning)
   1064
   1065     if (n_components > 1) & (preserve_disconnected == False):
-> 1066         cluster_graph_csr = connect_all_components(Tcsr, cluster_graph_csr, adjacency_matrix)
   1067         n_components, comp_labels = connected_components(csgraph=cluster_graph_csr, directed=False, return_labels=True)
   1068

~\anaconda3\envs\ViaEnv\lib\site-packages\pyVIA\core.py in connect_all_components(MSTcsr, cluster_graph_csr, adjacency_matrix)
    960         sub_td = MSTcsr[comp_labels == 0, :][:, comp_labels != 0]
    961         # print('minimum value of link connecting components', np.min(sub_td.data))
--> 962         locxy = scipy.sparse.find(MSTcsr == np.min(sub_td.data))
    963         for i in range(len(locxy[0])):
    964             if (comp_labels[locxy[0][i]] == 0) & (comp_labels[locxy[1][i]] != 0):

<__array_function__ internals> in amin(*args, **kwargs)

~\anaconda3\envs\ViaEnv\lib\site-packages\numpy\core\fromnumeric.py in amin(a, axis, out, keepdims, initial, where)
   2857     """
   2858     return _wrapreduction(a, np.minimum, 'min', axis, None, out,
-> 2859                           keepdims=keepdims, initial=initial, where=where)
   2860
   2861

~\anaconda3\envs\ViaEnv\lib\site-packages\numpy\core\fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     85             return reduction(axis=axis, out=out, **passkwargs)
     86
---> 87     return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
     88
     89

ValueError: zero-size array to reduction operation minimum which has no identity

I am not able to figure out what is causing this error and how to fix it.
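For reference, the final ValueError can be reproduced outside VIA with a toy graph. This is only a sketch of the failure mode visible in the traceback (calling np.min on the .data of a sparse slice that holds no stored edges). It is an assumption, not confirmed from the VIA source, that this is exactly what happens inside connect_all_components. The toy graph, the weights, and the safe_min helper are hypothetical and not part of pyVIA.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy weighted graph with two components: nodes {0, 1} and nodes {2, 3}.
# No edge links the two groups (hypothetical data, not taken from VIA).
rows = np.array([0, 1, 2, 3])
cols = np.array([1, 0, 3, 2])
weights = np.array([0.5, 0.5, 0.7, 0.7])
graph = csr_matrix((weights, (rows, cols)), shape=(4, 4))

n_components, comp_labels = connected_components(
    graph, directed=False, return_labels=True
)

# Same slicing pattern as in the traceback:
# rows belonging to component 0, columns belonging to every other component.
sub_td = graph[comp_labels == 0, :][:, comp_labels != 0]


def safe_min(values):
    """Hypothetical guard: return None instead of raising on an empty array."""
    return None if values.size == 0 else float(np.min(values))


try:
    np.min(sub_td.data)  # raises because sub_td stores no edges
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(n_components, sub_td.data.size, error_message, safe_min(sub_td.data))
```

In this toy case the slice between component 0 and the rest is empty, so the reduction has nothing to take a minimum over.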

In addition, along the way I found two minor bugs:

ShobiStassen commented 2 years ago

Hi,

I have a few suggestions you can try first:

  1. Change preserve_disconnected to True (you currently have it set to False).
  2. Consider setting the jac_std_global and dist_std_local parameters to 1 first.

Have you tried running it on more than 2 components, or do you intend to use only 2 PHATE components as input?

v0 = via.VIA(X, [0]*X.shape[0], knn=5, root_user=[1], is_coarse=True, preserve_disconnected=True, jac_std_global=1, dist_std_local=1)
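Before choosing a value for preserve_disconnected, it can help to check whether a kNN graph on the embedding is already split into several components. The sketch below uses SciPy only and does not reproduce VIA's own graph construction or pruning, so its counts may differ from the ones VIA logs. The function count_knn_components and the synthetic array X_demo are hypothetical stand-ins (X_demo replaces input_data.csv).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree


def count_knn_components(points, k):
    """Count connected components of an undirected k-nearest-neighbour graph.

    Hypothetical helper for illustration; VIA builds its graph differently.
    """
    n = points.shape[0]
    tree = cKDTree(points)
    # Query k + 1 neighbours because each point's nearest neighbour is itself.
    _, idx = tree.query(points, k=k + 1)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()  # drop the self-neighbour in column 0
    adjacency = csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
    # directed=False treats an edge in either direction as a connection.
    return connected_components(adjacency, directed=False, return_labels=True)


# Synthetic stand-in for the 2D embedding: two well-separated blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.1, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])

n_components, labels = count_knn_components(X_demo, k=5)
print("components in kNN graph:", n_components)  # at least 2 for this data
```

If the count is greater than 1 with the knn value you plan to use, the data is disconnected at that scale, which is the situation the preserve_disconnected parameter is about.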
ShobiStassen commented 2 years ago

do_magic_bool has been replaced by do_impute_bool (I'll update this in the notebook later). If you intend to do any gene imputation, please set do_impute_bool to True so that the imputation function can be called when needed later on:

v0 = via.VIA(anndata.obsm['X_pca'][:, 0:n_pcs], knn=20, ..., do_impute_bool=True)
.....
df_ = pd.DataFrame(anndata.X)  # create dataframe for input to imputation function
df_.columns = [i for i in anndata.var_names]

# start imputation
gene_list_impute = ['IL3RA', 'IRF8', 'GATA1', 'GATA2', 'ITGA2B', 'MPO', 'CD79B', 'SPI1', 'CD34', 'CSF1R', 'ITGAX']
df_imputed = v0.do_impute(df_, magic_steps=3, gene_list=gene_list_impute)

GreenGilad commented 2 years ago

I had already played with the different parameters and it didn't help, but I (in a very silly manner) misinterpreted the preserve_disconnected parameter. Setting preserve_disconnected=True solved it.

As for using only the 2 components: I might run it on the higher-dimensional dataset from which I obtained this PHATE embedding, but that dataset is not at single-cell scale either; it has about 100 dimensions.