Some question about reproducing result

HantaoShu commented 2 years ago

Hi Alexander P. Wu, I have some questions about reproducing results on the ICLR paper. 1), how to choose the iroot for each dataset in the construct_dag function. 2) File model.py Line:20-21 Should it be max(rna_idx) +1 instead of len(rna_idx)? The max number of rna_idx can be larger than the length of rna_idx. 3) As introduced in the APPENDIX A1 of the paper, PCAs are used for both ATAC-seq and RNA-seq during the joint embedding generation. But I found the following error:

Input: rna = sc.read_h5ad('../process_data/A549/RNA0_filter_0.05_counts.h5ad') atac = sc.read_h5ad('../process_data/A549/ATAC0_filter_0.001_counts.h5ad') sc.pp.normalize_per_cell(rna,100000) sc.pp.normalize_per_cell(atac,10000) sc.pp.log1p(rna) sc.pp.log1p(atac) sc.pp.pca(rna,n_comps=100) sc.pp.pca(atac,n_comps=100) from scipy.stats import spearmanr rna_pca,atac_pca = [],[] for i in range(100): if (spearmanr(rna.obsm['X_pca'][:,i],rna.X.toarray().sum(1))[0])<0.9: rna_pca.append(i) for i in range(100): if (spearmanr(atac.obsm['X_pca'][:,i],atac.X.toarray().sum(1))[0])<0.9: atac_pca.append(i) sqp = schema.SchemaQP( min_desired_corr=0.9,params= {'decomposition_model': 'nmf', 'num_top_components': 20} ) mod_X = sqp.fit_transform( rna.obsm['X_pca'][:,rna_pca],[atac.obsm['X_pca'][:,atac_pca]], [ 'feature_vector' ] )

Output:

ValueError Traceback (most recent call last)

in 3 mod_X = sqp.fit_transform( rna.obsm['X_pca'][:,rna_pca], # primary modality 4 [atac.obsm['X_pca'][:,atac_pca]], # list of secondary modalities ----> 5 [ 'feature_vector' ] ) # datatypes of secondary modalities ~/anaconda2/envs/geometric/lib/python3.7/site-packages/schema/schema_qp.py in fit_transform(self, d, secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list, secondary_data_dist_kernels, d0, d0_dist_transform) 416 """ 417 --> 418 self.fit(d, secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list, secondary_data_dist_kernels, d0, d0_dist_transform) 419 return self.transform(d) 420 ~/anaconda2/envs/geometric/lib/python3.7/site-packages/schema/schema_qp.py in fit(self, d, secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list, secondary_data_dist_kernels, d0, d0_dist_transform) 375 self._fit_scale(t_d, t_d0, t_secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list) 376 else: #affine --> 377 self._fit_affine(t_d, t_d0, t_secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list) 378 379 ~/anaconda2/envs/geometric/lib/python3.7/site-packages/schema/schema_qp.py in _fit_affine(self, d, d0, secondary_data_val_list, secondary_data_type_list, secondary_data_wt_list) 1098 with warnings.catch_warnings(): 1099 warnings.simplefilter("ignore") -> 1100 self._decomp_mdl.fit(dx1) 1101 1102 self._orig_decomp_mdl = copy.deepcopy(self._decomp_mdl) ~/anaconda2/envs/geometric/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py in fit(self, X, y, **params) 1340 self 1341 """ -> 1342 self.fit_transform(X, **params) 1343 return self 1344 ~/anaconda2/envs/geometric/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py in fit_transform(self, X, y, W, H) 1315 l1_ratio=self.l1_ratio, regularization=self.regularization, 1316 random_state=self.random_state, verbose=self.verbose, -> 1317 shuffle=self.shuffle) 1318 1319 self.reconstruction_err_ = _beta_divergence(X, W, H, self.beta_loss, ~/anaconda2/envs/geometric/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs) 61 extra_args = len(args) - len(all_args) 62 if extra_args <= 0: ---> 63 return f(*args, **kwargs) 64 65 # extra_args > 0 ~/anaconda2/envs/geometric/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py in non_negative_factorization(X, W, H, n_components, init, update_H, solver, beta_loss, tol, max_iter, alpha, l1_ratio, regularization, random_state, verbose, shuffle) 1022 X = check_array(X, accept_sparse=('csr', 'csc'), 1023 dtype=[np.float64, np.float32]) -> 1024 check_non_negative(X, "NMF (input X)") 1025 beta_loss = _check_string_param(solver, regularization, beta_loss, init) 1026 ~/anaconda2/envs/geometric/lib/python3.7/site-packages/sklearn/utils/validation.py in check_non_negative(X, whom) 1123 1124 if X_min < 0: -> 1125 raise ValueError("Negative values in data passed to %s" % whom) 1126 1127 ValueError: Negative values in data passed to NMF (input X) Looking for your response.

rs239 commented 2 years ago

If you already have the PCA computed, you shouldn't be calling Schema with NMF decomposition enabled-- instead you just want the mode='affine' argument. Pls see https://schema-multimodal.readthedocs.io/en/latest/api/index.html

Something like this maybe? SchemaQP(min_desired_corr=0.9, mode='affine', params={})

HantaoShu commented 2 years ago

Thank you for your response. Could you kindly answer the other two questions, especially how to choose iroot for datasets used in the paper?

alexw16 commented 2 years ago

Hi Hantao, Thank you for your questions. 1) For choosing the root cell, we identified the cell with the highest expression of a marker gene corresponding to the cell type/condition that is expected biologically to lie at the beginning of the trajectory/biological process (e.g., Dlx3 in TAC progenitors for SHARE-seq). 2) We used len(rna_idx) to determine the number of genes to consider in the model, as some genes in a dataset may be too sparsely represented. In our use case, "rna_idx" represents a list of indices for just the set of genes which we want to include and analyze in our model.

HantaoShu commented 2 years ago

Thanks for your response!

alexw16 / gridnet

Some question about reproducing result #1

Output: