aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
167 stars 27 forks source link

NaN values for DEG with scenicplus #70

Closed tu4n-p closed 1 year ago

tu4n-p commented 1 year ago

Hi there, I'm using scenicplus for separate scATAC seq and scRNAseq from my lab following the melanoma tutorial. Things went fine for the most part but scenicplus wasn't able to generate DEGs for DEG-DAR with the following message. I went back to the code and figured out that the a lot of values of RNA expression level from scenicplus pseudocell generation became NaN after normalization and log1p transformation and that led to the NaN mean expression which led to aforementioned problem. It'd be great to hear your thoughts on how I should go about this. Should I normalize data in a different way or should I skip the normallization part? Thank you, This is the command I ran and the error message in brief:

Screen Shot 2022-11-21 at 12 33 41 AM

This is the indepth error message: 2022-11-21 00:24:30,030 SCENIC+_wrapper INFO /scratch/JEK/scenicplus2 folder already exists. 2022-11-21 00:24:30,030 SCENIC+_wrapper INFO Calculating TF-eGRNs AUC correlation 2022-11-21 00:24:35,229 SCENIC+_wrapper INFO Calculating DEGs/DARs 2022-11-21 00:24:35,230 SCENIC+ INFO Calculating DEGs for variable refined_annotation

ValueError Traceback (most recent call last) Input In [12], in <cell line: 2>() 21 except Exception as e: 22 #in case of failure, still save the object 23 dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus2/scplus_obj.pkl'), 'wb'), protocol=-1) ---> 24 raise(e)

Input In [12], in <cell line: 2>() 2 try: 3 sys.stderr = open(os.devnull, "w") # silence stderr ----> 4 run_scenicplus( 5 scplus_obj = scplus_obj, 6 variable = ['refined_annotation'], 7 species = 'dmelanogaster', 8 assembly = 'dm6', 9 tf_file = '/gpfs/home/tpham43/14_16/allTFs_dmel-lite.txt', 10 save_path = os.path.join(work_dir, 'scenicplus2'), 11 biomart_host = biomart_host, 12 upstream = [100, 1500], 13 downstream = [100, 1500], 14 calculate_TF_eGRN_correlation = True, 15 calculate_DEGs_DARs = True, 16 export_to_loom_file = True, 17 export_to_UCSC_file = True, 18 n_cpu = 12, 19 _temp_dir = os.path.join(tmp_dir, 'ray_spill'))
20 sys.stderr = sys.stderr # unsilence stderr 21 except Exception as e: 22 #in case of failure, still save the object

File /gpfs/home/tpham43/scenicplus/src/scenicplus/wrappers/run_scenicplus.py:305, in run_scenicplus(scplus_obj, variable, species, assembly, tf_file, save_path, biomart_host, upstream, downstream, region_ranking, gene_ranking, simplified_eGRN, calculate_TF_eGRN_correlation, calculate_DEGs_DARs, export_to_loom_file, export_to_UCSC_file, tree_structure, path_bedToBigBed, n_cpu, _temp_dir, **kwargs) 303 log.info('Calculating DEGs/DARs') 304 for var in variable: --> 305 get_differential_features(scplus_obj, var, use_hvg = True, contrast_type = ['DEGs', 'DARs']) 307 if export_to_loom_file is True: 308 log.info('Exporting to loom file')

File /gpfs/home/tpham43/scenicplus/src/scenicplus/diff_features.py:89, in get_differential_features(scplus_obj, variable, use_hvg, contrast_type, adjpval_thr, log2fc_thr, min_cells) 87 sc.pp.log1p(adata) 88 if use_hvg: ---> 89 sc.pp.highly_variable_genes( 90 adata, min_mean=0.0125, max_mean=3, min_disp=min_disp, max_disp=np.inf) 91 var_features = adata.var.highly_variable[adata.var.highly_variable].index.tolist( 92 ) 93 adata = adata[:, var_features]

File ~/.local/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py:432, in highly_variable_genes(adata, layer, n_top_genes, min_disp, max_disp, min_mean, max_mean, span, n_bins, flavor, subset, inplace, batch_key, check_values) 420 return _highly_variable_genes_seurat_v3( 421 adata, 422 layer=layer, (...) 428 inplace=inplace, 429 ) 431 if batch_key is None: --> 432 df = _highly_variable_genes_single_batch( 433 adata, 434 layer=layer, 435 min_disp=min_disp, 436 max_disp=max_disp, 437 min_mean=min_mean, 438 max_mean=max_mean, 439 n_top_genes=n_top_genes, 440 n_bins=n_bins, 441 flavor=flavor, 442 ) 443 else: 444 sanitize_anndata(adata)

File ~/.local/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py:216, in _highly_variable_genes_single_batch(adata, layer, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor) 213 df['dispersions'] = dispersion 214 if flavor == 'seurat': 215 #print(df['means']) --> 216 df['mean_bin'] = pd.cut(df['means'], bins=n_bins) 217 disp_grouped = df.groupby('mean_bin')['dispersions'] 218 disp_mean_bin = disp_grouped.mean()

File ~/.local/lib/python3.8/site-packages/pandas/core/reshape/tile.py:293, in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates, ordered) 290 if (np.diff(bins.astype("float64")) < 0).any(): 291 raise ValueError("bins must increase monotonically.") --> 293 fac, bins = _bins_to_cuts( 294 x, 295 bins, 296 right=right, 297 labels=labels, 298 precision=precision, 299 include_lowest=include_lowest, 300 dtype=dtype, 301 duplicates=duplicates, 302 ordered=ordered, 303 ) 305 return _postprocess_for_cut(fac, bins, retbins, dtype, original)

File ~/.local/lib/python3.8/site-packages/pandas/core/reshape/tile.py:420, in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates, ordered) 418 if len(unique_bins) < len(bins) and len(bins) != 2: 419 if duplicates == "raise": --> 420 raise ValueError( 421 f"Bin edges must be unique: {repr(bins)}.\n" 422 f"You can drop duplicate edges by setting the 'duplicates' kwarg" 423 ) 424 else: 425 bins = unique_bins

ValueError: Bin edges must be unique: array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]). You can drop duplicate edges by setting the 'duplicates' kwarg

Version (please complete the following information):

Additional context Add any other context about the problem here.

SeppeDeWinter commented 1 year ago

Hi @tu4n-p

Could you show the output of following commands:

scplus_obj.to_df('EXP')
scplus_obj.to_df('ACC')
scplus_obj.to_df('EXP').sum(0)
scplus_obj.to_df('EXP').sum(1)
scplus_obj.to_df('ACC').sum(0)
scplus_obj.to_df('ACC').sum(1)

Best,

Seppe

tu4n-p commented 1 year ago

Here they are, as for rna-seq, I used scaled data. Thank you so much for looking into this, T

Screen Shot 2022-11-23 at 1 53 49 PM Screen Shot 2022-11-23 at 1 53 58 PM Screen Shot 2022-11-23 at 1 54 13 PM Screen Shot 2022-11-23 at 1 54 20 PM
SeppeDeWinter commented 1 year ago

Sorry for the late reply.

I think the reason for these NaNs is because your gene expression data contains negative values (it's probably already normalized), inside the function to calculate differentially expressed genes it will be log-normalised, probably resulting in the NaN values.

We expect the gene expression to be raw counts.

It will probably not affect the GRN inference that much though.

You can try again using raw counts or you can skip this step (it's optional), by setting calculate_DEGs_DARs to False.

Hope this helps.

Best,

Seppe

tu4n-p commented 1 year ago

Got it. Thank you!