Cannot generate data with cluster structure

PeterZZQ commented 2 years ago

Hi,

I wish to use sergio to benchmark the data imputation method. When I run the run_sergio.ipynb example and visualize the clean expression data, I cannot see any clear cluster structure from the visualization (there should be 9 clusters according to the setting in the example function if I understand the readme file correctly). Please see the screenshot below for the code and visualization result:

I also visualize the data after adding technical noise, still there is no cluster structure:

May I know how I can fix the problem? Thanks!

PayamDiba commented 2 years ago

let's start with the clean simulated data (i.e. before adding technical noise). The clean data, since represents the true mRNA levels in each cell, does not requires library size normalization. So, you can try using the clean simulated data, as it is, for clustering. Also, it would be good if you could first run PCA on the clean data to extract the first few PCs (e.g. top 20 PCs) and then use that as an input for UMAP. Also can you color the cells according to their cell type assignment so that we could see the global structure. Please make these changes and let me know how the results look like. After that we can move to noisy simulated data.

PeterZZQ commented 2 years ago

Ok it shows me something like this: The code that I used

sim = sergio(number_genes=100, number_bins = 9, number_sc = 300, noise_params = 1, decays=0.8, sampling_state=15, noise_type='dpd')
sim.build_graph(input_file_taregts ='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Interaction_cID_4.txt', input_file_regs='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Regs_cID_4.txt', shared_coop_state=2)
sim.simulate()
expr = sim.getExpressions()
expr_clean = np.concatenate(expr, axis = 1)
anno = [np.array([str(x)] * 300, dtype=np.object) for x in range(9)]
anno = np.concatenate(anno, axis = 0)

# Plot
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP 
import matplotlib.pyplot as plt
pca_op = PCA(n_components = 20)
tsne_op = TSNE(n_components = 2)
umap_op = UMAP(n_components = 2, min_dist = 0.4, n_neighbors = 15)
counts_true = expr_clean.T
x_pca = pca_op.fit_transform(counts_true)
x_umap = tsne_op.fit_transform(x_pca)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PayamDiba commented 2 years ago

Can you try log1p transformation on the clean simulated data before applying PCA? i.e. first transform the clean data with log(1+data) then apply PCA followed by UMAP or tSNE. Let me know the results.

PeterZZQ commented 2 years ago

Yes, here is the result:

The code that I used, the simulation part is from the run_sergio.ipynb:

sim = sergio(number_genes=100, number_bins = 9, number_sc = 300, noise_params = 1, decays=0.8, sampling_state=15, noise_type='dpd')
sim.build_graph(input_file_taregts ='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Interaction_cID_4.txt', input_file_regs='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Regs_cID_4.txt', shared_coop_state=2)
sim.simulate()
expr = sim.getExpressions()
expr_clean = np.concatenate(expr, axis = 1)
anno = [np.array([str(x)] * 300, dtype=np.object) for x in range(9)]
anno = np.concatenate(anno, axis = 0)

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP 
import matplotlib.pyplot as plt
pca_op = PCA(n_components = 20)
tsne_op = TSNE(n_components = 2)
umap_op = UMAP(n_components = 2, min_dist = 0.4, n_neighbors = 15)
counts_true = expr_clean.T
# counts_true = counts_true/np.sum(counts_true, axis = 1)[:, None] * 100
counts_true = np.log1p(counts_true)
x_pca = pca_op.fit_transform(counts_true)
x_umap = tsne_op.fit_transform(x_pca)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PayamDiba commented 2 years ago

mmm ... it looks strange. Can you try simulating another GRN (the one with 400 genes). Also can you plot the umap results (I think your plots above are for tSNE).

PeterZZQ commented 2 years ago

Yes, it seems to be better for 400 genes. The code

sim = sergio(number_genes=100, number_bins = 9, number_sc = 300, noise_params = 1, decays=0.8, sampling_state=15, noise_type='dpd')
sim.build_graph(input_file_taregts ='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Interaction_cID_4.txt', input_file_regs='data_sets/De-noised_100G_9T_300cPerT_4_DS1/Regs_cID_4.txt', shared_coop_state=2)
sim.simulate()
expr = sim.getExpressions()
expr_clean = np.concatenate(expr, axis = 1)
anno = [np.array([str(x)] * 300, dtype=np.object) for x in range(9)]
anno = np.concatenate(anno, axis = 0)

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP 
import matplotlib.pyplot as plt
pca_op = PCA(n_components = 20)
# tsne_op = TSNE(n_components = 2)
umap_op = UMAP(n_components = 2, min_dist = 0.4, n_neighbors = 50)
counts_true = expr_clean.T
# counts_true = counts_true/np.sum(counts_true, axis = 1)[:, None] * 100
counts_true = np.log1p(counts_true)
x_pca = pca_op.fit_transform(counts_true)
x_umap = umap_op.fit_transform(x_pca)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PeterZZQ commented 2 years ago

May I know why there are line-like structures? Feels like it still hasn't reached the steady-state. Should I add more safety_steps? Currently is 0:

PayamDiba commented 2 years ago

Great, for 100 genes I think you should lower the "noise_params" (maybe setting it to 0.1 or 0.3 instead of 1).

The line-like structure is due to the clean nature of simulated data, after adding technical noise you will get more familiar structures. I think you don't need to add safety steps, simulations are already started from regions close to steady-state region.

After adding technical noise, you need to do all standard normalizations including library size normalizations followed by log1p transformation.

PeterZZQ commented 2 years ago

Ok after I reduce the noise_params to 0.1, the result for 100 gene is like But after I add the technical noise the result is like: Here is the code for adding technical noise, should I adjust some parameters?

"""
Add outlier genes
"""
expr_O = sim.outlier_effect(expr, outlier_prob = 0.01, mean = 0.8, scale = 1)

"""
Add Library Size Effect
"""
libFactor, expr_O_L = sim.lib_size_effect(expr_O, mean = 4.6, scale = 0.4)

"""
Add Dropouts
"""
binary_ind = sim.dropout_indicator(expr_O_L, shape = 6.5, percentile = 82)
expr_O_L_D = np.multiply(binary_ind, expr_O_L)

"""
Convert to UMI count
"""
count_matrix = sim.convert_to_UMIcounts(expr_O_L_D)

"""
Make a 2d gene expression matrix
"""
count_matrix = np.concatenate(count_matrix, axis = 1)

PeterZZQ commented 2 years ago

And I normalized and log-transformed the observed count

from umap import UMAP 
import matplotlib.pyplot as plt
umap_op = UMAP(n_components = 2, min_dist = 0.1)
counts_obs = count_matrix.T
counts_obs = counts_obs/(np.sum(counts_obs, axis = 1)[:, None] + 1e-6)* 100
counts_obs = np.log1p(counts_obs)
x_umap = umap_op.fit_transform(counts_obs)
fig = plt.figure(figsize = (10, 7))
ax = fig.add_subplot()
colormap = plt.get_cmap("Paired")
for i, clust in enumerate(np.sort(np.unique(anno))):
    idx = np.where(anno == clust)[0]
    ax.scatter(x_umap[idx, 0], x_umap[idx, 1], color = colormap(i), label = clust)
ax.legend()

PayamDiba commented 2 years ago

Can you plot UMAP instead? Also, incorrect noise parameters can result in excessive technical noise which in turn causes inaccurate low-dimensional representation. You can read the noise parameters used for each data from the supplementary files of the paper. I'll soon add to the github a python script for optimization of noise parameters with respect to a provided real data.

PeterZZQ commented 2 years ago

Ok I tried adjusting the technical noise parameters. I changed themean of lib_size_effect from 4.6 to 6.8, and the percentile of dropout_indicator from 82 to 50. The visualization seems to be well clustered this time, but I'm not sure if that is a reasonable change because the original values don't seem to be randomly decided.

PayamDiba commented 2 years ago

Reducing dropout percentile from 82 to ~65 or so should also result in good clusters (even without changing "library_size_effect"). The parameters of technical noise should be tuned with respect to a user-defined real data (such that certain statistical measures of real and simulated data become comparable.) I'll soon add an automated python script for such an optimization.

anisioti commented 1 year ago

Hello! Thank you @PeterZZQ for raising the issue and @PayamDiba for the insightful replies and the nice work! I am currently working on an application that studies graph inference in settings with heterogeneity. I want to test it using Sergio and relatively small graphs (at most 25 nodes). I saw the comment above about using lower values of the "noise_params" and it indeed makes a lot of difference on how the resulting data look like. I wanted to ask if you have any high level explanation on why smaller values of this noise parameter are more appropriate for graphs with lower number of nodes/genes. Thank you in advance!

PayamDiba / SERGIO

Cannot generate data with cluster structure #14