getting denoised counts

shobhitagrawal1 commented 2 days ago

Dear Peter,

I am reporting an issue with the software. I've been running the workflow as suggested in the demo notebook and getting good results for the latent dimension representation, gene weights, and separation of the two condition values (treatment and tissue in my case). However, I'm encountering a problem when trying to get denoised counts.

Issue Description

When attempting to obtain denoised counts, I consistently end up with greater than 1 values for only a very few genes (4 out of 20,400). These 4 genes are also the most expressed genes in the matrix, while the rest of the values are all less than 1. I'm using raw counts as input and have tried several runs with different combinations, unfortunately without success.

Code and Setup

Here's my basic script for setting up and running the model:

# Set up and run the model
data_dict = create_scdisinfact_dataset(
    counts=counts,
    meta_cells=meta_cells,
    condition_key=["tissue", "condition"],
    batch_key="site"
)

reg_mmd_comm = 1e-4
reg_mmd_diff = 1e-4
reg_kl_comm = 1e-5
reg_kl_diff = 1e-2
reg_class = 1
reg_gl = 1

Ks = [15, 2, 2]  # 8 dimensions as recommended were not separating cell clusters - so upped it
batch_size = 128
nepochs = 500  # i also ran with a small number of epochs
interval = 10
lr = 5e-4
lambs = [reg_mmd_comm, reg_mmd_diff, reg_kl_comm, reg_kl_diff, reg_class, reg_gl]
model = scdisinfact(data_dict=data_dict, Ks=Ks, batch_size=batch_size, interval=interval, lr=lr, 
                    reg_mmd_comm=reg_mmd_comm, reg_mmd_diff=reg_mmd_diff, reg_gl=reg_gl, reg_class=reg_class, 
                    reg_kl_comm=reg_kl_comm, reg_kl_diff=reg_kl_diff, seed=0, device=device)
model.train()
losses = model.train_model(nepochs=nepochs, recon_loss="NB")

Attempts to Get Denoised Counts

I've tried two approaches to get denoised counts:

1. Following the notebook:

# Get unique IDs
unique_ids = adata_subset.obs['id'].unique()

# Initialize an array to store the denoised counts
denoised_counts = np.zeros_like(adata_subset.X)

for id in unique_ids:
    # Create a boolean mask for the current ID
    id_mask = (adata_subset.obs['id'] == id).values

    # Extract counts for the current ID
    counts_test = adata_subset.X[id_mask, :]

    # Extract metadata for the current ID and convert categorical columns to string
    meta_test = adata_subset.obs.loc[id_mask, :].copy()
    for col in ['tissue', 'condition', 'site']:
        if isinstance(meta_test[col].dtype, pd.CategoricalDtype):
            meta_test[col] = meta_test[col].astype(str)

    # Predict denoised counts
    counts_test_denoised = model.predict_counts(
        input_counts=counts_test,
        meta_cells=meta_test,
        condition_keys=["tissue", "condition"],
        batch_key="site",
        predict_conds=None,
        predict_batch=None
    )

    # Store the denoised counts
    denoised_counts[id_mask, :] = counts_test_denoised

# Add the denoised counts as a new layer 'denoise_ref'
adata_subset.layers['denoise_ref'] = denoised_counts

print("Denoised counts have been added as 'denoise_ref' layer to adata_new.")
a = np.sum(adata_subset.layers['denoise_ref'] > 1, axis=0)
set(a)
# Output: {0, 11225, 39996, 63637, 189042}

2. Extracting from the original count matrix:

_ = model.eval()

# Concatenate original counts
original_counts = np.concatenate([dataset.counts.cpu().numpy() for dataset in data_dict["datasets"]], axis=0)

# Concatenate denoised counts (mu)
denoised_counts = []
z_cs = []
z_ds = [[], []]  # One list for each condition

for dataset in data_dict["datasets"]:
    with torch.no_grad():
        dict_inf = model.inference(counts=dataset.counts_norm.to(model.device), 
                                   batch_ids=dataset.batch_id[:,None].to(model.device), 
                                   print_stat=True)
        dict_gen = model.generative(z_c=dict_inf["mu_c"], 
                                    z_d=dict_inf["mu_d"], 
                                    batch_ids=dataset.batch_id[:,None].to(model.device))

        denoised_counts.append(dict_gen["mu"].cpu().detach().numpy())
        z_cs.append(dict_inf["mu_c"].cpu().detach().numpy())
        z_ds[0].append(dict_inf["mu_d"][0].cpu().detach().numpy())
        z_ds[1].append(dict_inf["mu_d"][1].cpu().detach().numpy())

denoised_counts = np.concatenate(denoised_counts, axis=0)
# This denoised counts also has only 4 greater than 1 values for all genes

Both approaches result in only 4 genes having values greater than 1 in the denoised counts.

I would greatly appreciate any insights or suggestions on how to resolve this issue. Thank you for your time and assistance.

PeterZZQ commented 1 day ago

Hi, is the expression of these 4 genes significantly different from the remaining genes? May I see the histogram of the denoised gene expression data? Also it might be useful to see the umap visualization of the denoised counts. I think it might be a scaling issue.

PeterZZQ commented 1 day ago

Hi, we fond the bug within the prediction function that caused the scaling issue, and fixed the code in the newest update. Please use the newest version of the code.

shobhitagrawal1 commented 13 hours ago

Dear Peter, Thank you for your response. Much appreciated. I can send you the histogram and the UMAP- the UMAP was a blob of points, I think my next input will be after I have run the new version as you suggested (thank you for updating!). As far as the identity of the 4 genes is concerned - yes they are highly expressed and much higher than other (therefore I was wondering normalized counts would work better). In any case, I will redo the analysis and respond. Thank you once again, this tool is perfect for my study design, therefore I appreciate it all the more.

PeterZZQ commented 7 hours ago

Oh you mean that the clusters are not separated in the umap visualization after denoising? It sounds more like an issue related to the training process than the scaling issue. I think the 4 gene is not the issue that causes the mixing of cell types. The model is not reconstructing/denoising the input successfully. Did you check the latent space and observe the cell type separation in the latent space? How about the reconstruction loss? If the cell types in the latent space are separated and reconstruction loss converges, then when you reconstruct the data it should at least show some cell type separation in the umap visualization. I see that you set the latent space to 15 because 8 doesn't work, this should not happen as 8 is generally more than enough to observe separation. Maybe you can try to set all the regularization to 0, using only the reconstruction loss for sanity check. It should definitely give you cell type separation for the denoised count.

ZhangLabGT / scDisInFact