BayraktarLab / cell2location

Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)
https://cell2location.readthedocs.io/en/latest/
Apache License 2.0
307 stars 57 forks

Reconstruction accuracy and QC plots #177

Open wangjiawen2013 opened 2 years ago

wangjiawen2013 commented 2 years ago

Hi, thanks for your great work! I am using cell2location to analyze my single-cell and spatial data. However, mod.plot_QC (https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html) does not produce the first plot (posterior expected value), while the second plot (mean expression for every gene in every cluster) is generated successfully.

The tutorial says:

1. Reconstruction accuracy to assess if there are any issues with inference. This 2D histogram plot should have most observations along a noisy diagonal.
2. The estimated expression signatures are distinct from mean expression in each cluster because of batch effects. For scRNA-seq datasets which do not suffer from batch effect (this dataset does), cluster average expression can be used instead of estimating signatures with a model. When this plot is very different from a diagonal plot (e.g. very low values on Y-axis, density everywhere) it indicates problems with signature estimation.

So what is a "batch effect"? My dataset has only one sample and the experiment was performed only once; does that mean there are no batch effects in my dataset? My data look very noisy in the QC plot below; where might the noise come from? My single-cell dataset is mouse brain single-nucleus RNA-seq, not single-cell RNA-seq. Does cell2location support single-nucleus RNA-seq, or only single-cell data?
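
The "noisy diagonal" criterion quoted from the tutorial can be illustrated numerically. The following is a minimal sketch on synthetic data, not part of cell2location: the function name `diagonal_fraction`, the tolerance of 0.2 and the simulated counts are all assumptions made for illustration. It applies the same log10(x + 1) transform that the QC histograms use and reports the fraction of points that fall close to the diagonal.

```python
import numpy as np


def diagonal_fraction(observed, expected, tolerance=0.2):
    """Fraction of points within `tolerance` of the diagonal on log10(x + 1) axes.

    Hypothetical helper for illustration; cell2location does not provide it.
    """
    obs = np.log10(np.asarray(observed, dtype=float) + 1.0)
    exp = np.log10(np.asarray(expected, dtype=float) + 1.0)
    return float(np.mean(np.abs(obs - exp) < tolerance))


rng = np.random.default_rng(0)

# Synthetic "expected" values standing in for a model's posterior expectation.
expected = rng.gamma(shape=2.0, scale=5.0, size=5000)

# Counts generated from the expectation: these should hug the diagonal.
consistent_counts = rng.poisson(expected)

# Counts unrelated to the expectation: these should scatter away from it.
unrelated_counts = rng.poisson(expected.mean(), size=expected.size)

frac_consistent = diagonal_fraction(consistent_counts, expected)
frac_unrelated = diagonal_fraction(unrelated_counts, expected)
print(f"near diagonal (consistent): {frac_consistent:.2f}")
print(f"near diagonal (unrelated):  {frac_unrelated:.2f}")
```

A single number like this is cruder than the 2D histogram, but it shows what "most observations along a noisy diagonal" means: the consistent case keeps a clearly larger share of points near the diagonal than the unrelated case.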

[Screenshot 2022-07-10 1:58:00 PM]

vitkl commented 2 years ago

Try adding plt.show() after each plot.
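
Why separating the plots matters can be shown with plain matplotlib. This is a sketch under assumptions, not cell2location code: the data are synthetic, the function names are made up for illustration, and it uses an explicit `plt.figure()` call instead of `plt.show()` so that it behaves the same under the non-interactive Agg backend. In a notebook, `plt.show()` typically finishes the current figure so that the next pyplot call starts a new one, which gives the same separation.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(scale=0.3, size=1000)


def draw_without_separation():
    """Two pyplot calls in a row: both histograms land on the same Axes."""
    plt.close("all")
    plt.hist2d(x, y, bins=20)
    plt.title("first plot")
    plt.hist2d(y, x, bins=20)
    plt.title("second plot")  # replaces the title set above
    return plt.get_fignums(), plt.gca().get_title()


def draw_with_separation():
    """Open a new figure before each plot so that nothing is shared."""
    plt.close("all")
    plt.figure()
    plt.hist2d(x, y, bins=20)
    plt.title("first plot")
    plt.figure()
    plt.hist2d(y, x, bins=20)
    plt.title("second plot")
    return plt.get_fignums()


shared_figures, surviving_title = draw_without_separation()
separate_figures = draw_with_separation()
print(len(shared_figures), surviving_title, len(separate_figures))
```

In the first function only one figure exists at the end and only the last title survives, which matches the symptom of two plots being merged into one.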

wangjiawen2013 commented 1 year ago

Hi, mod.plot_QC() indeed generates only one plot. In addition, the titles and the x- and y-axis labels of the two plots are mixed up and merged into a single plot. Here is the source code, where you can see it: https://cell2location.readthedocs.io/en/latest/_modules/cell2location/models/reference/_reference_model.html#RegressionModel.plot_QC

def plot_QC(
    self,
    summary_name: str = "means",
    use_n_obs: int = 1000,
    scale_average_detection: bool = True,
):
    """
    Show quality control plots:
    1. Reconstruction accuracy to assess if there are any issues with model training.
        The plot should be roughly diagonal, strong deviations signal problems that need to be investigated.
        Plotting is slow because expected value of mRNA count needs to be computed from model parameters. Random
        observations are used to speed up computation.

    2. Estimated reference expression signatures (accounting for batch effect)
        compared to average expression in each cluster. We expect the signatures to be different
        from average when batch effects are present, however, when this plot is very different from
        a perfect diagonal (such as very low values on Y-axis, non-zero density everywhere)
        it indicates problems with signature estimation.

    Parameters
    ----------
    summary_name
        posterior distribution summary to use ('means', 'stds', 'q05', 'q95')

    Returns
    -------

    """

    super().plot_QC(summary_name=summary_name, use_n_obs=use_n_obs)
    plt.show()

    inf_aver = self.samples[f"post_sample_{summary_name}"]["per_cluster_mu_fg"].T
    if scale_average_detection and ("detection_y_c" in list(self.samples[f"post_sample_{summary_name}"].keys())):
        inf_aver = inf_aver * self.samples[f"post_sample_{summary_name}"]["detection_y_c"].mean()
    aver = self._compute_cluster_averages(key=REGISTRY_KEYS.LABELS_KEY)
    aver = aver[self.factor_names_]

    plt.hist2d(
        np.log10(aver.values.flatten() + 1),
        np.log10(inf_aver.flatten() + 1),
        bins=50,
        norm=matplotlib.colors.LogNorm(),
    )
    plt.xlabel("Mean expression for every gene in every cluster")
    plt.ylabel("Estimated expression for every gene in every cluster")
    plt.show()
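
One way to keep the second panel from sharing state with the first is to draw it on its own Figure and Axes. The sketch below is an assumption-laden illustration, not the library's method: `plot_signature_qc` is a hypothetical helper, and the two arrays are synthetic stand-ins for the `aver` and `inf_aver` quantities computed in the source quoted above. Only the transform, the `hist2d` arguments and the axis labels are taken from that source.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.colors
import matplotlib.pyplot as plt
import numpy as np


def plot_signature_qc(aver, inf_aver, bins=50):
    """Draw the signature-vs-average histogram on a dedicated Figure.

    Hypothetical helper; `aver` and `inf_aver` are arrays of the same shape.
    """
    fig, ax = plt.subplots()  # explicit Axes, nothing inherited from earlier plots
    ax.hist2d(
        np.log10(np.asarray(aver, dtype=float).flatten() + 1),
        np.log10(np.asarray(inf_aver, dtype=float).flatten() + 1),
        bins=bins,
        norm=matplotlib.colors.LogNorm(),
    )
    ax.set_xlabel("Mean expression for every gene in every cluster")
    ax.set_ylabel("Estimated expression for every gene in every cluster")
    return fig, ax


# Synthetic stand-ins: 500 genes by 4 clusters, estimates scattered around the averages.
rng = np.random.default_rng(0)
aver = rng.gamma(shape=1.0, scale=3.0, size=(500, 4))
inf_aver = aver * rng.lognormal(mean=0.0, sigma=0.1, size=aver.shape)

fig, ax = plot_signature_qc(aver, inf_aver)
fig.savefig("signature_qc.png")  # hypothetical output file name
```

Because the labels are set on a specific Axes object rather than through the pyplot state machine, they cannot end up on a different plot.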

wangjiawen2013 commented 1 year ago

Here is the image from the cell2location documentation: [image] And this is the image output by cell2location when I ran the code from the documentation. The title, the axis labels and the number of images all differ (there is only one here): [image]