ZhangLabGT / scMultiSim

A simulator for single cell multi-omics and spatial omics data that provides ground truth to benchmark a wide range of methods.
https://zhanglabgt.github.io/scMultiSim/

Strange expression patterns of simulated atacseq data #7

Open HelloWorldLTY opened 10 months ago

HelloWorldLTY commented 10 months ago

Hi, I noticed that scMultiSim can simulate multi-omic data, but I have some questions about the simulated data.

It seems that for the ATAC-seq data, the region-by-cell matrix is not count data, and it contains values smaller than 1. May I know the reason? Is it the result of TF-IDF processing? If so, what should I do if I intend to run count-based methods such as MultiVI? Thanks.

lhc70000 commented 10 months ago

Hi, the value here is an indication of the openness of the chromatin region. If an integer count number is needed, it should be fine to just use round().

HelloWorldLTY commented 10 months ago

Hi, thanks for your explanation. However, I am still confused about what it means for the value to represent the openness of the chromatin region.

According to its definition, it measures the number of fragment hits in the given region, so why is it a float (not even a fraction) rather than an integer? Do you have a specific distributional assumption for this simulation? Thanks.

I am not sure rounding will work. For example, 3.4 and 3.6 are not very different, but they would be converted to 3 and 4, which have a larger gap.

lhc70000 commented 10 months ago

Basically, we first sample x from a distribution x ~ D, where D is the distribution fitted from a real ATAC dataset's log-transformed counts, then output y = 2^x+1 as the simulated ATAC count matrix to recover the original data distribution. That's why it contains float numbers.

I think rounding should be fine because it will not change the distribution much, and noise is already introduced during simulation and sampling anyway.
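As a rough illustration of the steps described above, here is a minimal base R sketch. It is not scMultiSim's code: the normal distribution is only a stand-in for the fitted distribution D, the matrix size and parameters are arbitrary, and the back-transform is read literally from the comment as (2^x) + 1.

```r
# Sketch under stated assumptions; NOT the package's implementation.
set.seed(1)
n_regions <- 5  # arbitrary size for illustration
n_cells <- 4

# Step 1: sample x ~ D. A normal distribution stands in for D here;
# the real D is fitted from a real ATAC dataset's log-transformed counts.
x <- matrix(rnorm(n_regions * n_cells, mean = 1, sd = 1),
            nrow = n_regions, ncol = n_cells)

# Step 2: back-transform, reading the formula in the thread as (2^x) + 1.
# The result is non-integer because x is continuous.
y <- 2^x + 1

# Step 3: round to integers for count-based methods.
counts <- round(y)
storage.mode(counts) <- "integer"

# Rounding moves each entry by at most 0.5.
max_shift <- max(abs(counts - y))
print(max_shift)
```

Each entry changes by at most 0.5, which is the sense in which rounding leaves the overall distribution largely intact, although neighbouring values such as 3.4 and 3.6 do land on different integers.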

HelloWorldLTY commented 10 months ago

Got it, thanks a lot.

Furthermore, I wonder whether you plan to export your results to a format like h5ad (AnnData), which is easier to handle in Python. Most RNA velocity inference methods are Python-based, so I think this could be a good way to promote your work.

HelloWorldLTY commented 10 months ago

Here is another bug (at least I think) I just found:

It seems that there is no tree structure like Phyla1? But according to the tutorial, we can use Phyla1 in our simulation. Did I miss something? Thanks.

lhc70000 commented 10 months ago

Sorry for my late reply. Phyla1 only has one argument: len, which is the branch length.

https://github.com/ZhangLabGT/scMultiSim/blob/f63c686d8ee42dfb976a2dc93354edfc286c6373/R/8_utils.R#L97

The tree contains one branch connecting the root and only one leaf:

Root ------> A
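To make that structure concrete, here is a small base R sketch that builds an object with the same shape by hand, using the fields of ape's "phylo" class (edge, edge.length, Nnode, tip.label). The helper one_branch_tree is hypothetical and is not how Phyla1 is implemented in the package; it only mirrors the one-branch layout.

```r
# Hypothetical helper for illustration; NOT scMultiSim's Phyla1().
# Builds a "phylo"-style list with one internal node (the root) and one tip.
one_branch_tree <- function(len = 1) {
  structure(
    list(
      # One edge: from node 2 (the root) to node 1 (tip "A").
      edge = matrix(c(2L, 1L), nrow = 1, ncol = 2),
      edge.length = len,  # the branch length
      Nnode = 1L,         # number of internal nodes
      tip.label = "A"
    ),
    class = "phylo"
  )
}

tree <- one_branch_tree(len = 2)
# Inspect the raw fields without relying on ape's print method.
str(unclass(tree))
```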

There's no plotting argument because R refuses to plot it: it has only one tip and is therefore not considered a tree. People likely don't need to visualize it anyway, since the structure is so simple.

> Phyla1()

Phylogenetic tree with 1 tips and 1 internal nodes.

Tip labels:
  A

Rooted; includes branch lengths.

HelloWorldLTY commented 10 months ago

Ok, thanks a lot. I will try this approach.