scATAC: how to start from fragment.tsv.gz & per-peak tn5 bias estimates

vitkl commented 1 year ago

Nice tool, congratulations! I would like to use it on 10x ATAC/multiome data. Could you please explain the following?

It would be great to see an example of how to start from fragment.tsv.gz standard cellranger output. Is what needed simply a conversion of that file into bed format and passing it to --input_fragments=FRAGMENTs? If you have the code to do that would be great if you share.
I need per-peak tn5 bias estimates (one value per peak) - is that stored in NAME_biasExpCuts.bw?

Tarela commented 1 year ago

Hi Vitkl,

Thanks for trying SELMA.

(It seems that I should answer q2 first) Since you were analyzing 10x single cell ATAC-seq data, I believe the "sc mode" of SELMA is what you need instead of the "bulk mode". Since the "sc mode" doesn't gneerate biasExpCuts.bw file, I believe you were using the wrong mode. Please take a look at the section 3. RUN SELMA (usage) in the readme document to get an example cmd line for the "sc mode".

in the "sc mode" you will get an output file named as NAME_peakFeatures.txt.gz, which contain the "per-peak tn5 bias estimates (one value per peak)" as you described in the last column ("avebias"). The higher the value is, the greater effect the intrinsic Tn5 cleavage bias has on that peak. See the section 8. "output for SELMA sc mode with testing data input" in the readme document as an example.

You can use the fragment.tsv.gz as a direct input to SELMA. Rememebr to add the 10x specific parameter "--scATAC10x" in the cmd line to let SELMA know that the fragment files were from 10x pipeline. Please take a look at the section 3. RUN SELMA (usage) in the readme document to get an example cmd line for the "sc mode".

Let me know if you have further questions. Feel free to comment/suggest any potential new features you would expect from SELMA.

Thanks Shawn

vitkl commented 1 year ago

hi @Tarela

Thanks for the earlier explanations! I am trying to understand the meaning of this score avebias. As far as I understand from the paper, avebias is the SELMA bias average across all fragments and all cells that fall within a given peak. The bias score is proportional to the difference between predicted tn5 bias for the observed fragment compared to the background frequency of the fragment k-mer in the entire genome. Do I understand this correctly?

The score seems to be largely positive (but with a tiny number of peaks with avebias < 0), the score is close to 1, and the correlation between 10x batches (multiome) is high but not perfect.

I would like to know:

Is the expected Pearson R of avebias between two samples = 0.9? Does fairly low Pearson R (compared to Fig 1g) mean that the correct way to apply SELMA to many 10x batches is A) combining the fragments file from all batches into one file OR B) applying SELMA to each batch separately and then computing the average of avebias across batches?
What is the correct way to interpret how avebias affects data? Linear scaling in cut site counts space? Linear scaling in log space? See the equations below.

Let's assume we have a relatively simply scATAC model like this:

d_cp ~ Poisson(mu = mu_cp * avebias_p * y_c)
mu_cp = VAE(d_cp)

where d_cp is observed cut site count, mu_cp is "biological accessibility" generated by a model such as VAE (e.g. scVI), y_c is per cell normalisation and avebias_p is the bias score we are discussing.

What is the correct way to apply avebias_p for scaling?

d_cp ~ Poisson(mu = mu_cp * avebias_p * y_c)
# assuming (mu_cp * avebias_p * y_c) > 0

or

d_cp ~ Poisson(mu = exp(mu_cp * avebias_p) * y_c)

?

Tarela commented 1 year ago

Hi Vitkl,

In order to be clear, a. The intrinsic cleavage bias of Tn5 (we called SELMA bias score, mainly discussed in Fig1), reflect whether a k-mer is preferred by the Tn5 enzyme. b. The peak bias score (referred as PBS in the paper, and referred as avebias in the SELMA single-cell mode output) is specifically designed in single-cell ATAC-seq analysis to reflect whether a peak should be re-weighted in the single cell clustering analysis. These two scores are largely different in terms of definition and usage.

"avebias is the SELMA bias average across all fragments and all cells that fall within a given peak." correct!

" The bias score is proportional to the difference between predicted tn5 bias for the observed fragment compared to the background frequency of the fragment k-mer in the entire genome." If the bias score referred to the avebias mentioned above, not really. The avebias (in the paper we called it peak bias score, PBS) was calculated by averaging the Tn5 cleavage bias of all the insertion observed on that peak. If the bias score referred to the intrinsic cleavage bias for each k-mer (discussed mainly in Figure1), the intrinsic cleavage bias is originally estimated by comparing the number of this k-mer being inserted to the number of k-mer sequence observed. And we further use simplex encoding model to improve the estimation. You can check the method section for more details.

"I would like to know:"

since you mentioned "compared to Fig 1g" in your description, I'd like to highlight that the avebias(PBS) in the single-cell mode output is not the SELMA cleavage bias score and cannot be compared to Fig 1g. If you were talking about the PBS between two samples, the PBS is similar among samples in the same system (unless the data is highly affected by the batch effect) and might be largely different in different biological system (e.g., celltypes, tissues, species)
The avebias(PBS) reflected the feature of the input data (data driven) and can always be used to correct the bias effect in the single cell clustering.

"correct way to apply SELMA to many 10x batches" the peak detection could be done after batch effect removal (with whatever method you preferred). Then you can run the SELMA sc mode with manual input of the peak you just called and the combined fragments from batches.

"What is the correct way to interpret how avebias affects data" I haven't tried scVI on scATAC-seq data and compared with the intrinsic cleavage bias effect. I would be interesting to see how the intrinsic cleavage bias works in their deep generative model. If I would like to considering integrate intrinsic cleavage bias in your model, I'd prefer to use the SELMA bias score (see point.a in the first paragraph, the bias discussed in Fig 1) in stead of the avebias(PBS). For example, every read/insertion would be affected by the bias and the effect can be estimated differently according to the sequence it insert, the chromatin environment of the local region (e.g., accessibility, histone mark / DNAme feature, potential chromatin remodeler binding)

PBS is an integrative score generated from SELMA bias score, and is specifically designed for scATAC clustering analysis. If you would like to integrate it into the deep generative model, why not starting from the SELMA bias score and do your own integration?

Hope that helps Let me know if you have further questions. Best Shawn

vitkl commented 1 year ago

Hi Shawn @Tarela

Thank you so much for this level of detail. It is very helpful.

scVI doesn't account for the tn5 bias. I would like to understand how to modify that model to account for this bias using the outputs of your package. scVI and other scATAC models operate on the count matrix of peaks * cells which contain cut site or fragment counts (these models don't operate on fragments but on per peak per cell summary).

The intrinsic cleavage bias of Tn5 [SELMA bias score]

This value can be computed per fragment/cut site, right? If 1) PBS/avebias is simply the average of [SELMA bias score] across all cells and all fragments that map to a given peak and 2) the model operates at the level of per peak counts - wouldn't it make sense to use PBS as a weight?

Or are you saying that it would be better to compute the average of [SELMA bias score] across all fragments within one cell to have a tn5 bias prediction for every peak and every cell?

If yes, how can this number be derived by applying your package to scATAC fragments file?

vitkl commented 1 year ago

One more question about the properties of the intrinsic cleavage bias of Tn5 (any reference would be great): 1) Does Tn5 cleave in closed chromatin (e.g. nucleosome protected) when the sequence matches Tn5 preference? 2) Does Tn5 cleave only in the set of genomic regions with open chromatin but it more frequently cleaves when the sequence matches Tn5 preference? 3) we don't know

Tarela commented 1 year ago

Hi Vitkl,

if you only want a peakXcell level correction results, SELMA actually provided a PBS corrected peakXcell matrix for your further analysis. In order to get it, you can simply add --keep-tmp parameter in the "SELMA sc mode", and in the tmpResults/ folder. You will find a plain text file named as outname_correctionMat.txt, which is the PBS corrected peakXcell matrix you are looking for. See details of correction in Figure 6 and Methods section for the computation of the PBS and correction method.

"Or are you saying that it would be better to compute the average of [SELMA bias score] across all fragments within one cell to have a tn5 bias prediction for every peak and every cell?"

I don't have the perfect answer for that. The current correction in scATAC analysis is the best way we can do, and provide the best results in all the real scATAC data (across different platform) we've tested.

However, I guess correction of ATAC data (in both bulk and sc) on reads/fragments level should be a higher level and better way. For example, for every fragment you observed in the ATAC-seq data, and you know the cleavage sequence and SELMA bias score of it. Then you design a model to correct/re-weight every fragment. By doing this, you are actually correcting the bias in the very basic level. We've tried a lot of methods in this way but unluckily none of them provided satisfied answer. Since you were probably working on the deep generative model (ScVI), you can probably think of this issue in the above way. Hopefully the deep learning model will provide greater power in this analysis.

BTW, "Or are you saying that it would be better to compute the average of [SELMA bias score] across all fragments within one cell to have a tn5 bias prediction for every peak and every cell?" Seems not a good way :(

For your 2nd message: "Does Tn5 cleave in closed chromatin (e.g. nucleosome protected) when the sequence matches Tn5 preference?" Yes, clearly observed in the ATAC-seq data, even for the sequence that Tn5 doesn't like. It's actually a possibility of every region to be inserted by Tn5, and the possibility is affected by chromatin structure, Tn5 cleavage bias and other features. Even if a region gets a really low possibility due to the low value of the features (with closed chromatin, low bias and for example low GC%), it still gets a tiny chance to be inserted, and whatever impossible things can happen in a 3e9 bp length genome in NGS data :).

"Does Tn5 cleave only in the set of genomic regions with open chromatin but it more frequently cleaves when the sequence matches Tn5 preference?" I believe open chromatin have higher/stronger effect on the ATAC-seq reads distribution, although the effect of cleavage bias cannot be ignored.

vitkl commented 1 year ago

Hi @Tarela

Thanks again - great discussion.

1.

BTW, "Or are you saying that it would be better to compute the average of [SELMA bias score] across all fragments within one cell to have a tn5 bias prediction for every peak and every cell?" Seems not a good way :(

How is the PBS-corrected peakXcell matrix computed?

Does PBS-corrected peakXcell matrix still have integer counts of cut sites or are cut site counts weighted by the SELMA bias score? For many models, it is preferable to have integer counts and find the effect of the SELMA bias score on the expected Poisson rate within the model.
Re Tn5 bias mechanisms.

I believe open chromatin have higher/stronger effect on the ATAC-seq reads distribution, although the effect of cleavage bias cannot be ignored.

Does this mean that the cleavage bias mainly manifests in increasing cut site frequencies in closed chromatin?

Tarela commented 1 year ago

Sorry for the late reply.

Intuitively for the peaks with extreme lower PBS or higher PBS, we assigned lower weight to the peaks; for the peaks with moderate PBS, we assigned higher weight for the peaks. The relationship between PBS and weight are data driven (from scATAC data from different cell types and different experiment platforms). See the figure 6 and related method section for more details.
the corrected/weighted peakXcell matrix is not integer counts but we transform it into integer (e.g., round) for the input of some state-of-art method (i.e., APEC)
not always increase cut site frequencies. I tend to understand it as a weight (refer to point1)

Tarela / SELMA

scATAC: how to start from fragment.tsv.gz & per-peak tn5 bias estimates #2