caleblareau / mgatk

mgatk: mitochondrial genome analysis toolkit

http://caleblareau.github.io/mgatk

MIT License

Filtering variants #26

Closed PasLukas closed 3 years ago

PasLukas commented 3 years ago

Hi there. Working with my own scATAC-seq data from cells pre-sorted by FACS, I have never been able to identify mitochondrial variants that meet the thresholds (log10(VMR) > -2 & strand > 0.65) used to filter the called variants, e.g. with the IdentifyVariants function in Signac (1.0.0) or in the example workflow for mtscATAC-seq data. Are these thresholds only suitable for mtscATAC-seq data, or can they also be applied to basic scATAC-seq data? If not, do you have any recommendations for which thresholds to use with basic scATAC-seq data?

Thank you and greetings.

caleblareau commented 3 years ago

I specifically avoid using the regular 10x scATAC data because the coverage is almost always too bad to work with... So I don't have any immediate recommendations...

Did you try filtering for cells with a mean coverage of at least 10x? I think there would be better sensitivity with fewer cells, as long as those cells have decent coverage.
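For readers who want to try the coverage filter suggested above, here is a minimal sketch in plain Python. It assumes you already have per-base mitochondrial depth for each cell barcode; the function name `cells_passing_coverage`, the dictionary layout, and the toy barcodes are hypothetical and are not part of mgatk or Signac.

```python
from statistics import mean


def cells_passing_coverage(depth_by_cell, min_mean_depth=10.0):
    """Return barcodes whose mean per-base mtDNA depth is >= min_mean_depth.

    depth_by_cell maps a cell barcode to a list of per-position depths
    (hypothetical layout, chosen only for this illustration).
    """
    return [
        barcode
        for barcode, depths in depth_by_cell.items()
        if depths and mean(depths) >= min_mean_depth
    ]


# Toy data: three cells, four mitochondrial positions each.
depth_by_cell = {
    "AAAC": [12, 15, 9, 14],   # mean 12.5, kept
    "AAAG": [2, 0, 1, 3],      # mean 1.5, dropped
    "AAAT": [10, 10, 10, 10],  # mean 10.0, kept (threshold is inclusive)
}

kept = cells_passing_coverage(depth_by_cell)
print(kept)
```

Variant calling would then be run only on the barcodes in `kept`. Whether enough cells survive this filter depends on the data, as the replies below discuss.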

mathosi commented 3 years ago

Hi Caleb,

Even with a high sequencing saturation (>15,000 fragments per cell) and the custom hard-masked reference genome for CellRanger-ATAC, fewer than 1% of the cells reach a coverage of at least 10x for the mitochondrial genome. Changing this coverage threshold, either to include more cells or to keep only cells with high mt-genome coverage, did not improve the detection of high-confidence variants. Do you have further suggestions on how to enable lineage tracing with regular 10x scATAC data, or is this currently only possible with mtscATAC-seq?

Best, Malte

caleblareau commented 3 years ago

The standard 10x scATAC-seq protocol is in principle a "single nucleus" prep. So by definition, there isn't going to be a very high abundance of mitochondria. In fact, whatever reads one observes aligned to the mitochondrial genome may also be noise, depending on the biological input. Further, without cellular fixation, we saw that there is significant "crosstalk" of mitochondria between nuclei (Fig 1 of mtscATAC-seq paper). For all of these reasons, I'm very skeptical that one could get anything sensible from the prescribed 10x scATAC-seq experimental input.

mathosi commented 3 years ago

Thanks for your reply. In the first publication on the topic (Ludwig et al. 2019), it was suggested that the workflow can also be used with single-cell sequencing protocols other than mtscATAC-seq, such as SMART-seq2 or scATAC-seq with the C1 Fluidigm platform. Would you recommend working with these data to infer the clonal architecture, or will the missing information on strand concordance and a possibly low coverage of the mitochondrial genome prevent the determination of high-confidence variants?

caleblareau commented 3 years ago

I would of course recommend using mtscATAC-seq data if you can :)

We showed in the Supplement of Figure 3 of the mtscATAC-seq paper that the variant filtering approach works with Smart-seq2 data. I didn't try the C1 explicitly, but it should work.

In principle, we optimized the variant filtering strategy that is being discussed based on strand concordance and VMR to specifically utilize properties of mtscATAC-seq (i.e. uniform coverage, high cell number). We employed other filtering strategies in the 2019 paper, and of course, there are straightforward applications of variant callers like FreeBayes to data from other single cell methods. Those may work best depending on exactly your data input.
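To make the strand concordance and VMR filter discussed in this thread concrete, here is a rough sketch in plain Python. It rests on assumptions that are not confirmed anywhere in the thread: VMR is taken as the population variance of per-cell allele frequency divided by its mean, and strand concordance as the Pearson correlation between forward-strand and reverse-strand alternate-allele counts across cells. All function names are hypothetical, and the actual mgatk and Signac implementations may differ in details. Only the thresholds (log10(VMR) > -2, strand > 0.65) come from the original question.

```python
import math
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def variant_stats(fwd_alt, rev_alt, depth):
    """Return (vmr, strand_concordance) for one variant across cells.

    Assumed definitions (see the paragraph above), not a library API.
    """
    # Per-cell allele frequency, skipping cells with no coverage.
    af = [(f + r) / d for f, r, d in zip(fwd_alt, rev_alt, depth) if d > 0]
    if not af:
        return 0.0, 0.0
    m = mean(af)
    vmr = pvariance(af) / m if m > 0 else 0.0
    return vmr, pearson(fwd_alt, rev_alt)


def passes_filter(vmr, strand, min_log10_vmr=-2.0, min_strand=0.65):
    """Apply the thresholds quoted in this thread."""
    return vmr > 0 and math.log10(vmr) > min_log10_vmr and strand > min_strand


# Toy variant present at high frequency in two of six cells, on both strands.
clonal = variant_stats(
    fwd_alt=[10, 0, 0, 9, 0, 0],
    rev_alt=[11, 0, 0, 10, 0, 0],
    depth=[22, 20, 21, 20, 19, 20],
)

# Toy artifact: alternate allele seen on the forward strand only.
biased = variant_stats(
    fwd_alt=[5, 0, 6, 0, 5, 0],
    rev_alt=[0, 0, 0, 0, 0, 0],
    depth=[20, 20, 20, 20, 20, 20],
)

print(passes_filter(*clonal), passes_filter(*biased))
```

The toy numbers are made up purely to show the two failure and success modes. With sparse coverage, as in the standard scATAC-seq data discussed above, both statistics become unstable, which is consistent with few or no variants passing.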