handling a scATAC custom reference genome from 10x

aryarm commented 2 years ago

scATAC-seq BAM files from CellRanger can be used with VarCA, but there is an important caveat that one should be aware of:

tldr; if you give VarCA BAM files, you must provide the exact same reference genome as was used to align reads for the BAM!

The reads in the BAM files from CellRanger are often aligned against a custom reference genome. This can cause issues if you try to use a standard (non-custom) reference genome with VarCA, especially if any of the chromosomes are named differently between the custom reference and the standard one. In the past, I have found that the non-canonical chromosomes are usually different.

Option 1

One option is to discard reads belonging to non-canonical chromosomes (since these are usually less than 1% of the total number of reads) and then alter the header of the BAM to match the contigs in the reference:

# 1) extract only reads belonging to chromosomes 1-22 and chromosome X, Y, and M; note that this does not change the header
# 2) convert the BAM to SAM
# 3) remove all SQ tags in the header; these list the contigs in the reference
# 4) replace contig names in the header with those from the reference genome (using the
#    approach in https://www.biostars.org/p/289770/#289782)
#    We use "-f2 -F4" to drop any pairs where a single mate mapped to a non-canonical contig,
#    and we redirect stderr to /dev/null to ignore warnings about such reads.
#    We also drop any secondary alignments "-F256"
samtools view --no-PG -bh old.bam chr{1..22} chr{X,Y,M} | \
samtools view --no-PG -h | \
grep -Ev '^@SQ' | \
samtools view --no-PG -f2 -F4 -F256 -bhT reference.fa - 2>/dev/null >new.bam
# Note that we take a slightly different approach from the one described in
# https://www.biostars.org/p/289770/#289782 by using grep to remove the @SQ tags
# instead of nuking the header altogether, allowing us to retain the @RG and other tags
# Lastly, we must index the resulting BAM file for VarCA
samtools index new.bam

You should try this strategy only after verifying that the canonical chromosomes in your standard reference are the same as those in the custom reference. You can use samtools idxstats for this.

Option 2

Another option is to provide VarCA with the custom reference, if you have it. I haven't tested this but theoretically it should work.

Option 3

Finally, the last option is to provide VarCA with FASTQs (if you have them) rather than BAMs. VarCA will then redo the alignment step and create its own BAM files. I hesitate to recommend this option because I haven't tried it, and I doubt that the alignment performed by VarCA will be as robust to issues with scATAC. The alignment step in VarCA was initially designed with bulk ATAC in mind. However, for some users this might be the simplest and easiest strategy, especially if you're willing to perform alignment twice.

xiachenrui commented 2 years ago

Hi Aryarm, Thanks a lot for yout suggestions! To be honest, I prefer Option3 because it seems to be the easiest one. I notice you pointed

I doubt that the alignment performed by VarCA will be as robust to issues with scATAC.

Can you give me more suggestion about this?

Xia Chen-Rui

aryarm commented 2 years ago

Hi @xiachenrui ,

Have you tried option 2 first? Option 2 will be easier than option 3 if you're able to use it. If option 2 doesn't work, can you explain why? (I'm just curious, myself.)

Can you give me more suggestion about this?

I was just trying to say that the alignment parameters used by 10x may be customized to work better on single cell data than those that VarCA uses. That's just a guess on my part. To be honest, I'm not really familiar with how cellranger-atac count handles alignment internally.

aryarm / varCA