COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
747 stars 159 forks source link

Applying Salmon Alevin to other single-cell modalities #611

Open AnnaAMonaco opened 3 years ago

AnnaAMonaco commented 3 years ago

Hello! This might be an unusual request, but I have been attempting to use Salmon Alevin for mapping of scATAC-seq data (10X Genomics platform) and have a few questions on how to best do this (/whether it is possible at all!).

A bit of background: I have two sequenced libraries of the same cross-species hybrid: scRNA-seq 10X v3 and scATAC-seq 10X (alas, not from the same cells). My aim is to use Salmon Alevin on both, using a home-made diploid reference so that down the line I can do allele-specific analysis on both datasets and then compare allelic imbalance.

What I have done: I generated a diploid reference "feature set" for the scATAC-seq, to mimic in a way how the reference transcriptome is interpreted. I did this by using bulk ATAC data from both species, calling peaks with MACS2, and then using this peak set (BED format) to extract strain- and species-corrected sequences (fasta) to use as a reference.

My questions/issues: Considering the basic Alevin invocation: salmon alevin -l ISR -1 cb.fastq.gz -2 reads.fastq.gz --chromiumV3 -i salmon_index_directory [--whitelist barcodes.tsv] -p 10 -o alevin_output

-1 cb.fastq.gz —> these should be CB+UMI reads, but in 10X ATAC there are no UMIs, so this would only be a 16bp CB. I'm afraid this would be a major issue considering how Alevin relies on UMI whitelisting. Would it help if I already provide a list of filtered whitelisted CB, as provided by Cell Ranger? I have this from a quality control alignment to only one parental species.

--chromiumV3—> If I understand correctly, this tells the program to search for the first X bases of cb.fastq.gz as CB, and the first Y bases of it as UMI (as per this thread). So technically one could swap --chromiumV3 for --end 5 --barcodeLength X --umiLength Y and set this manually? But again, what if I only have CBs?

-2 reads.fastq.gz —> I understand this is meant for 10X scRNA-seq - which only has insert information on read2, but in 10X scATAC both insert read1 and read2 contain information on open chromatin. I reckon this is the major head scratcher for me, whether it is possible to use Alevin on scATAC at all. Should I merge the fastq read files into a single “reads.fastq.gz”? Or should I treat them as independent (but with the same CBs) and run something along the lines of: salmon alevin -l ISR -1 cb.fastq.gz cb.fastq.gz -2 read1.fastq.gz read2.fastq.gz (…etc)? Or maybe there is a way I am not aware of that Alevin can handle paired-end data?

Thank you in advance and I hope I have been detailed and clear enough. Please ask for more details where needed :)

rob-p commented 3 years ago

Tagging @k3yavi on this one!

k3yavi commented 3 years ago

Hi @AnnaAMonaco ,

Thanks for reaching out and I agree it'd be super useful to have alevin working for both scRNA-seq and scATAC-seq multiome datasets. In short I'd say the framework is not ready yet and there are multiple challenges which we are still working-on to find the right solution. The Central issue is that the technologies to profile open-chromatin regions expects the read to align majorly to non-coding regions and salmon/alevin framework is designed to work (generally) with transcriptomic data. Having said that, one can potentially index the full genome using salmon indexing but we have not yet extensively validated the genomic alignment generated from alevin framework. Once settled, we can certainly figure out ways to run alevin without UMI, that's the easier part.

What do I do now ? Basically since the scRNA-seq and scATAC-seq are two different library preps (along with the fastq), I'd still recommend using alevin for scRNA-seq, however, one might have to run other tools (like bwa-mem) to align scATAC-seq data. The are multiple reasons to recommend that, the significant power of alevin comes in with (1) multi-mapping reads but we generally expect low number of such reads with ATAC-seq data (2) UMI deduplication which is absent in the ATAC-seq data and the deduplication happens based on the aligned position.

Again, I agree it's great to have a uniform workflow for the multiome data but we are thinking about the challenges in designing such workflow and how solve them. We'd let you know once we have a vignette / tutorial.

-- Avi

rob-p commented 3 years ago

Hi @k3yavi and @AnnaAMonaco,

There was some workflow I recall reading that extracted intervals from the genome (bins) and then essentially treated these as transcripts. It seemed like it might work reasonably well. Perhaps one can think of building a workflow around that approach. I’ll see if we can gather / think through more details on that front.

AnnaAMonaco commented 3 years ago

Hi @k3yavi,

Thank you for your reply!

The Central issue is that the technologies to profile open-chromatin regions expects the read to align majorly to non-coding regions and salmon/alevin framework is designed to work (generally) with transcriptomic data. Having said that, one can potentially index the full genome using salmon indexing but we have not yet extensively validated the genomic alignment generated from alevin framework.

Yes, this was the first issue I encountered and tried to work around. My attempt to a solution was - as both you and @rob-p suggested - to find a way of binning the reference so it would be treated similarly to transcripts. The underlying thought was that if for scRNA-seq the reference transcriptome acts as a collection of features that the reads are "compared" to, the equivalent for scATAC-seq would be known peaks of open chromatin in the given organism/developmental stage/tissue/etc. But please correct me if I'm wrong in this assumption, I am a wet lab-trained biologist trying to understand the analysis tools to my best capability.

So at the moment I have a "reference peak set" for my scATAC-seq data. I generated this by combining MACS2-called peaks from bulk ATAC-seq I have for the same developmental time-point of both parental and hybrid lines (as I mentioned, I'm working with cross-species hybrids). I used the coordinates from this mergedPeaks.bed to extract a mergedPeaks.fa from the whole referenceGenome.fa, does this sound reasonable?

-Anna

k3yavi commented 3 years ago

Hi @AnnaAMonaco ,

Thanks for clarification, I think it does makes sense to use a "reference peak set" from bulk-ATAC-seq and quantify against it. I myself am planning to use encode defined cCREs https://screen.encodeproject.org/ to do something similar on human samples. The only issue I can think of with this approach is that you can potentially lose quality checks on full genome coverage data which is quite frequent in scATAC-seq data for example Nucleosome banding pattern, TSS enrichment etc. In case you are using Seurat/Signac most of the visualization uses fragment file on Genomic coordinates which can also be effected.

BiotechPedro commented 1 year ago

Hi super-team !

We're planning to move some of our analyses to alevin-fry, specially after realising about the feature barcoding preprocessing option.

However, we would like to know how is it going with the ATAC-seq (or even multiome) processing... can you tell us something? :D

Thank you!!

Pedro