COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Question about salmon alevin with single-end data #769

Closed BW15061999 closed 2 years ago

BW15061999 commented 2 years ago

Hi, I ran the command salmon alevin -i index -p 4 -l SR --chromium --sketch -r 1.fastq.gz -o ./output with single-end data as input. Although it didn't produce an error, it didn't map anything. Can I use two single-end files from different samples as paired-end data to run salmon alevin?

Thank you !

and here is a part of the output

[2022-04-23 17:54:35.286] [jointLog] [info] done
[2022-04-23 17:54:37.347] [jointLog] [info] Index contained 127,042 targets
[2022-04-23 17:54:37.628] [jointLog] [info] Number of decoys : 0

[2022-04-23 17:54:38.715] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2022-04-23 17:54:38.715] [jointLog] [info] Counted 0 total reads in the equivalence classes 
[2022-04-23 17:54:38.715] [jointLog] [info] Selectively-aligned 0 total fragments out of 0
[2022-04-23 17:54:38.715] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2022-04-23 17:54:38.715] [jointLog] [info] finished sc_align()
[2022-04-23 17:54:39.453] [alevinLog] [info] sc-align successful.

rob-p commented 2 years ago

Hi @BW15061999,

I’m not aware of any tagged-end single-cell protocol that uses only 1 read. The most common data types place the UMI and barcode on one of the reads, while the other “biological” reads are drawn from the transcriptome. This is the case with the Chromium protocol. The reason you are seeing 0 assigned reads is that no barcodes can be extracted, because the second read is missing. Therefore, no reads can be assigned to any cell. What specific protocol are you using? Do you not have the full read pairs for each sample? Cc @k3yavi as the resident protocol guru.

Best, Rob
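To make the point about the missing barcode read concrete, here is a minimal Python sketch of how a barcode read is split into its two parts. It assumes the 10x Chromium Single Cell 3' v2 layout, where Read 1 is a 16-base cell barcode followed by a 10-base UMI; the function name and the example sequence are made up for illustration and are not part of salmon or alevin.

```python
# Assumed layout for 10x Chromium Single Cell 3' v2 (not taken from salmon's code):
# Read 1 = 16-base cell barcode (CB) followed by a 10-base UMI, 26 bases in total.
CB_LEN = 16
UMI_LEN = 10


def split_read1(seq):
    """Return (cell_barcode, umi) from a Read 1 sequence, or None if it is too short."""
    if len(seq) < CB_LEN + UMI_LEN:
        return None
    return seq[:CB_LEN], seq[CB_LEN:CB_LEN + UMI_LEN]


# Made-up 26-base Read 1 for illustration only.
read1 = "ACGTACGTACGTACGT" + "TTTGGGCCCA"
print(split_read1(read1))
```

Without a file of such reads there is nothing to split, so no fragment can be tied to a cell, which is consistent with the 0 mapped fragments in the log above.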

BW15061999 commented 2 years ago

Hi @rob-p ,

The data were downloaded from the SRA database, and splitting them with fastq-dump generates only one FASTQ file; the EBI database also shows only one FASTQ file per sample. I am not sure whether I processed the file correctly.

Here is part of the description of the file in the SRA database, along with the accession of one of the files:
SRR8453531

Instrument: Illumina HiSeq 3000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: cDNA
Layout: SINGLE
Construction protocol: The scRNA-seq libraries were generated using Chromium Single Cell 3' Library & Gel Bead Kit v2 (10X Genomic) according to manufacturer's protocol. Briefly, 10,000-15,000 live cells were FACS-sorted and used to generate single-cell gel-bead in emulsion (GEM). After reverse transcription, GEMs were disrupted. Barcoded cDNA was isolated and amplified by PCR (12 cycles). Following fragmentation, end repair, and A-tailing, sample indexes were added during index PCR (8 cycles). Indexed libraries were multiplexed and sequenced on Illumina HiSeq 3000 instruments according to the manufacturer's instructions (26 cycles of Read 1, 8 cycles of i7 Index, and 98 cycles of Read2).

Best
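One quick way to see which read a lone FASTQ file holds is to look at its read lengths. The sketch below assumes the cycle counts quoted in the description above (26 for Read 1, 98 for Read 2) and that the reads were not trimmed; the helper names are hypothetical, and the demo uses an in-memory record rather than a real download.

```python
import gzip
import io
from collections import Counter


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz files transparently."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def length_counts(lines, max_records=1000):
    """Count sequence lengths over the first max_records FASTQ records."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i // 4 >= max_records:
            break
        if i % 4 == 1:  # the sequence is the 2nd line of each 4-line record
            counts[len(line.strip())] += 1
    return counts


def guess_read(counts, read1_len=26, read2_len=98):
    """Guess which read the file holds from its most common read length."""
    if not counts:
        return "empty"
    most_common = counts.most_common(1)[0][0]
    if most_common == read1_len:
        return "looks like Read 1 (barcode + UMI)"
    if most_common == read2_len:
        return "looks like Read 2 (cDNA)"
    return "unrecognized length"


# In-memory stand-in for a FASTQ file with one 98-base record.
fake = io.StringIO("@r1\n" + "A" * 98 + "\n+\n" + "I" * 98 + "\n")
print(guess_read(length_counts(fake)))
```

For a real file, something like `guess_read(length_counts(open_fastq("1.fastq.gz")))` would apply the same check.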

k3yavi commented 2 years ago

Hi @BW15061999, yes, this is a known problem for single-cell data uploaded to NCBI. The idea is to download the BAM files of the data (yours should be here under data access section) and then use tools like these to generate paired-end FASTQ files from the BAM file before running alevin. The FASTQ file downloaded directly from NCBI/EBI doesn't have the CB/UMI component of the read pairs.

Hope it helps !

rob-p commented 2 years ago

@k3yavi beat me to it! It is, unfortunately, a recurring problem. The SRA file itself only contains one of the reads and is therefore essentially useless in analyzing the single-cell data. This is an ongoing problem that I've mentioned several times, but I don't know if the SRA has a plan in place to address it. The proper solution at this point is exactly as Avi suggests: download the BAM file (what the SRA calls the original TenX format data), and run it through 10x's bamtofastq to get back the original FASTQ files (this time paired-end) that you can process. Let us know if you have success with this.

Best, Rob
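The two steps suggested above could be scripted roughly as follows. This is only a sketch: every file and directory name is a placeholder, the location of the FASTQ files inside the bamtofastq output directory has to be checked by hand, and the alevin flags are the ones from the original command with -r replaced by -1/-2 and the library type changed to ISR, which is an assumption about what suits this v2 Chromium data.

```python
import shlex


def bamtofastq_cmd(bam, out_dir):
    """Command line to convert a 10x BAM file back to FASTQ files."""
    return ["bamtofastq", bam, out_dir]


def alevin_cmd(index, read1_files, read2_files, out_dir, threads=4):
    """Paired-end alevin command: -1 takes barcode/UMI reads, -2 the cDNA reads."""
    return [
        "salmon", "alevin",
        "-i", index,
        "-p", str(threads),
        "-l", "ISR",          # assumed library type for paired Chromium reads
        "--chromium",
        "--sketch",
        "-1", *read1_files,
        "-2", *read2_files,
        "-o", out_dir,
    ]


# Placeholder paths; adjust to the real BAM file and bamtofastq output layout.
steps = [
    bamtofastq_cmd("sample.bam", "fastq_out"),
    alevin_cmd("index", ["fastq_out/R1.fastq.gz"], ["fastq_out/R2.fastq.gz"], "./output"),
]
for cmd in steps:
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Building the commands as argument lists keeps file names with spaces safe and makes it easy to inspect what will run before anything is executed.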