alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

diff in Reads Mapped to Genome and Gene #1568

Open aswathysheena opened 2 years ago

aswathysheena commented 2 years ago

Hi ,

I have built a exclusive endogenous viral reference and generated STAR index(similar to custom reference in cell ranger, each fasta sequence has one entry/annotation in GTF ie; each fasta entry has annotated for each viral strain than gene) eg: one fasta entry corresponds to ERVK-10 in GTF

The idea is to get count matrix for reads mapping to any viral genome.

Then ran STARsolo as follows:

STAR --runThreadN 19 --soloType Droplet \ --soloCBwhitelist 737K-august-2016.txt --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM \ --outSAMtype BAM SortedByCoordinate --genomeDir reference/ --outFileNamePrefix {sample} --readFilesCommand zcat \ outFilterMultimapNmax 500 winAnchorMultimapNmax 500 --soloMultiMappers EM \ --readFilesIn {','.join(files2)} {','.join(files1)}

Considering the sequence homology, I have used --soloMultiMappers EM option to get Multi-mapped reads counted.

My query is reg. the metrics summary.csv.

Reads Mapped to Genome: Unique+Multiple,0.0122193 Reads Mapped to Genome: Unique,0.00873406 Reads Mapped to Gene: Unique+Multiple Gene,0.00548684 Reads Mapped to Gene: Unique Gene,0.00441223

In this scenario, what is it difference b/w "Mapped to Genome and Mapped to Gene"

Any details on this topic would helpful,

Thanks, Aswathy

alexdobin commented 2 years ago

Hi Aswathy,

in the BAM file, please check the reads that mapped to the genome but do not have GX/GN tags. One possibility is that these reads are spliced. For viral reads it probably makes sense to disallow splicing.

Also, your parameters outFilterMultimapNmax 500 winAnchorMultimapNmax 500 are missing -- .

Cheers Alex

aswathysheena commented 2 years ago

Thanks Alex.

The total reads mapped (identified using samtools -f 4 -c ) is 496607, while greping empty GX/GN is giving 349504.

How can I disallow splicing while running STARsolo. ? is it --outSJfilterReads .?

alexdobin commented 2 years ago

Hi Aswathy

you can prohibit splicing with --alignIntronMax 1. However, I just realized that there is another reason why the difference you observe: "Reads Mapped to Genome" counts all reads regardless their Cell Barcode or UMI, but "Reads Mapped to Gene" counts only reads that have valid CB/UMI, so it's always a smaller number.

Cheers Alex

aswathysheena commented 2 years ago

Thank you Alex.

The CB/UB are error corrected. Sorry I read the README again.

But can you please tell me what does GX/GN as "-" means? I am getting a handful of same entries with mapped , valid barcode (from whitelist)?what does it indicates?

aswathysheena commented 2 years ago

I did read the Multimapping BAM tag?[ #1280]

I am using version 2.7.10a_alpha_220506. can you help me to understand the "-" GX/GN tags in the bam represents multimapped reads?

alexdobin commented 2 years ago

Hi Aswathy

"-" in GX/GN means that either the read does not map to a gene, or it maps to multiple genes.