cellgeni / STARsolo

wrapper scripts for convenient STARsolo processing of 10X and other scRNA-seq
GNU General Public License v3.0

Reads spanning multiple genes in bam file #8

Closed ireferraris closed 6 months ago

ireferraris commented 7 months ago

Hi,

I have a question about 3' UTR RNA-seq samples, specifically Alithea technology: in the coordinate-sorted BAMs that STAR produces for our pools, about 40% of reads have portions that map to different genes far apart. The effect is smaller in human samples, but the number of such reads is very high in plant samples (bean and chickpea).

The command I used is:

STAR --runMode alignReads --outSAMmapqUnique 60 --runThreadN 16 --outSAMunmapped Within --limitBAMsortRAM 400274367879 --soloStrand Forward --quantMode GeneCounts --outBAMsortingThreadN 16 --genomeDir ../new_ref_genome --soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 14 --soloUMIstart 15 --soloUMIlen 14 --soloUMIdedup NoDedup 1MM_All --soloCellFilter None --soloCBwhitelist barcode.txt --soloBarcodeReadLength 0 --soloFeatures Gene --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM --outFilterMultimapNmax 1 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix STAR --readFilesIn R2_001.fastq.gz R1_001.fastq.gz

However, I verified that performing adapter trimming before mapping reduces the number of such "spanning" reads, even though this step is generally considered unnecessary and the manual advises skipping adapter trimming.

Thanks in advance, Irene

apredeus commented 7 months ago

Hi @ireferraris - I can't be sure without looking at the actual data, and I'm also not sure what "40% reads with portions that map to different genes far apart" means. Do you mean splicing? If so, are those splice junctions annotated in your GTF? Or do you mean multi-mapping?

What I would do is debug your issue separately from STARsolo, using plain STAR (in bulk mode) with different settings that are relevant to splicing and multi-mapping. Plant genomes are often repetitive, and your experiment might benefit from carefully considering mapping options. However, to me, the trimming part sounds very suspicious, because 3' single-end 10x reads should not have any adapters in the biological read.

STAR has many very useful statistics in its output, and you should look at those carefully. Pay particular attention to 1) the fraction of canonical and non-canonical, as well as known and novel, splice junctions; 2) the fraction of multimappers; 3) the overall mapping rate; 4) the mapped read length (are many bases getting soft-clipped?); 5) how many of the mapped reads are assigned to genes (STAR can count those for you). Also take a look at the strand-specificity; there might be an issue there.
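As a rough sketch of how those numbers could be pulled out of STAR's output, the Python below parses the text of a `Log.final.out` file and a `ReadsPerGene.out.tab` file (the latter is written when `--quantMode GeneCounts` is set). The function names are hypothetical helpers, not part of STAR, and the key strings assume the usual `Log.final.out` labels; check them against your own file, since they may differ between STAR versions.

```python
def parse_star_log(text):
    """Turn the 'label | value' lines of a Log.final.out file into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def as_number(value):
    """Convert a value such as '80.00%' or '1000' to a float."""
    return float(value.rstrip("%"))


def splice_summary(stats):
    """Fractions of annotated and non-canonical splices among all splices.

    Assumes the labels 'Number of splices: Total', 'Number of splices:
    Annotated (sjdb)' and 'Number of splices: Non-canonical' are present.
    """
    total = as_number(stats["Number of splices: Total"])
    if total == 0:
        return {"annotated_fraction": 0.0, "noncanonical_fraction": 0.0}
    annotated = as_number(stats["Number of splices: Annotated (sjdb)"])
    noncanonical = as_number(stats["Number of splices: Non-canonical"])
    return {
        "annotated_fraction": annotated / total,
        "noncanonical_fraction": noncanonical / total,
    }


def mapped_length_fraction(stats):
    """Average mapped length divided by average input read length.

    A value well below 1 hints that many bases are being soft-clipped.
    """
    mapped = as_number(stats["Average mapped length"])
    read_len = as_number(stats["Average input read length"])
    return mapped / read_len


def strand_counts(text):
    """Sum gene-level counts per column of a ReadsPerGene.out.tab file.

    Columns 2-4 hold counts for unstranded data, for reads on the same
    strand as the gene, and for reads on the opposite strand. The four
    leading N_* summary rows are skipped.
    """
    totals = [0, 0, 0]
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        for i in range(3):
            totals[i] += int(fields[i + 1])
    return {
        "unstranded": totals[0],
        "read_strand": totals[1],
        "reverse_strand": totals[2],
    }
```

With `--outFileNamePrefix STAR`, as in the command above, the two files should be named `STARLog.final.out` and `STARReadsPerGene.out.tab`; read each with `open(path).read()` and pass the text to these functions. If `read_strand` and `reverse_strand` come out roughly equal, or the smaller of the two is the one you expected to be large, the strandedness setting deserves a second look.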

Good luck