alexdobin / STAR

RNA-seq aligner
MIT License
1.84k stars 505 forks source link

small RNA mapping settings #891

Open brobr opened 4 years ago

brobr commented 4 years ago

Hi, I have been using STAR for a while now but still cannot find the right settings that give a satisfactory mapping speed and efficient mapping. I have short RNA sequencing reads (sRNAs, 22-24nt) to map to a fungal genome (~19.4 MB) with high GC (~47%) and high frequency of introns (5-7/gene) which are between 30-70 bp (~55 in most cases).

At the moment I thought I had OK settings but checking the Unmapped.out.mate1 there are still a lot of fine reads that should have mapped. Many of these are multimappers (17x for some) but also a uniq mapper (in the small sample I checked)

As I have run these samples so often I generated a file with all the splice-sites mapped and use one-pass with that file as reference. The index in the genomeDir has been generated with SAdx=11 (--sjdbGTFfile the published reference GTF and --sjdbOverhang 27). These are the runmode settings:

STAR \ --runThreadN 4 \ --genomeDir "$REF_DIR" \ --readFilesIn "${SRC_DIR}/$SEQ}.fastq" \ --sjdbFileChrStartEnd "${SJDB}" \ --sjdbGTFfile "$CRYPr_GTF" \ --outReadsUnmapped Fastx \ --outMultimapperOrder Random \ --outSAMtype BAM Unsorted \ --outFilterMismatchNmax 0 \ --outFilterMatchNmin 12 \ --outFilterScoreMinOverLread 0 \ --outFilterMatchNminOverLread 0 \ --outSJfilterOverhangMin 6 4 4 4 \ --outFilterMultimapNmax 50 \ --seedSearchStartLmax 12 \ --alignIntronMin 20 \ --alignIntronMax 200 \ --alignSJoverhangMin 3 \ --alignSJDBoverhangMin 2 \ --alignEndsType EndToEnd;

Mapping takes now about 6-7 min per sample which is great, when I had before settings that took an hour or more to complete.

# --seedSearchStartLmax 10 and # --outFilterType BySJout \ # --alignIntronMax 1000000 \ # --seedMultimapNmax 10000 \ # --winAnchorMultimapNmax 1000 \ # --alignEndsType Extend5pOfRead1

But now I lose perfect mappers which is not so great (~50% end up in Unmapped.out.mate1). I wonder whether to run STAR twice once with looking for introns and on the remainder without (how is that set up easily?? is there one setting that skips searching for introns??). But I think that defeats the purpose and complicates the workflow quite a bit. And would it help retrieve those lost mappers?? Do I need to use another mapper?? But I ended up using STAR because it finds introns and I have great examples where reads are mapped across introns showing that mature transcripts are targets of the anti-sense sRNAs sequenced, awesome evidence!

Thanks for the program, I only would love to get it going optimally; there is so much to get through... Looking forward to any advice...

BTW as mentioned elsewhere; please do keep wet-scientists as a users group in mind. They love pictures to understand stuff, at least I do. The number of possibilities and settings are quite baffling and a good image that links settings to what is done will be extremely helpful. Not all use cases adhere to the default settings or aim to map human RNA. There is a lot of improvement possible at the bioinformaticians' site when it comes down to communicating what particular settings or functions do and what they are there for. Sorry, but the irritation comes a) from the time that can be wasted on getting it right and b) the insecurity about the outcome combined with pressure to deliver stuff for publication... Note this comment is not STAR-specific but that's where most of my time has gone into now. I even maintain the software for slackware linux, but that doesn't help becoming an expert in its usage...obviously...

alexdobin commented 4 years ago

Hi @brobr

what was your reasoning to use more exhaustive search parameters? I am not sure why it reduced the number of mapped reads if you send me a few perfect matches that are not mapped, I can look into it.

Mapping short reads exhaustively (i.e. finding all alignments with a given number of mismatches) is a very hard task. STAR (and no other aligners, as far as I know) is not designed to do that. STAR should be able to find perfectly matched alignments, including those splices to annotated junctions. Using the more aggressive search options might find some of the alignments with mismatches, but it's not guaranteed. I think you can try bwa (original, not -mem) to do that, it might do a better job.

Cheers Alex

brobr commented 4 years ago

Hi Alex, thanks for the prompt response.I have to apologize; I cannot reproduce the list of non-mapped reads (it lacked the quality scores and you might needed those). When redoing head -n 100 on the files used before the unmapped reads that were now listed had been rejected for a good reason (mismatch somewhere or not found at all).

With respect to the settings: The default settings (used first) left a lot of unmapped reads behind. But this would possibly be caused by having a library of such short reads.

So I tried things, some of which inspired on some earlier posts where people ran into missing short reads of which many could be multimappers (the advice was upping the --seedMultimapNmax).

The other day, when trying the Jellyfish kmer analyzer (used to estimate genome size; or in my case, reversely, find the minimal length for a k-mer that does not map too often) I wondered whether the seed-length (as I understood the --seedSearchStartLmax to mean) of 10 (set at that size to be able to enhance the chance of finding reads with a mismatch in the middle) could have been too low; each 10-mer can find ~11 positions in the genome. But that did not seem to have caused the problem (see below).

Allowing for 1 mismatch and adding: --seedSearchLmax 28 --seedSplitMin 9 reduced the unmapped reads to about 10% (where I have had ~30% before)

Some testing indicates that the --seedSearchStartLmax 10 remains better unset, the test-file took longer (~13 min vs. ~8.5 min without) and scored slightly less mappings (so this setting regulates the skipping of the set no of nt before mapping starts?).

The default --alignIntronMax 1000000 is the biggest culprit for making mapping to take so long earlier (and giving weird mapped reads with enormous introns as seen in a bam-viewer like Tablet). Leaving it at that (by taking out --alignIntronMax 600) leads to a processing time of ~35 min! Slightly less un-mapped reads though, but these could be the ones seen before that made me worry about the obtained alignments in the first place).

Below an genome browser illustration of two genes of a transposon with different introns as officially annotated (black line) and as corrected (indicated by green lines) based on published RNAseq data (orange-yellow at top and bottom), which also suggest some intron skipping. As you can see the mapped sRNA reads (blue and green) do not map over intronic regions of their target on the opposite strand (see for example the intron with the thin line on the right).

Screenshot_2020-04-24_19-20-17

Also the big intron in the gene encoded on the upper strand -which is not always spliced out- appears to be supported by split sRNA-reads (see the below screen shot made of Tablet).

Screenshot_2020-04-24_20-34-27

This is so cool; it means that the transcripts are spliced before they get 'seen' by the RNAi machinery.

Eventually I decided to settle on this 'recipe' (starting with samples from which reads mapping sense to the rRNA have been removed with bowtie2, just to speed up HTSeq counting downstream):

STAR \ --runThreadN 4 \ --genomeDir "$REF_DIR" \ --readFilesCommand gunzip -c \ --readFilesIn "${SRC_DIR}${input}.fastq.gz" \ --sjdbFileChrStartEnd "${SJDB}" \ --sjdbGTFfile "$CRYPr_GTF" \ --outFilterType BySJout \ --outReadsUnmapped Fastx \ --outMultimapperOrder Random \ --outSAMtype BAM Unsorted \ --outFilterMismatchNmax 1 \ --outFilterMatchNmin 12 \ --outFilterScoreMinOverLread 0 \ --outFilterMatchNminOverLread 0 \ --outSJfilterOverhangMin 6 4 4 4 \ --outFilterMultimapNmax 50 \ --seedSearchLmax 28 \ --seedSplitMin 9 \ --alignIntronMin 20 \ --alignIntronMax 450 \ --alignSJoverhangMin 3 \ --alignSJDBoverhangMin 2 \ --alignEndsType EndToEnd;

Hopefully this combination of settings makes sense, or should some better be left out/changed?

Cheers, Rob

brobr commented 4 years ago

Hi Alex, just to mention a change in the above set of parameters; two settings were added, --outSAMattributes NH HI AS nM MD (using the MD:Z output to catch the place of a mismatch in a mapped read) and --outSAMmultNmax 1, in order to only account once for a randomly placed multimapping read.

Combining --outSAMmultNmax 1 with --outMultimapperOrder Random resulted in a better representation in the bedgraph (made on a samtools coordinate-sorted bamfile): some non-uniq peaks became (far) less prominent (and the bam-file with mapped reads has become smaller (less repeated lines)).

The mapping speed with these settings was slightly faster (312.36 million reads/hr vs 309.59).

Rob

alexdobin commented 4 years ago

Hi Rob,

great work on parameter optimization, and interesting results!

I think your parameter choices make a lot of sense. Reducing --alignIntronMax is definitely a good idea, both from the point of view of the small fungal genome (no space for large introns), and short reads (easier to place short splice overhangs close to each other).

It's a bit surprising that reducing --seedSearchLmax did not improve mappability, was the total number of mapped reads reduced (unique+multiple), or just unique? This parameter defines the start positions of the seed search - the seed length is not fixed, rather the exactly mapped length is maximized. With 28, the ~24nt reads are only going to be searched from both ends. With 10, there would be one extra search start point in the middle of the read - which should result in more mappable reads with mismatches and novel splices. However, the no-mismatch and annotated spliced alignments are not affected, and since they constitute most of the alignments, it would not affect the results significantly.

Cheers Alex