alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 501 forks source link

How can I align small RNA fragments (not miRNA)? #1271

Open ssun1116 opened 3 years ago

ssun1116 commented 3 years ago

Dear Alex,

Hello, Alex. Thank you for developing such a great tool. I've just started learning about bioinformatics, and I'm really interested using your tool to analyze various RNA-seq data.

Recently, I got a unusual RNA-seq data - and I'm been struggling to analyze this data for more than two weeks. I would like to ask you about the STAR alignment options that I can apply to my dataset.

It's RNA-seq data fragmented by restriction enzyme mazF which cuts RNA at a particular sequence (ACA). The library was constructed with SMARTER smRNA-Seq Kit, so after trimming the adapter sequences, the read length varies from 15 bp to 98 bp - depending on the mazF activity.

I found an advice from you about aligning miRNA data - https://groups.google.com/g/rna-star/c/RBWvAGFooMU

--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
(>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

But I was wondering if I can apply these options to align this fragmented reads with variable length, because it seems like miRNA is little bit dfferent from enzyme-fragmented RNA. Could you please check if it is okay to use these options for my dataset? And, is it okay to generate the genome index by setting the sjdbOverhang value to 100 ?

Thanks for reading this post, and wish you have a good day :)

Sincerly, Seoyeon.

alexdobin commented 3 years ago

Hi Seoyeon,

if you are trimming the adapters, and there are no "unmatchable" bases in the reads, and so you do not need to use the small RNA parameters. The default --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 parameters are scaled to read length, so they should work fine to variable read length. --alignIntronMax 1 will prohibit splicing, and you probably do not want to that.

Tuning --outFilterMismatchNoverLmax 0.05 is a good idea, as the default value (0.3) is probably too big for most purposes.

Cheers Alex

ssun1116 commented 3 years ago

Dear Alex,

Thank you so much for your advice. But Alex, unfortunately, when I applied those settings to my data, too much reads are flitered out because of the short length. About 70 % of reads were unmapped because they were "too short".

How can I tweak them to be uniquely mapped? May I should change the genome index?

Sincerly, Seoyeon

alexdobin commented 3 years ago

Hi Seoyeon,

could you please post the Log.final.out statistics, for both relaxed filtering (your original) and stringent settings (my suggestion)?

Cheers Alex

ssun1116 commented 3 years ago

Dear Alex,

I really appreciate you for your help. Here are the two log files.

I applied this parameters

outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 --outSAMtype BAM Unsorted --outSAMunmapped Within \
--quantMode TranscriptomeSAM GeneCounts --twopassMode Basic

And this is the log.final.out file.

                             Started job on |    Jun 19 16:19:09
                             Started mapping on | Jun 19 16:26:42
                                    Finished on | Jun 19 16:32:47
       Mapping speed, Million of reads per hour | 362.78

                          Number of input reads | 36782164
                      Average input read length | 102
                                    UNIQUE READS:
                   Uniquely mapped reads number | 29491888
                        Uniquely mapped reads % | 80.18%
                          Average mapped length | 63.63
                       Number of splices: Total | 6266200
            Number of splices: Annotated (sjdb) | 6234417
                       Number of splices: GT/AG | 6178729
                       Number of splices: GC/AG | 50556
                       Number of splices: AT/AC | 6175
               Number of splices: Non-canonical | 30740
                      Mismatch rate per base, % | 0.46%
                         Deletion rate per base | 0.01%
                        Deletion average length | 1.58
                        Insertion rate per base | 0.00%
                       Insertion average length | 1.36
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 6845121
             % of reads mapped to multiple loci | 18.61%
        Number of reads mapped to too many loci | 340523
             % of reads mapped to too many loci | 0.93%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches | 66625
       % of reads unmapped: too many mismatches | 0.18%
            Number of reads unmapped: too short | 35607
                 % of reads unmapped: too short | 0.10%
                Number of reads unmapped: other | 2400
                     % of reads unmapped: other | 0.01%
                                  CHIMERIC READS:
                       Number of chimeric reads | 0
                            % of chimeric reads | 0.00%

I applied this code.

--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0.66 \
--outFilterMatchNminOverLread 0.66 --outSAMtype BAM Unsorted --outSAMunmapped Within \
--quantMode TranscriptomeSAM GeneCounts --runThreadN 8 --twopassMode Basic

And this is the log.final.out file

                             Started job on |   Jun 18 10:28:38
                             Started mapping on |   Jun 18 10:36:00
                                    Finished on |   Jun 18 10:41:59
       Mapping speed, Million of reads per hour |   368.85

                          Number of input reads |   36782164
                      Average input read length |   102
                                    UNIQUE READS:
                   Uniquely mapped reads number |   9774580
                        Uniquely mapped reads % |   26.57%
                          Average mapped length |   74.30
                       Number of splices: Total |   2597062
            Number of splices: Annotated (sjdb) |   2590049
                       Number of splices: GT/AG |   2566711
                       Number of splices: GC/AG |   21115
                       Number of splices: AT/AC |   2487
               Number of splices: Non-canonical |   6749
                      Mismatch rate per base, % |   0.39%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.29
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.19
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   1388385
             % of reads mapped to multiple loci |   3.77%
        Number of reads mapped to too many loci |   39921
             % of reads mapped to too many loci |   0.11%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   7346
       % of reads unmapped: too many mismatches |   0.02%
            Number of reads unmapped: too short |   25569506
                 % of reads unmapped: too short |   69.52%
                Number of reads unmapped: other |   2426
                     % of reads unmapped: other |   0.01%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%
ssun1116 commented 3 years ago

May I ask you one more question? I couldn't understand why these average input read length results came out (102 bp).

I have two reads - Read1 and Read2 (like paired-end data). Perhaps because it's small RNA data, they seem to be complementary to each other from beginning to end.

When I checked the input length for each read, the average length was about 50bp, but when I align with STAR, strangely, the average is about 102bp.

In this case, it may be recommended to just use either Read 1 or Read 2. However, since subsequent analyses only receive paired-end reads as input, I used both reads for STAR alignement. Can this affect the outcome?

I'm sorry it might be a silly question. I have no experience dealing with small RNA data, so I have a little lack of understanding of how to analyze the data...

alexdobin commented 3 years ago

Hi Seoyeon,

these are all good questions! STAR uses the total PE length (sum of two mates) to calculate "Average input read length", so it agrees with your estimate of ~2*50.

If all of your reads overlap, it means that "insert size" (i.e. the cDNA length) is shorter than the read length. In this case, indeed, using just one read is sufficient, and downstream analyses will not benefit from the paired-end alignments.

I would recommend that you try mapping Read1 and Read2 separately with stringent parameters. If this results in a much better mappability, it may indicate that the overlapping mates are causing low mappability. There are parameters that can be tweaked to mitigate that.

Cheers Alex