DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

Expected or acceptable overall alignment rates #293

Open sjfleck opened 3 years ago

sjfleck commented 3 years ago

I'm wondering what overall alignment rates are expected or acceptable with HISAT2. I have a chromosome-level plant assembly (~600 Mb, ~92% of BUSCOs identified) and 3 biological replicates of RNA-seq data. These were my commands:

hisat2-build -p 16 $REF $SAMPLE
hisat2 --max-intronlen 20000 -p 16 --dta -x $SAMPLE -1 $READS1A,$READS1B,$READS1C -2 $READS2A,$READS2B,$READS2C -S $SAMPLE.sam

and this was my output for all the RNAseq reads together:

67526744 reads; of these:
  67526744 (100.00%) were paired; of these:
    24416875 (36.16%) aligned concordantly 0 times
    40845701 (60.49%) aligned concordantly exactly 1 time
    2264168 (3.35%) aligned concordantly >1 times

24416875 pairs aligned concordantly 0 times; of these:
  2000368 (8.19%) aligned discordantly 1 time
----
22416507 pairs aligned 0 times concordantly or discordantly; of these:
  44833014 mates make up the pairs; of these:
    34610470 (77.20%) aligned 0 times
    9502269 (21.19%) aligned exactly 1 time
    720275 (1.61%) aligned >1 times

74.37% overall alignment rate
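
For reference, the overall rate in this summary can be reproduced from the counts above it. The sketch below assumes the rate is the share of individual mates that aligned: each concordantly or discordantly aligned pair counts as two aligned mates, plus the mates that aligned on their own. That formula is inferred from the numbers in this post rather than taken from the HISAT2 documentation, and the function name is made up for illustration.

```python
def overall_alignment_rate(total_pairs, conc_once, conc_multi, disc_once,
                           mates_once, mates_multi):
    """Percent of mates that aligned (assumed formula, inferred from the summary)."""
    # Each pair that aligned concordantly or discordantly contributes two mates.
    aligned_mates = 2 * (conc_once + conc_multi + disc_once)
    # Mates from pairs that did not align as a pair are counted individually.
    aligned_mates += mates_once + mates_multi
    return 100.0 * aligned_mates / (2 * total_pairs)


# Counts copied from the combined-replicate summary above.
rate = overall_alignment_rate(
    total_pairs=67526744,
    conc_once=40845701,
    conc_multi=2264168,
    disc_once=2000368,
    mates_once=9502269,
    mates_multi=720275,
)
print(f"{rate:.2f}% overall alignment rate")  # prints "74.37% overall alignment rate"
```

Under this reading, the missing ~25% corresponds to the 34610470 mates that aligned 0 times out of 2 x 67526744 mates in total.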

Looking through other issues, I mostly see overall alignment rates of ~95%. Is my ~75% a problem, and if so, can you recommend options that might increase it? I trimmed the RNA-seq reads with Trimmomatic, using Trinity's Trimmomatic parameters (ILLUMINACLIP:$adapters/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:5 MINLEN:25). My best guess is that I was too conservative with the --max-intronlen flag. Do you have a recommended value for plants? Thank you for your time.

I also ran the 3 replicates separately and here are the 3 overall alignment rates:

79.64% overall alignment rate
75.35% overall alignment rate
67.55% overall alignment rate

sjfleck commented 3 years ago

I want to give an update. I'm doing similar analyses on a few closely related plant species, but only the first example uses a chromosome-level hybrid assembly. For each species, I have 3 biological replicates of RNA-seq data. This time, the reference was a MaSuRCA assembly (~590 Mb; 133,434 contigs; 81% complete BUSCOs, 8.1% fragmented BUSCOs). The HISAT2 results for all 3 RNA-seq biological replicates together were:

56860312 reads; of these:
  56860312 (100.00%) were paired; of these:
    11384790 (20.02%) aligned concordantly 0 times
    42339792 (74.46%) aligned concordantly exactly 1 time
    3135730 (5.51%) aligned concordantly >1 times

11384790 pairs aligned concordantly 0 times; of these:
  1876601 (16.48%) aligned discordantly 1 time
----
9508189 pairs aligned 0 times concordantly or discordantly; of these:
  19016378 mates make up the pairs; of these:
    10638873 (55.95%) aligned 0 times
    7229999 (38.02%) aligned exactly 1 time
    1147506 (6.03%) aligned >1 times

90.64% overall alignment rate
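
A summary like this can also be read programmatically, which helps when comparing several runs. The following is a minimal sketch, assuming the summary text has one count at the start of each line; the helper name `leading_count` is hypothetical, and the text is the combined-replicate output above. It extracts the total pair count and the number of mates that aligned 0 times, then reports the unaligned share of all mates.

```python
SUMMARY = """\
56860312 reads; of these:
  56860312 (100.00%) were paired; of these:
    11384790 (20.02%) aligned concordantly 0 times
    42339792 (74.46%) aligned concordantly exactly 1 time
    3135730 (5.51%) aligned concordantly >1 times
    ----
    11384790 pairs aligned concordantly 0 times; of these:
      1876601 (16.48%) aligned discordantly 1 time
    ----
    9508189 pairs aligned 0 times concordantly or discordantly; of these:
      19016378 mates make up the pairs; of these:
        10638873 (55.95%) aligned 0 times
        7229999 (38.02%) aligned exactly 1 time
        1147506 (6.03%) aligned >1 times
90.64% overall alignment rate
"""


def leading_count(text, phrase):
    """Return the integer at the start of the first line containing `phrase`."""
    for line in text.splitlines():
        if phrase in line:
            return int(line.split()[0])
    raise ValueError(f"no line contains {phrase!r}")


total_pairs = leading_count(SUMMARY, "were paired")
# "%) aligned 0 times" matches only the per-mate line, not the
# "pairs aligned 0 times concordantly or discordantly" line.
unaligned_mates = leading_count(SUMMARY, "%) aligned 0 times")

unaligned_pct = 100.0 * unaligned_mates / (2 * total_pairs)
print(f"{unaligned_pct:.2f}% of mates did not align")  # prints "9.36% of mates did not align"
```

The 9.36% here is the complement of the reported 90.64% overall alignment rate.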

And these are the individual overall alignment rates:

89.35% overall alignment rate
89.86% overall alignment rate
92.55% overall alignment rate

These were a bit higher than for my first species. The biggest difference is that this run used a highly fragmented MaSuRCA assembly, whereas the first used a chromosome-level hybrid assembly (with a higher BUSCO score). This ~90% is much better, but still a little lower than what I've seen in other issues. I'm not sure whether this helps, but I thought more information might be useful.