Weeks-UNC / shapemapper2

Public repository for ShapeMapper 2 releases

High percentage of unpaired reads in alignment #49

Closed Shashankti closed 8 months ago

Shashankti commented 8 months ago

I have been having some issues running the shapemapper2 pipeline on one of our samples. The pipeline gives the following error:

| Read depth check:
| 100.0% (67/67) nucleotides meet the minimum read depth of 5000
| PASS
| 
| Mutation rate check:
| 71.6% (48/67) nucleotides have positive mutation rates
| above background
| PASS: There is a clear difference in mutation rates between
|       modified and untreated samples.
| 
| High background check:
| 0.0% (0/67) nucleotides have high background mutation rates.
| PASS: Not too many nucleotides with high background mutation rates.
| 
| Number highly reactive check:
| 0.0% (0/67) nucleotides show high apparent reactivity.
| FAIL
|       Possible causes:
|        - DNA contamination
|        - poor mixing of chemical reagents and RNA and/or poor
|          reagent diffusion (if modifying in cells), resulting
|          in low modification rates
|        - expired reagents, resulting in low modification rates
|        - poor reverse transcription conditions, resulting in
|          low adduct read-through
|        - extremely highly structured RNA

I decided to check the alignment stats to try to understand the error and this was the output:

  |BowtieAligner (sample: Denatured) output message: 
  |------------------------------------------------- 
  | 
  | 1745050 reads; of these:
  |   37374 (2.14%) were paired; of these:
  |     29464 (78.84%) aligned concordantly 0 times
  |     7909 (21.16%) aligned concordantly exactly 1 time
  |     1 (0.00%) aligned concordantly >1 times
  |     ----
  |     29464 pairs aligned concordantly 0 times; of these:
  |       221 (0.75%) aligned discordantly 1 time
  |     ----
  |     29243 pairs aligned 0 times concordantly or discordantly; of these:
  |       58486 mates make up the pairs; of these:
  |         34917 (59.70%) aligned 0 times
  |         23562 (40.29%) aligned exactly 1 time
  |         7 (0.01%) aligned >1 times
  |   1707676 (97.86%) were unpaired; of these:
  |     1695 (0.10%) aligned 0 times
  |     1705807 (99.89%) aligned exactly 1 time
  |     174 (0.01%) aligned >1 times
  | 97.95% overall alignment rate
  \___________________________________
  |BowtieAligner (sample: Untreated) output message: 
  |------------------------------------------------- 
  | 
  | 1357108 reads; of these:
  |   27491 (2.03%) were paired; of these:
  |     20340 (73.99%) aligned concordantly 0 times
  |     7150 (26.01%) aligned concordantly exactly 1 time
  |     1 (0.00%) aligned concordantly >1 times
  |     ----
  |     20340 pairs aligned concordantly 0 times; of these:
  |       169 (0.83%) aligned discordantly 1 time
  |     ----
  |     20171 pairs aligned 0 times concordantly or discordantly; of these:
  |       40342 mates make up the pairs; of these:
  |         23197 (57.50%) aligned 0 times
  |         17137 (42.48%) aligned exactly 1 time
  |         8 (0.02%) aligned >1 times
  |   1329617 (97.97%) were unpaired; of these:
  |     1298 (0.10%) aligned 0 times
  |     1328294 (99.90%) aligned exactly 1 time
  |     25 (0.00%) aligned >1 times
  | 98.23% overall alignment rate

  |BowtieAligner (sample: Modified) output message: 
  |------------------------------------------------ 
  | 
  | 1693843 reads; of these:
  |   29534 (1.74%) were paired; of these:
  |     23151 (78.39%) aligned concordantly 0 times
  |     6383 (21.61%) aligned concordantly exactly 1 time
  |     0 (0.00%) aligned concordantly >1 times
  |     ----
  |     23151 pairs aligned concordantly 0 times; of these:
  |       187 (0.81%) aligned discordantly 1 time
  |     ----
  |     22964 pairs aligned 0 times concordantly or discordantly; of these:
  |       45928 mates make up the pairs; of these:
  |         25024 (54.49%) aligned 0 times
  |         20901 (45.51%) aligned exactly 1 time
  |         3 (0.01%) aligned >1 times
  |   1664309 (98.26%) were unpaired; of these:
  |     604 (0.04%) aligned 0 times
  |     1663688 (99.96%) aligned exactly 1 time
  |     17 (0.00%) aligned >1 times
  | 98.51% overall alignment rate

I looked at the aligned SAM files and found this sequence to be significantly overrepresented: [image]

For reference, this is the target.fa file used for the run:

>mir-132_RNA
taatgggagaccgcccccgcgtctCCAGGGCAACCGTGGCTTTCGATTGTTACTGTGGGAACTGGAGGTAACAGTCTACAGCCATGGTCGCcccgcagcacgcccacgcgcattg

It seems that the reverse complement sequence is not read in properly. Can you please let me know what could be the cause of this error, or if this behavior is normal?
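
One way to check the strand question directly is to test whether a read taken from the SAM file matches the target as written or only after reverse complementing. The following is a minimal sketch, not part of ShapeMapper2: the helper names `reverse_complement` and `orientation` are hypothetical, the example read is simply a slice of the target, and exact substring matching is a simplification because real reads carry mutations and sequencing errors.

```python
# Target sequence copied from target.fa (lowercase flanks kept as given).
TARGET = (
    "taatgggagaccgcccccgcgtct"
    "CCAGGGCAACCGTGGCTTTCGATTGTTACTGTGGGAACTGGAGGTAACAGTCTACAGCCATGGTCGC"
    "cccgcagcacgcccacgcgcattg"
)

# Translation table for DNA complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orientation(read: str, target: str) -> str:
    """Classify a read as matching the target strand, its reverse
    complement, or neither. Uses exact, case-insensitive matching only."""
    read_u, target_u = read.upper(), target.upper()
    if read_u in target_u:
        return "forward"
    if reverse_complement(read_u) in target_u:
        return "reverse"
    return "no exact match"


# Hypothetical reads for illustration: a slice of the target and its
# reverse complement.
example_read = TARGET[30:60]
print(orientation(example_read, TARGET))
print(orientation(reverse_complement(example_read), TARGET))
```

A read reported as "reverse" here is not necessarily a problem, since an aligner records the strand of each alignment in the SAM FLAG field.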

Thank you for the help

Psirving commented 8 months ago

First, the initial warning:

| Number highly reactive check:
| 0.0% (0/67) nucleotides show high apparent reactivity.
| FAIL

ShapeMapper2 is indicating that while most nucleotides have a positive mutation rate, that mutation rate is not exceptionally high. This could be for any of the reasons listed:

|       Possible causes:
|        - DNA contamination
|        - poor mixing of chemical reagents and RNA and/or poor
|          reagent diffusion (if modifying in cells), resulting
|          in low modification rates
|        - expired reagents, resulting in low modification rates
|        - poor reverse transcription conditions, resulting in
|          low adduct read-through
|        - extremely highly structured RNA

Second, regarding the alignment, I don't think I am seeing the same issue that you are. It is easy to be distracted by the alignment rates of the PAIRED reads, which only represent a few percent of your samples. I've removed that information here to highlight the actual total alignment rates for each sample.

Edit: "Paired reads" is misleading here. ShapeMapper2 first performs read merging, then alignment. During read merging, R1 and R2 are combined into a single read and passed to bowtie as an "unpaired" fasta file. The "paired reads" are the ones that stayed unmerged and are passed to bowtie as separate R1 and R2 files. Having a high percentage of "unpaired" reads is a good thing.

  |BowtieAligner (sample: Denatured) output message: 
  |------------------------------------------------- 
  | ...
  | 97.95% overall alignment rate 

  |BowtieAligner (sample: Untreated) output message: 
  |------------------------------------------------- 
  | ...
  | 98.23% overall alignment rate

  |BowtieAligner (sample: Modified) output message: 
  |------------------------------------------------ 
  | 98.51% overall alignment rate
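
As a rough arithmetic check, the overall rates can be recomputed from the counts in the full bowtie output earlier in the thread. This is a sketch under one assumption that is not taken from bowtie's source: each unmerged pair is counted as two reads, and every merged ("unpaired") read as one. The function name `overall_alignment_rate` and its parameter names are made up for illustration.

```python
def overall_alignment_rate(pairs, conc_1, conc_gt1, disc_1,
                           mate_1, mate_gt1, unpaired, unp_1, unp_gt1):
    """Percent of reads aligned, counting each unmerged pair as two reads
    (an assumption that happens to reproduce the reported values)."""
    aligned = (2 * (conc_1 + conc_gt1 + disc_1)   # both mates of aligned pairs
               + mate_1 + mate_gt1                # mates aligned individually
               + unp_1 + unp_gt1)                 # merged reads that aligned
    total = 2 * pairs + unpaired
    return 100.0 * aligned / total


# Counts copied from the BowtieAligner messages posted above.
SAMPLES = {
    "Denatured": dict(pairs=37374, conc_1=7909, conc_gt1=1, disc_1=221,
                      mate_1=23562, mate_gt1=7,
                      unpaired=1707676, unp_1=1705807, unp_gt1=174),
    "Untreated": dict(pairs=27491, conc_1=7150, conc_gt1=1, disc_1=169,
                      mate_1=17137, mate_gt1=8,
                      unpaired=1329617, unp_1=1328294, unp_gt1=25),
    "Modified": dict(pairs=29534, conc_1=6383, conc_gt1=0, disc_1=187,
                     mate_1=20901, mate_gt1=3,
                     unpaired=1664309, unp_1=1663688, unp_gt1=17),
}

RATES = {name: round(overall_alignment_rate(**counts), 2)
         for name, counts in SAMPLES.items()}

for name, rate in RATES.items():
    print(f"{name}: {rate:.2f}% overall alignment rate")
```

Under that assumption the sketch gives 97.95%, 98.23%, and 98.51%, matching the values bowtie reported. The merged reads dominate the totals, so the poor concordance of the few unmerged pairs barely moves the overall rate.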

Shashankti commented 8 months ago

Thank you so much for the clarification. I misunderstood the alignment stats. Can you confirm that having overrepresented sequences after filtering and merging is standard behavior? I was not able to see that in the example run files.

Thanks

Psirving commented 8 months ago

In short, the more informative warning is the one you are getting from ShapeMapper2. The low mutation rates in your treated sample are the problem.

From the FastQC documentation:

Because the duplication detection requires an exact sequence match over the whole length of the sequence any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.

However, this also means that there may be chemical-adduct-induced mutations which FastQC does not see, causing the program to overreport non-duplicated sequences. I would expect a single sequence accounting for ~90% of reads to be typical for an amplicon experiment with low mutation rates.
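
The effect of the truncation rule can be illustrated with a toy example. This is a simplified sketch of the behavior described in the quoted documentation, not FastQC's actual implementation. The reads are synthetic, and the names `dedup_key`, `top_sequence_fraction`, and `mutate` are hypothetical. In this toy data, mutations inside the first 50 bases split otherwise identical amplicon reads into distinct sequences, while mutations beyond position 50 go unseen.

```python
from collections import Counter


def dedup_key(read: str, long_cutoff: int = 75, keep: int = 50) -> str:
    """Key used for exact-match duplicate counting: reads longer than
    `long_cutoff` are truncated to their first `keep` bases."""
    return read[:keep] if len(read) > long_cutoff else read


def top_sequence_fraction(reads) -> float:
    """Fraction of reads that share the most common duplicate key."""
    counts = Counter(dedup_key(r) for r in reads)
    return counts.most_common(1)[0][1] / len(reads)


def mutate(seq: str, pos: int, base: str) -> str:
    """Return `seq` with the base at index `pos` replaced by `base`."""
    return seq[:pos] + base + seq[pos + 1:]


# Synthetic 100-nt amplicon and ten reads derived from it.
amplicon = "ACGT" * 25
reads = (
    [amplicon] * 6                       # unmutated reads
    + [mutate(amplicon, 80, "C")] * 2    # mutation past base 50: not seen
    + [mutate(amplicon, 10, "T")] * 2    # mutation within first 50: seen
)

truncated_fraction = top_sequence_fraction(reads)
full_length_fraction = Counter(reads).most_common(1)[0][1] / len(reads)
print(truncated_fraction, full_length_fraction)
```

With these ten reads, the most common sequence covers 8 of 10 reads after truncation but only 6 of 10 when full-length reads are compared exactly.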

Shashankti commented 8 months ago

Thanks again for the help.