USDA-ARS-GBRU / itsxpress

Software to trim the ITS region of FASTQ sequences for amplicon sequencing analysis
Other
9 stars 9 forks source link

Conserved flanking regions are not fully trimmed from some sequences in paired-end mode #43

Closed ewmorr closed 6 months ago

ewmorr commented 6 months ago

Dear itsxpress authors,

I'm reporting a potential bug/unexpected behavior in itsxpress paired-end (PE) read mode. Namely, after processing PE reads with itsxpress and dada2, some ASV representative sequences retain conserved flanking regions, i.e., conserved regions are detected by ITSx.

First I'll give the general outlines of the problem, and then describe my workflow. I'm sorry for the very long post/explanation, but I wanted to be thorough in explaining my concerns. I expect this behavior could be replicated with many/most Illumina 250 bp PE ITS datasets targeting either ITS1 or ITS2, but I'd be happy to provide some representative data if it helps.

Bug and potential fix

I believe the issue arises from trimming only the 5-prime end of each read and not trimming the three-prime end of each read in PE read mode. From _map_func() in Dedup.py (L100 is the culprit)

https://github.com/USDA-ARS-GBRU/itsxpress/blob/0de1542a265a78c562c14cb9eaa63862cf79bc03/itsxpress/Dedup.py#L98-L100

The issue is when there is read-through on the 3-prime end of either read that runs into the conserved region. Here is a graphical example of what I think is happening.

Reads before trimming -- showing 3' read-through into conserved regions

        5.8S                    ITS2                     28S

R1 |----------------|-------------------------------|------------> 
       <------------|-------------------------------|----------------------| R2

Reads after trimming with itsxpress -- 5' read ends are trimmed but not 3' ends

        5.8S                    ITS2                     28S

R1                  |-------------------------------|------------>
       <------------|-------------------------------| R2

itsxpress trims the conserved region from the 5-prime end of each read, but conserved regions are retained on the 3-prime end. DADA2 defaults to allow staggered reads at the merge step (at least in the R version), so the conserved region ends up getting incorporated back into the representative sequence for the ASV.

Resulting ASV

        5.8S                    ITS2                     28S
       |------------|-------------------------------|------------|

Potential fix to _map_func()

Calculate the stop site of the reverse read, here r2end, and then slice the read using both the start and stop sites. Modifications are on the last two lines

start, stop, tlen = itspos.get_position(repseq)
r2start = tlen - stop
r2end = tlen - start #calculate end of R2
return record1[start:stop], record2[r2start:r2end] #trim both R1 and R2 three-prime ends based on stop sites

I went ahead and incorporated the suggested code above into a local copy of itsxpress. After processing reads with the modified itsxpress version and following the same dada2 protocol I no longer detect conserved regions with ITSx in the resulting ASV sequences. More details of my pipeline and the results follow.

Testing and results

I have NovaSeq 250 bp PE reads derived from 5.8S-Fun/ITS4-Fun primers sequenced in reverse direction (i.e., R1 adaptor is on ITS4-Fun and R2 on 5.8S-Fun).

itsxpress

dada2

I proceeded with the dada2 workflow in R as described here starting with the filterAndTrim() step.

ITSx

I then ran the resulting ASV sequences through ITSx v1.1.1 using either strict or relaxed search criteria and the HMMs database from a local git clone of itsxpress.

ITSx -i ASVs.fa -o out_itsx_strict \ -t F --allow_single_domain 0.00001,10 \ -E 0.00001 --cpus 24 \ -p ~/repo/itsxpress_2x_mod/itsxpress/ITSx_db/HMMs


### vsearch 
As another test I tried merging the output from itsxpress directly with vsearch to gauge the number of reads that failed merging due to staggered reads

vsearch --fastq_mergepairs $outdir/$r1File --reverse $outdir/$r2File \ --fastaout $outdir/$sample.fasta


## Results

### itsxpress 2.x (current release)
- Number ASVs: 3480
- ASV length bp (min, max): 183, 280
- Number detections ITSx: 12 stringent, 54 relaxed
- vsearch: large number of reads (80% and greater per sample) fail due to "staggered read pairs"
    - e.g., in one sample 6.2% merged, 93.8% not merged; 264491 of 282079 fail due to staggered read pairs

### itsxpress 2.x modified to trim three-prime read-through
- Number ASVs: 3355
- ASV length bp (min, max): 51, 280
- Number detections ITSx: 0 stringent or relaxed
- vsearch: 100% merged
    - e.g., in same sample as above 72 of 282079 fail due to multiple potential alignments or too many difs; all others pass

It is a small proportion of ASVs that retain ITSx-detectable conserved regions, but this will of course be dataset dependent and is really dependent on the length of the targeted ITS region. Further, given the large number of reads that fail vsearch merging it seems possible we are missing a lot of shorter chunks of the conserved regions with ITSx that are difficult to detect after trimming the 5-prime part of the reads with itsxpress.
#### For me, particularly concerning is the increased number of ASVs using the current release versus the modified version. I think what this suggests is that variable lengths of read-through are causing formation of unique ASVs from what would otherwise be considered a single ASV (based on only considering the ITS region). 
For example consider the following:

### Reads before trimming. Two identical ITS2 sequences with different read-through lengths
    5.8S                    ITS2                     28S

R1 |----------------|-------------------------------|-------------------> <------------|-------------------------------|----------------------| R2

R1 |----------------|-------------------------------|----------> <------|-------------------------------|----------------------| R2

### After trimming 5' only, two identical sequences based on ITS look like different ASVs because of variable length read-through
    5.8S                    ITS2                     28S

R1 |-------------------------------|-------------------> <------------|-------------------------------| R2

R1 |-------------------------------|----------> <------|-------------------------------| R2


This could occur when, e.g., different read quality on trailing ends causes truncation at different lengths.
arivers commented 6 months ago

Hi Eric,

Thanks for letting us know. This has got to be the best bug report I've ever seen. We are working on writing unit tests to thoroughly replicate the issue and validate the proposed solution on a range of different read types. We anticipate a code update in 1-2 days, including information on the versions where the behavior occurs. Staggered read merging and the use of vsearch for merging are new so I think this issue was introduced recently. I think, setting the trunc_len_f and trunc_len_r parameters in Dada2 to less than the anticipated ITS region length ( a common practice) would prevent the creation of extra ASVs, but I have not tested that yet. In any event, we want every forward and reverse read trimmed to the selected ITS region only.

arivers commented 6 months ago

This issue was fixed with the release of version 2.0.2.

Here's a detailed explanation from the change log:

ewmorr commented 5 months ago

Wonderful, thanks for fixing this! Just a note: according to the dada2 author overhang from staggered reads is not trimmed during mergePairs unless trimOverhang = T (default is false) and overhangs potentially affect the denoising step, which is presumably why I was seeing differences in dada2 results with or without 3' trimming in itsxpress. See discussion here. In any case, this fix will correct that, thanks!