ewmorr commented 6 months ago

Dear itsxpress authors,

I'm reporting a potential bug/unexpected behavior in itsxpress paired-end (PE) read mode. Namely, after processing PE reads with itsxpress and dada2, some ASV representative sequences retain conserved flanking regions, i.e., conserved regions are detected by ITSx.

First I'll give the general outlines of the problem, and then describe my workflow. I'm sorry for the very long post/explanation, but I wanted to be thorough in explaining my concerns. I expect this behavior could be replicated with many/most Illumina 250 bp PE ITS datasets targeting either ITS1 or ITS2, but I'd be happy to provide some representative data if it helps.

Bug and potential fix

I believe the issue arises from trimming only the 5-prime end of each read and not trimming the three-prime end of each read in PE read mode. From _map_func() in Dedup.py (L100 is the culprit)

https://github.com/USDA-ARS-GBRU/itsxpress/blob/0de1542a265a78c562c14cb9eaa63862cf79bc03/itsxpress/Dedup.py#L98-L100

The issue is when there is read-through on the 3-prime end of either read that runs into the conserved region. Here is a graphical example of what I think is happening.

Reads before trimming -- showing 3' read-through into conserved regions

        5.8S                    ITS2                     28S

R1 |----------------|-------------------------------|------------> 
       <------------|-------------------------------|----------------------| R2

Reads after trimming with itsxpress -- 5' read ends are trimmed but not 3' ends

        5.8S                    ITS2                     28S

R1                  |-------------------------------|------------>
       <------------|-------------------------------| R2

itsxpress trims the conserved region from the 5-prime end of each read, but conserved regions are retained on the 3-prime end. DADA2 defaults to allow staggered reads at the merge step (at least in the R version), so the conserved region ends up getting incorporated back into the representative sequence for the ASV.

Resulting ASV

        5.8S                    ITS2                     28S
       |------------|-------------------------------|------------|

Note: Different software has different default behaviors, e.g., vsearch --allow_mergestagger silently trims the overhang from 3-prime end, so these leftover conserved regions would not show up in vsearch pipelines. To me it seems the range of potential pipelines and different default behaviors causes more problems than not as we can not expect any standardization in the resulting representative sequence unless this is handled appropriately by itsxpress (i.e., changing the behavior of a downstream algorithm is not an effective solution).

Potential fix to `_map_func()`

Calculate the stop site of the reverse read, here r2end, and then slice the read using both the start and stop sites. Modifications are on the last two lines

start, stop, tlen = itspos.get_position(repseq)
r2start = tlen - stop
r2end = tlen - start #calculate end of R2
return record1[start:stop], record2[r2start:r2end] #trim both R1 and R2 three-prime ends based on stop sites

I went ahead and incorporated the suggested code above into a local copy of itsxpress. After processing reads with the modified itsxpress version and following the same dada2 protocol I no longer detect conserved regions with ITSx in the resulting ASV sequences. More details of my pipeline and the results follow.

Testing and results

I have NovaSeq 250 bp PE reads derived from 5.8S-Fun/ITS4-Fun primers sequenced in reverse direction (i.e., R1 adaptor is on ITS4-Fun and R2 on 5.8S-Fun).

itsxpress

I processed my sequences through itsxpress v2.x (installed via mamba install -c bioconda itsxpress on Feb 12, 2024).
I also processed sequences through itsxpress using a local git clone where I made the modification to _map_func() suggested above; installed the dependecies via mamba install -c bioconda itsxpress --only-deps and installed my mods via pip install --no-build-isolation --no-deps -e .
```
itsxpress --fastq $indir/$r1File --fastq2 $indir/$r2File \
--reversed_primers --threads 24 --cluster_id 1 --taxa Fungi --region ITS2 \
--outfile $outdir/$r1File --outfile2 $outdir/$r2File
```
Subsequent algorithms were run separately on either the sequences processed with the current itsxpress release or my modified version.

dada2

I proceeded with the dada2 workflow in R as described here starting with the filterAndTrim() step.

Note: there is a graphical example of the same read-through issue in the preamble of the dada2 ITS protocol, but the primer sequence location is not indicated as occurring in conserved flanking regions as is the case with most ITS primers.

ITSx

I then ran the resulting ASV sequences through ITSx v1.1.1 using either strict or relaxed search criteria and the HMMs database from a local git clone of itsxpress.

Note: the ITSx manual suggests using relaxed stringency when sequences are known to be Fungi; given that we have already processed reads through itsxpress and reads with no detections are discarded by itsxpress we should be OK with the relaxed settings.
```
ITSx -i ASVs.fa -o out_itsx_relaxed \
-t F --allow_single_domain 0.01,10 \
-E 0.01 --cpus 24 \
-p ~/repo/itsxpress_2x_mod/itsxpress/ITSx_db/HMMs
```

ITSx -i ASVs.fa -o out_itsx_strict \ -t F --allow_single_domain 0.00001,10 \ -E 0.00001 --cpus 24 \ -p ~/repo/itsxpress_2x_mod/itsxpress/ITSx_db/HMMs


### vsearch 
As another test I tried merging the output from itsxpress directly with vsearch to gauge the number of reads that failed merging due to staggered reads

vsearch --fastq_mergepairs $outdir/$r1File --reverse $outdir/$r2File \ --fastaout $outdir/$sample.fasta


## Results

### itsxpress 2.x (current release)
- Number ASVs: 3480
- ASV length bp (min, max): 183, 280
- Number detections ITSx: 12 stringent, 54 relaxed
- vsearch: large number of reads (80% and greater per sample) fail due to "staggered read pairs"
    - e.g., in one sample 6.2% merged, 93.8% not merged; 264491 of 282079 fail due to staggered read pairs

### itsxpress 2.x modified to trim three-prime read-through
- Number ASVs: 3355
- ASV length bp (min, max): 51, 280
- Number detections ITSx: 0 stringent or relaxed
- vsearch: 100% merged
    - e.g., in same sample as above 72 of 282079 fail due to multiple potential alignments or too many difs; all others pass

It is a small proportion of ASVs that retain ITSx-detectable conserved regions, but this will of course be dataset dependent and is really dependent on the length of the targeted ITS region. Further, given the large number of reads that fail vsearch merging it seems possible we are missing a lot of shorter chunks of the conserved regions with ITSx that are difficult to detect after trimming the 5-prime part of the reads with itsxpress.
#### For me, particularly concerning is the increased number of ASVs using the current release versus the modified version. I think what this suggests is that variable lengths of read-through are causing formation of unique ASVs from what would otherwise be considered a single ASV (based on only considering the ITS region). 
For example consider the following:

### Reads before trimming. Two identical ITS2 sequences with different read-through lengths

    5.8S                    ITS2                     28S

R1 |----------------|-------------------------------|-------------------> <------------|-------------------------------|----------------------| R2

R1 |----------------|-------------------------------|----------> <------|-------------------------------|----------------------| R2

### After trimming 5' only, two identical sequences based on ITS look like different ASVs because of variable length read-through

    5.8S                    ITS2                     28S

R1 |-------------------------------|-------------------> <------------|-------------------------------| R2

R1 |-------------------------------|----------> <------|-------------------------------| R2


This could occur when, e.g., different read quality on trailing ends causes truncation at different lengths.

arivers commented 6 months ago

Hi Eric,

Thanks for letting us know. This has got to be the best bug report I've ever seen. We are working on writing unit tests to thoroughly replicate the issue and validate the proposed solution on a range of different read types. We anticipate a code update in 1-2 days, including information on the versions where the behavior occurs. Staggered read merging and the use of vsearch for merging are new so I think this issue was introduced recently. I think, setting the trunc_len_f and trunc_len_r parameters in Dada2 to less than the anticipated ITS region length ( a common practice) would prevent the creation of extra ASVs, but I have not tested that yet. In any event, we want every forward and reverse read trimmed to the selected ITS region only.

arivers commented 6 months ago

This issue was fixed with the release of version 2.0.2.

Here's a detailed explanation from the change log:

Fixed a bug where the 3' end of the ITS region was not being trimmed from both forward and reverse reads if the read extended past the ITS region. This was due to the trimming being done at the start of both forward and reverse reads and not the end of each read. Thus if the read overlapped the opposite end of the ITS read, part of the conserved region would still be found on the ends of the forward and reverse read. This was fixed by trimming to just the ITS region for both forward and reverse reads. We tested and verified this bug did not affect the results of ASV calling with Dada2 because Dada2 ignored the sequence beyond the ITS region. This fix will make the output more consistent with expectations.
Added unit test to confirm that the 3' end of the ITS region is being trimmed from both forward and reverse reads.

ewmorr commented 5 months ago

Wonderful, thanks for fixing this! Just a note: according to the dada2 author overhang from staggered reads is not trimmed during mergePairs unless trimOverhang = T (default is false) and overhangs potentially affect the denoising step, which is presumably why I was seeing differences in dada2 results with or without 3' trimming in itsxpress. See discussion here. In any case, this fix will correct that, thanks!

USDA-ARS-GBRU / itsxpress

Conserved flanking regions are not fully trimmed from some sequences in paired-end mode #43

Bug and potential fix

Reads before trimming -- showing 3' read-through into conserved regions

Reads after trimming with itsxpress -- 5' read ends are trimmed but not 3' ends

Resulting ASV

Potential fix to `_map_func()`

Testing and results

itsxpress

dada2

ITSx

USDA-ARS-GBRU / itsxpress

Conserved flanking regions are not fully trimmed from some sequences in paired-end mode #43

Bug and potential fix

Reads before trimming -- showing 3' read-through into conserved regions

Reads after trimming with itsxpress -- 5' read ends are trimmed but not 3' ends

Resulting ASV

Potential fix to _map_func()

Testing and results

itsxpress

dada2

ITSx

Potential fix to `_map_func()`