Closed ewmorr closed 6 months ago
Hi Eric,
Thanks for letting us know. This has got to be the best bug report I've ever seen. We are working on writing unit tests to thoroughly replicate the issue and validate the proposed solution on a range of different read types. We anticipate a code update in 1-2 days, including information on the versions where the behavior occurs. Staggered read merging and the use of vsearch for merging are new, so I think this issue was introduced recently. I think setting the `trunc_len_f` and `trunc_len_r` parameters in Dada2 to less than the anticipated ITS region length (a common practice) would prevent the creation of extra ASVs, but I have not tested that yet. In any event, we want every forward and reverse read trimmed to the selected ITS region only.
This issue was fixed with the release of version 2.0.2.
Here's a detailed explanation from the change log:
Fixed a bug where the 3' end of the ITS region was not being trimmed from both forward and reverse reads if the read extended past the ITS region. This was due to the trimming being done at the start of both forward and reverse reads and not the end of each read. Thus if the read overlapped the opposite end of the ITS read, part of the conserved region would still be found on the ends of the forward and reverse read. This was fixed by trimming to just the ITS region for both forward and reverse reads. We tested and verified this bug did not affect the results of ASV calling with Dada2 because Dada2 ignored the sequence beyond the ITS region. This fix will make the output more consistent with expectations.
Added unit test to confirm that the 3' end of the ITS region is being trimmed from both forward and reverse reads.
Wonderful, thanks for fixing this! Just a note: according to the dada2 author, overhang from staggered reads is not trimmed during `mergePairs` unless `trimOverhang = T` (default is false), and overhangs potentially affect the denoising step, which is presumably why I was seeing differences in dada2 results with or without 3' trimming in itsxpress. See discussion here. In any case, this fix will correct that, thanks!
Dear itsxpress authors,
I'm reporting a potential bug/unexpected behavior in itsxpress paired-end (PE) read mode. Namely, after processing PE reads with itsxpress and dada2, some ASV representative sequences retain conserved flanking regions, i.e., conserved regions are detected by ITSx.
First I'll give the general outlines of the problem, and then describe my workflow. I'm sorry for the very long post/explanation, but I wanted to be thorough in explaining my concerns. I expect this behavior could be replicated with many/most Illumina 250 bp PE ITS datasets targeting either ITS1 or ITS2, but I'd be happy to provide some representative data if it helps.
Bug and potential fix
I believe the issue arises from trimming only the 5-prime end of each read, and not the 3-prime end, in PE read mode. The relevant code is in `_map_func()` in Dedup.py (L100 is the culprit): https://github.com/USDA-ARS-GBRU/itsxpress/blob/0de1542a265a78c562c14cb9eaa63862cf79bc03/itsxpress/Dedup.py#L98-L100
The issue is when there is read-through on the 3-prime end of either read that runs into the conserved region. Here is a graphical example of what I think is happening.
Reads before trimming -- showing 3' read-through into conserved regions
Reads after trimming with itsxpress -- 5' read ends are trimmed but not 3' ends
itsxpress trims the conserved region from the 5-prime end of each read, but conserved regions are retained on the 3-prime end. DADA2 defaults to allow staggered reads at the merge step (at least in the R version), so the conserved region ends up getting incorporated back into the representative sequence for the ASV.
Resulting ASV
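To make the mechanism concrete, here is a toy Python sketch of a staggered merge. It is not dada2's or vsearch's merging algorithm (it only does an exact-match overlap search), and the function name `merge_staggered`, its parameters, and the sequences are all made up for illustration. It assumes the 5' ends are already trimmed to the ITS boundaries while both 3' ends still run into the flanks.

```python
def merge_staggered(r1, r2_rc, min_overlap=5, trim_overhang=False):
    """Toy exact-match merge of a staggered read pair (illustrative only).

    r1     : forward read; starts at the ITS start and may run past the ITS end
    r2_rc  : reverse read, reverse-complemented into forward orientation, so it
             may start before the ITS start and ends at the ITS end
    """
    # Try the longest candidate overlap first: a suffix of r2_rc that is
    # also a prefix of r1.
    for k in range(len(r2_rc) - min_overlap + 1):
        overlap = r2_rc[k:]
        if r1.startswith(overlap):
            if trim_overhang:
                # Keep only the region covered by both reads.
                return overlap
            # Keep the overhangs: r2's 3' read-through, then all of r1.
            return r2_rc[:k] + r1
    return None


its = "ACGTTGCAAC"      # hypothetical ITS region
r1 = its + "CCC"        # 3' read-through into the downstream flank
r2_rc = "GGG" + its     # 3' read-through into the upstream flank

print(merge_staggered(r1, r2_rc))                      # flanks retained
print(merge_staggered(r1, r2_rc, trim_overhang=True))  # ITS only
```

With overhangs kept, the merged sequence carries both flanking fragments, which matches the conserved regions that ITSx detects in the ASVs.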
Potential fix to `_map_func()`
Calculate the stop site of the reverse read, here `r2end`, and then slice the read using both the start and stop sites. Modifications are on the last two lines.
I went ahead and incorporated the suggested code above into a local copy of itsxpress. After processing reads with the modified itsxpress version and following the same dada2 protocol, I no longer detect conserved regions with ITSx in the resulting ASV sequences. More details of my pipeline and the results follow.
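As a rough illustration of slicing each read by both a start and a stop site, here is a minimal Python sketch. It is not the itsxpress code: `trim_pair`, `revcomp`, and the coordinate model are assumptions for the example. The model takes ITS start/stop positions on the full amplicon, with R1 beginning at the amplicon's first base and R2 at its last base. Only the name `r2end` is taken from the description above.

```python
_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement a DNA string (A, C, G, T only)."""
    return seq.translate(_COMP)[::-1]


def trim_pair(r1, r2, tlen, its_start, its_stop):
    """Slice both reads to the ITS region (hypothetical helper).

    tlen                : length of the full amplicon (assumed known)
    its_start, its_stop : 0-based, half-open ITS coordinates on the amplicon
    """
    # Forward read: position i in R1 is amplicon position i.
    r1start = min(its_start, len(r1))
    r1end = min(its_stop, len(r1))
    # Reverse read: position i in R2 is amplicon position tlen - 1 - i,
    # so the ITS stop maps to R2's start site and the ITS start to its stop.
    r2start = min(max(tlen - its_stop, 0), len(r2))
    r2end = min(tlen - its_start, len(r2))
    return r1[r1start:r1end], r2[r2start:r2end]


flank5, its, flank3 = "GGGGG", "ACGTTGCAAC", "CCCCC"   # made-up sequences
template = flank5 + its + flank3
r1 = template[:18]            # reads through the ITS into the 3' flank
r2 = revcomp(template[2:])    # reads through the ITS into the 5' flank

t1, t2 = trim_pair(r1, r2, len(template), len(flank5), len(flank5) + len(its))
print(t1, t2)
```

Slicing with the start site alone (`r1[r1start:]`) would leave the read-through flank on the 3' end; adding the stop site removes it from both reads.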
Testing and results
I have NovaSeq 250 bp PE reads derived from 5.8S-Fun/ITS4-Fun primers sequenced in reverse direction (i.e., R1 adaptor is on ITS4-Fun and R2 on 5.8S-Fun).
itsxpress
Current release: installed with `mamba install -c bioconda itsxpress` (on Feb 12, 2024).
Modified version: includes the change to `_map_func()` suggested above; I installed the dependencies via `mamba install -c bioconda itsxpress --only-deps` and installed my mods via `pip install --no-build-isolation --no-deps -e .`
Subsequent algorithms were run separately on either the sequences processed with the current itsxpress release or my modified version.
dada2
I proceeded with the dada2 workflow in R as described here, starting with the `filterAndTrim()` step.
ITSx
I then ran the resulting ASV sequences through ITSx v1.1.1 using either strict or relaxed search criteria and the HMMs database from a local git clone of itsxpress.
ITSx -i ASVs.fa -o out_itsx_strict \
    -t F --allow_single_domain 0.00001,10 \
    -E 0.00001 --cpus 24 \
    -p ~/repo/itsxpress_2x_mod/itsxpress/ITSx_db/HMMs
vsearch --fastq_mergepairs $outdir/$r1File --reverse $outdir/$r2File \
    --fastaout $outdir/$sample.fasta
R1 |----------------|-------------------------------|------------------->
       <------------|-------------------------------|----------------------| R2

R1 |----------------|-------------------------------|---------->
             <------|-------------------------------|----------------------| R2

          R1 |-------------------------------|------------------->
<------------|-------------------------------| R2

    R1 |-------------------------------|---------->
<------|-------------------------------| R2