USDA-ARS-GBRU / itsxpress

Software to trim the ITS region of FASTQ sequences for amplicon sequencing analysis
Other
12 stars 9 forks source link

q2-ITSxpress generates empty sequences? #29

Closed gabrielet closed 1 month ago

gabrielet commented 2 years ago

First things first, sorry for cross-posting: this same question was posted on the QIIME2 forum. I will keep both posts updated, of course.

So, I am using vsearch to perform dereplication on some SampleData[JoinedSequencesWithQuality] that are obtained through ITSxpress. Problem is that vsearch is returning an error stating that some sequences are empty (Found blank or whitespace-only line before '+' in FASTQ file). I actually looked into the sequences in the .qza from ITSxpress and I can find a lot of unexpected stuff, i.e. empty sequences and super short sequences. A small example here:

@M01168:5:000000000-A7C98:1:1101:10151:9983 1:N:0:2
AACCC
+
JJJJJ
@M01168:5:000000000-A7C98:1:1101:10155:15804 2:N:0:2
CACCAATCAAGCCTGGCTTGGTATTGGGCGACGGGGTGCACCCGCGCCTCAATTTCTCCGGCTGAACGACCACTATCTCAGCGCTGTGATAATCAATTCGCTGTCGAGACGGGTGCTCACGCCGTTAAAGATTTTATACA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJGJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@M01168:5:000000000-A7C98:1:2101:23914:6789 1:N:0:2

@M01168:5:000000000-A7C98:1:1101:10165:3002 1:N:0:2
TCAAC
+
JIJJJ
@M01168:5:000000000-A7C98:1:1101:10181:5305 1:N:0:2
TCAA
+
JJJJ

Now, I understand there was an issue with a previous version of ITSxpress back in 2018 but it looked like with newer releases of the plugin, it was solved. However, here I am. I also found this other post reporting same issue by vsearch but I believe it's not a vsearch issue but an ITSxpress-related issue.

I run the same data with the DADA2 plugin and no problem here (it was also suggested in the same post i already linked). However, I wanted to perform the analysis on the same samples using both OTU and ASV approach but currently I can't go through vsearch due to this issue. I am also considering the possibility that trimming ITS sequences and then dereplicating and clustering with vsearch shouldn't be done but then...why not?

It might be worth noting that I am using QIIME2 2021.8 with ITSxpress version 1.8.0 and q2-itsxpress version 1.8.0.

arivers commented 2 years ago

This sometimes happens when trimmed reads are fed to ITSxpress, Illumina sequencers now sometimes do read trimming on the machine, so it can happen without you knowing. In the past you always got 2x300 reads from a run. It looks like you took data processed by ITSxpress outside of QIime and imported it. The easiest way to fix the problem is to filter out the 0-length reads. You can use your favorite software to do this. With bbtools I think the command would be reformat.sh in=r1.fq.gz in2=r2.fq.gz out=r1_clean.fq.gz out2=r2_clean.fq.gz minconsecutivebases=1 (or minconsecutivebases= 20,50, whatever to throw away short read pairs).

The same issue happens with the Qiime plugin of ITSxpress. see the explanation here https://forum.qiime2.org/t/itsxpress-error-missing-sequence-for-record-beginning-on-line/24007/4

We are working on a Version 1.8.1 end-of-life release and a Versionv2 so this issue with trimmed reads is on the list.

gabrielet commented 2 years ago

@arivers thank you for the explanation. However, I feel I need to add something to what you mentioned. I first posted because I was using ITSxpress from within QIIME2: imported data, i.e. paired end sequences as CasavaOneEightSingleLanePerSampleDirFmt, then on the artifact I run ITSxpress. At this point, I found the empty sequence and I posted the question. In the meanwhile, I decided to run three different experiments:

  1. run ITSxpress standalone version on a single sample, i.e. the one with the empty sequence I reported above;
  2. run ITSxpress through QIIME2 as qiime itsxpress trim-pair on the same sample;
  3. run ITSxpress through QIIME2 as above, but on the full dataset comprising 36 samples.

What is very interesting is that I can find the empty sequence (@M01168:5:000000000-A7C98:1:2101:23914:6789 1:N:0:2) in the experiment n. 3, i.e. qiime2 ITSxpress on all samples, only. Indeed, in exp n. 1, i.e. ITSxpress standalone, and exp n. 2, i.e. qiime2 ITSxpress on single sample, I am unable to find the empty read. All the others I listed above are there, with same length, same quality. The empty one is missing.

Experiment 1: itsxpress --fastq 44TCJuneAD012F0604_CD18_L000_R1_001.fastq --fastq2 44TCJuneAD012F0604_CD18_L000_R2_001.fastq --region ITS2 --taxa Fungi --outfile

Experiment 2:

conda activate qiime2-2021.8

# import demultiplexed data
qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path provaITSx/fastq_for_single_sample_experiment/ \
    --input-format CasavaOneEightSingleLanePerSampleDirFmt \
    --output-path provaITSx/fastq_for_single_sample_experiment/one_sample.qza

# run itsxpress on a single sample
qiime itsxpress trim-pair \
    --i-per-sample-sequences provaITSx/fastq_for_single_sample_experiment/one_sample.qza \
    --p-region ITS2 \
    --p-taxa F \
    --o-trimmed provaITSx/fastq_for_single_sample_experiment/one_sample_ITS.qza

Experiment 3:

# import demultiplexed data
qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path analyses_fungi/experiments/merged_exps/cutadapted_selected \
    --input-format CasavaOneEightSingleLanePerSampleDirFmt \
    --output-path analyses_fungi/experiments/merged_exps/qiime/all_fungi.qza

# run itsxpress on all the demultiplexed samples
qiime itsxpress trim-pair \
    --i-per-sample-sequences analyses_fungi/experiments/merged_exps/qiime/all_fungi.qza \
    --p-region ITS2 \
    --p-taxa F \
    --o-trimmed analyses_fungi/experiments/merged_exps/qiime/all_fungi_ITS_extracted_vsearch_new_for_comparison.qza
arivers commented 1 month ago

Resolved in Version 2.