USDA-ARS-GBRU / itsxpress

Software to trim the ITS region of FASTQ sequences for amplicon sequencing analysis

ITS produces empty files after merging - No ITS start or stop sites detected #24

Closed science-chump closed 7 months ago

science-chump commented 2 years ago

Hey all, I am having trouble getting ITSxpress to work on my files. The sequences were amplified using ITS4Fun and 5.8S primers to capture the ITS2 region.

The merging step seems to be working, because I get statistics on the percentage of merged reads and on read lengths. However, my sequences seem to be erroring out: I get a message that says no ITS start or stop sites were detected, and this line repeats many times.

Any insight would be greatly appreciated!

Here is the code I am using:

```
conda activate trim_3p

INDIR=/mnt/home/ernakovich/srs1085/DATA/Rhizo_pilot/ITS_dada2/02_filter/preprocessed_F
INDIR2=/mnt/home/ernakovich/srs1085/DATA/Rhizo_pilot/ITS_dada2/02_filter/preprocessed_R
OUTDIR=ITSxpress_f
OUTDIR2=ITSexpress_r
mkdir $OUTDIR
mkdir $OUTDIR2

for i in $INDIR/*R1*
do
(
FILE=${i##*/}
BEFFILE=${FILE%R1*}
AFTFILE=${FILE##*R1}
R1=$FILE
R2=${BEFFILE}R2${AFTFILE}
echo $R1
if [ -f $OUTDIR2/$R2 ]
then
continue
fi

srun ~/.local/bin/itsxpress \
--fastq $INDIR/$R1 --fastq2 $INDIR2/$R2 \
--outfile $OUTDIR/$R1 --outfile2 $OUTDIR2/$R2 \
--region ITS2 --taxa 'Fungi' --cluster_id 1 \
--reversed_primers \
--threads 16 \
--log itsxpress.log

)
done
```

arivers commented 2 years ago

What primer set are you using?


science-chump commented 2 years ago

ITS4fun and 5.8S from Taylor et al. 2016

arivers commented 2 years ago

Could you attach the log file?

science-chump commented 2 years ago

It is not letting me upload the .txt file but I will post some of it in here. Apologies if there is a better way to format all of this for the forum, I am a newbie.

It starts with this:

11-21 20:20 root         INFO     Verifying the input sequences.
11-21 20:20 root         INFO     Sequences are paired-end in two files. They will be merged using BBmerge.
11-21 20:20 root         INFO     java -ea -Xmx1000m -Xms1000m -Djava.library.path=/mnt/home/ernakovich/srs1085/.conda/envs/trim_3p/opt/bbmap-38.93-0/jni/ -cp /mnt/home/ernakovich/srs1085/.conda/envs/trim_3p/opt/bbmap-38.93-0/current/ jgi.BBMerge in=/mnt/home/ernakovich/srs1085/DATA/Rhizo_pilot/ITS_dada2/02_filter/preprocessed_R/1d_9-8_ITS_S14_L002_R2_001.fastq.gz in2=/mnt/home/ernakovich/srs1085/DATA/Rhizo_pilot/ITS_dada2/02_filter/preprocessed_F/1d_9-8_ITS_S14_L002_R1_001.fastq.gz out=/tmp/itsxpress_7zx0pbm5/seq.fq.gz t=16 maxmismatches=40 maxratio=0.3
Executing jgi.BBMerge [in=/mnt/home/ernakovich/srs1085/DATA/Rhizo_pilot/ITS_dada2/02_filter/preprocessed_R/1d_9-8_ITS_S14_L002_R2_001.fastq.gz, in2=/mnt/home/ernakovich/srs1085/DATA/Rhizo_pilot/ITS_dada2/02_filter/preprocessed_F/1d_9-8_ITS_S14_L002_R1_001.fastq.gz, out=/tmp/itsxpress_7zx0pbm5/seq.fq.gz, t=16, maxmismatches=40, maxratio=0.3]
Version 38.93

Set threads to 16
Writing mergable reads merged.
Started output threads.
Total time: 6.156 seconds.

Pairs:                  250261
Joined:                 205989      82.310%
Ambiguous:              32945       13.164%
No Solution:            11327       4.526%
Too Short:              0           0.000%

Avg Insert:             340.0
Standard Deviation:     29.0
Mode:                   317

Insert range:           104 - 443
90th percentile:        385
75th percentile:        372
50th percentile:        323
25th percentile:        317
10th percentile:        315

11-21 20:20 root         INFO     Temporary directory is: /tmp/itsxpress_7zx0pbm5
11-21 20:20 root         INFO     Unique sequences are being written to a temporary FASTA file with Vsearch.
11-21 20:20 root         INFO     WARNING: The derep_fulllength command does not support multithreading.
Only 1 thread used.
vsearch v2.18.0_linux_x86_64, 995.5GB RAM, 64 cores
https://github.com/torognes/vsearch

Dereplicating file /tmp/itsxpress_7zx0pbm5/seq.fq.gz 100%
70045619 nt in 205989 seqs, min 104, max 443, avg 340
Sorting 100%
73002 unique sequences, avg cluster 2.8, median 1, max 8869
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%

11-21 20:20 root         INFO     Searching for ITS start and stop sites using HMMSearch. This step takes a while.
11-21 20:22 root         INFO     Parsing HMM results.
11-21 20:22 root         INFO     Writing out sequences

Next, this line repeats for at least a few hundred lines, with the numbers changing:

11-21 20:22 root DEBUG No ITS stop or start sites were identified for sequence A01346:32:HFMCNDRXY:2:2101:2139:1000, skipping.

Then it ends with this:

11-21 20:23 root INFO ITSxpress ran in 00:03:17
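To put the repeated DEBUG line in context, it can help to count how many reads were skipped and compare that with the 205989 joined pairs reported by BBMerge. The sketch below is only an illustration: the function name `count_skipped_reads` is made up here, and the pattern is copied from the log line quoted above.

```shell
#!/usr/bin/env bash
# Count how many reads were skipped for lacking ITS start/stop sites.
# count_skipped_reads is a hypothetical helper name, not part of ITSxpress.
count_skipped_reads() {
    # grep -c prints the number of matching lines; "|| true" keeps a zero
    # count from aborting scripts that run under "set -e".
    grep -c 'No ITS stop or start sites were identified' "$1" || true
}
```

For example, `count_skipped_reads itsxpress.log`. If several samples wrote to the same log file, the count would cover all of them.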

arivers commented 2 years ago

Those lines are just informational, not an error. After the merged, de-replicated seed sequence is created, ITSxpress searches it for the start and stop sites. Sometimes a merged sequence does not contain the full region because of quality issues. If the start or stop site is missing from that seed sequence, you will get a warning for every other sequence in the de-replicated cluster. How many reads are you getting out in the end? Is it a reasonable number? Some loss is normal.
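One way to answer the "how many reads" question is to count FASTQ records in the input and output files. This is a minimal sketch that assumes standard four-line FASTQ records in gzip-compressed files; `count_fastq_reads` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the number of records in a gzipped FASTQ file.
# Assumes each record occupies exactly four lines.
count_fastq_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}
```

Running it on an input file and on the matching ITSxpress output file gives a rough measure of how many reads were lost. A result of 0 would confirm that an output file contains no sequences.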

science-chump commented 2 years ago

The output files are all blank after they go through ITSxpress. No sequences are retained for any of the samples, and every output file is 1 kB in size.

I don't know if this should matter or not, but I used cutadapt and dada2 filtering prior to putting the samples through ITSxpress.

arivers commented 2 years ago

That may be the issue. The normal procedure is to remove adapters from your paired-end reads, then run ITSxpress. The output of ITSxpress goes into Dada2. I'm not sure what you mean by using Dada2 first. Dada2 primarily creates the ASVs and an ASV count table.

science-chump commented 2 years ago

My mistake, the pipeline I am using is written for dada2, but I am actually inserting ITSxpress into it. The only steps that occur before ITSxpress are removal of primers/adapters with cutadapt and removal of sequences with low-quality bases.

science-chump commented 2 years ago

I have done a little bit of troubleshooting and have confirmed that my installation of ITSxpress was successful and that the BASH syntax is working correctly to locate my files.
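A check like this can be done as a dry run that only lists file pairs, without calling ITSxpress. The sketch below mirrors the parameter expansions used in the loop from the original post. The function names `r2_name_for` and `check_pairs` are hypothetical, and the two directory arguments stand in for `$INDIR` and `$INDIR2`.

```shell
#!/usr/bin/env bash
# Derive the R2 file name from an R1 path.
r2_name_for() {
    local file=${1##*/}        # strip the directory part
    local before=${file%R1*}   # text before the last "R1"
    local after=${file##*R1}   # text after the last "R1"
    printf '%s\n' "${before}R2${after}"
}

# Dry run: report each R1 file and whether its R2 partner exists.
check_pairs() {
    local indir=$1 indir2=$2 r1 r2
    for r1 in "$indir"/*R1*; do
        [ -e "$r1" ] || continue   # the glob matched nothing
        r2=$(r2_name_for "$r1")
        if [ -f "$indir2/$r2" ]; then
            printf 'OK\t%s\t%s\n' "${r1##*/}" "$r2"
        else
            printf 'MISSING\t%s\t%s\n' "${r1##*/}" "$r2"
        fi
    done
}
```

Calling `check_pairs "$INDIR" "$INDIR2"` prints one line per forward file, so any `MISSING` line points to a pairing problem rather than an ITSxpress problem.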

I am stumped as to why ITSxpress is unable to locate ITS start or stop sites since they are all amplified with ITS4-FUN and 5.8S primers.

arivers commented 2 years ago

I helped another user with this issue last week, and it turned out that his read 2 qualities were very low and that was what was driving it. It may be worth looking at yours and, if necessary, running only your forward reads through.

Note that that message can mean that the start site, the stop site, or both are missing for a particular merged read. It does not necessarily mean that both were missing. You can save the temp dir with the --keep-temp flag, then look at the intermediate file tblout.txt to figure that out.
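A quick way to compare read 1 and read 2 qualities is to compute the mean Phred score of each file. This is a rough sketch, not a replacement for a proper quality report: it assumes four-line FASTQ records with Phred+33 encoding, and `mean_phred` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the mean Phred quality over all bases in a gzipped FASTQ file.
# Assumes Phred+33 encoding and four lines per record.
mean_phred() {
    gzip -dc "$1" | awk '
        # Build a lookup table from printable characters to ASCII codes.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line is a quality string.
        NR % 4 == 0 {
            n = length($0)
            for (j = 1; j <= n; j++) {
                sum += ord[substr($0, j, 1)] - 33
                bases++
            }
        }
        END {
            if (bases > 0) printf "%.1f\n", sum / bases
            else print "NA"
        }'
}
```

Running `mean_phred` on a forward file and on its reverse partner shows whether the reverse reads are markedly worse. A single mean hides position-specific drops in quality, so a per-cycle quality plot is still worth checking.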