USDA-ARS-GBRU / itsxpress

Software to trim the ITS region of FASTQ sequences for amplicon sequencing analysis

ITSxpress detecting 3 ITS2 loci, same file ITSx detects 200000 ITS2 loci #18

Closed Andreas-Bio closed 2 months ago

Andreas-Bio commented 4 years ago

Input file: combined_seq_412.fastq.gz

Running ITSxpress with

/home/ubuntu/miniconda3/bin/itsxpress --fastq /home/ubuntu/combined_seq_412.fastq.gz --single_end --outfile /home/ubuntu/ITSxpress/combined_seq_ITS2_T_412.fastq.gz --region ITS2 --taxa Tracheophyta --cluster_id 1 --threads 10

produces this result: combined_seq_ITS2_T_412.fastq.gz

but running ITSx with

/home/ubuntu/ITSxpress/ITSx_1.1.2/ITSx -i /home/ubuntu/combined_seq_412.fasta -o /home/ubuntu/ITSxpress/412 --save_regions ITS2 --minlen 60 --not_found F --graphical F --cpu 28 --complement F -t T --reset T

produces this result: ITSx412.tar.gz

arivers commented 4 years ago

I'm not sure of the issue offhand. Could you include the log file that was generated, along with the version number (or the commit hash if you pulled from source)? If you have paired-end data you could try that too; most users use paired-end data, so that mode has been better field-tested.

Andreas-Bio commented 4 years ago

The .log does not show anything suspicious. ITSxpress.log

It works with other samples, so it is not a general issue. I haven't found out yet why it sometimes fails and sometimes gives perfect results. I checked the sequences and the labels, and they seem to be okay. My version is 1.8.0, installed via conda. The .hmm files have the same byte size as the .hmm files in the ITSx directory.

arivers commented 4 years ago

It may be a parsing issue. I see you linked combined_seq_412.fastq.gz, so I can run it and take a look tomorrow.

arivers commented 4 years ago

Okay, I ran it and looked at the results. ITSxpress is designed to trim complete ITS regions (ITS1, ITS2, or the complete ITS region containing ITS1, the 5.8S, and ITS2). It requires that both the beginning and the end be present to output the ITS. HMMER locates the edges of the ITS2 by matching 20-30 bp of the flanking 5.8S or LSU. In this library, no ends of the ITS2 at the junction with the LSU were detected.

So this isn't really an error but a result of the reads being too short to detect an end to the ITS2. I would guess ITSx has a mode that only trims the front of the ITS reads. The main use case for ITSxpress has been as input for calling amplicon sequence variants with DADA2, which requires that the full amplicon be present for accurate variant calling.
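To make the "both ends required" rule concrete, here is a minimal Python sketch. It is not the ITSxpress source: the function name `trim_its2` and its arguments are hypothetical, and it assumes the boundary positions have already been obtained from the HMM search (with `None` meaning that no boundary was detected).

```python
from typing import Optional


def trim_its2(seq: str, start: Optional[int], end: Optional[int]) -> Optional[str]:
    """Return the ITS2 slice only when both boundaries were detected.

    start: 0-based index of the first ITS2 base (just after the 5.8S match),
           or None if the 5.8S edge was not found.
    end:   0-based index one past the last ITS2 base (where the LSU match
           begins), or None if the LSU edge was not found.
    """
    # A read with a missing boundary is dropped rather than partially trimmed.
    if start is None or end is None:
        return None
    # Reject coordinates that are empty, reversed, or outside the read.
    if not 0 <= start < end <= len(seq):
        return None
    return seq[start:end]


# Hypothetical toy reads, not real ITS sequence.
full_read = trim_its2("AAAACCCCGGGG", 4, 8)     # both edges found
no_lsu_edge = trim_its2("AAAACCCC", 4, None)    # LSU edge missing
```

Under this rule, a library in which no read carries a detectable LSU edge yields almost no output, which matches what was seen here.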

Andreas-Bio commented 4 years ago

Thank you for your time! If it was designed this way, ITSxpress is not compatible with all primers. There are no LSU edges detected because the primers from https://www.nature.com/articles/s41598-018-26648-2 are very close to ITS2, leaving no LSU overhang after primer removal. (Removing primers is commonly the first step in amplicon studies.) I believe this should be communicated more clearly, as it is definitely a very important difference from ITSx. An easy fix would be to add the primers back after quality filtering, giving them a perfect score (as they are removed by ITSxpress anyway). I am sorry I cannot be more positive; your support was outstanding.
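The "add the primers back" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper `append_primer` is hypothetical, it assumes a single non-degenerate primer sequence already oriented as it would appear at the 3' end of the read, and it assumes Phred+33 quality encoding, where `I` corresponds to Q40. The sequences shown are placeholders, not the primers from the linked paper.

```python
from typing import Tuple


def append_primer(seq: str, qual: str, primer: str,
                  qual_char: str = "I") -> Tuple[str, str]:
    """Append a primer to the 3' end of a read and pad its quality string.

    qual_char is the quality character given to every restored primer base;
    "I" is Q40 under Phred+33 encoding (an assumption about the input file).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(qual_char) != 1:
        raise ValueError("qual_char must be a single character")
    return seq + primer, qual + qual_char * len(primer)


# Placeholder read and primer, for illustration only.
new_seq, new_qual = append_primer("ACGTAC", "FFFFFF", "GGTT")
```

The restored bases would only serve as an anchor for boundary detection, since they are trimmed off again with the rest of the flanking region.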

arivers commented 4 years ago

Interesting. There's no better way to learn about all the unique ways that universal primer sets are configured than to write a tool to trim them. I'll think about how to support this. I could add a flag to ignore end trimming, with a warning not to do it normally. The reads in your example do not contain the primer sequence, and the sequence is degenerate, so stitching it back on seems pretty tricky. I'd have to think about how to handle it on primer sets with '--reversed-primers'. Do you have input on the change?

Andreas-Bio commented 4 years ago

If I understood correctly, ITSxpress extracts only whole ITS regions. I am not sure how you enforce that rule, but ITSx has a flag called --only_full {T or F} that could be used for that. Maybe it would be possible to implement this flag in ITSxpress with minimal effort, because the two tools seem to be very similar? If the flag is TRUE by default, the behaviour of ITSxpress would not change for long-term users. On the other hand, if the primer is too close to ITS2, or if you sequence a very noisy amplicon (high GC) and partial sequences are desired, the user could set the flag to FALSE.

arivers commented 2 months ago

I ran this today and confirmed that ITSxpress v2.1.1 returns the same results. At this point I don't have plans to add support for primer sets that partially overlap the ITS region and partially overlap the conserved SSU or LSU. I'll try to monitor the need, though, and may put it on the roadmap in the future.