split_libraries_fastq.py -q, -r, and -p option documentation unclear

jairideout commented 9 years ago

From this forum post, I think the option documentation for split_libraries_fastq.py's -q/--phred_quality_threshold, -r/--max_bad_run_length, and -p/--min_per_read_length_fraction could be improved for clarity:

-q/--phred_quality_threshold: current docs say it is the maximum unacceptable Phred quality score (e.g., for Q20 and better, specify -q 19). It isn't clear how this value is actually used (it's only used with -r/--max_bad_run_length).
-r/--max_bad_run_length: current docs say it is the max number of consecutive low quality base calls allowed before truncating a read. It isn't clear that -q defines "low quality".
-p/--min_per_read_length_fraction: current docs say it is the min number of consecutive high quality base calls to include a read (per single end read) as a fraction of the input read length. It isn't clear what defines "high quality" for this option, and I think the current description may be incorrect because the relevant code shows that truncation happens via -q and -r first, then the trimmed read length is checked against -p. If the read contains "low quality" bases that aren't trimmed by -r these would still exist in the sequence, so it isn't guaranteed to have some minimum number of consecutive high quality base calls.

@gregcaporaso @walterst does my interpretation of these options make sense?

walterst commented 9 years ago

Jai, I agree with your assessment.

Perhaps these parameters could be updated along these lines:

-q PHRED_QUALITY_THRESHOLD, --phred_quality_threshold=PHRED_QUALITY_THRESHOLD the maximum unacceptable Phred quality score (e.g., for Q20 and better, specify -q 19). Utilized with the -r/--max_bad_run_length parameter for truncation of low quality bases. [default: 3]

-r MAX_BAD_RUN_LENGTH, --max_bad_run_length=MAX_BAD_RUN_LENGTH max number of consecutive low quality base calls, as specified with -q/--phred_quality_threshold, allowed before truncating a read [default: 3]

-p MIN_PER_READ_LENGTH_FRACTION, --min_per_read_length_fraction=MIN_PER_READ_LENGTH_FRACTION min length of sequence required to retain read after quality truncation via -r/--max_bad_run_length parameter. Value is a fraction of the input read length [default: 0.75]

alk224 commented 9 years ago

As a sequencer operator, I strongly disagree with a default setting of 3 for -q. I know this is the value in Bokulich 2013 which is an important consideration, but given that phred scale is logarithmic (q10 is 10% error, q20 is 1% error, q30 0.1% error), and most use at least q20 for genomic work, I suggest a default of q20.

Usually when the quality takes a dump during a run you see a bunch of base calls at q15 or less meaning that a default at q3 will retain a lot of poor quality bases including completely bogus reads (which can happen when sequencing a dirty pool).

gregcaporaso commented 9 years ago

I agree with the comments about the help text. I think @alk224 may be right that we want to update that default at some point, but that is a separate issue from what is being discussed here.

jairideout commented 9 years ago

Thanks @walterst, these new descriptions are much clearer!

The only thing unclear to me is Value is a fraction of the input read length. The code only considers the first read in the file to determine what the actual minimum length is (since -p is relative). It'd be nice to describe this behavior since the current wording implies the minimum length is determined from each read before it is truncated.

I'm not sure what the desired behavior should be -- it seems safer/better to compute the minimum length on a per-read basis (so compare the truncated sequence's length to its original length). Was there a specific reason for the current behavior?

alk224 commented 9 years ago

Yes, that was slightly off topic, but I am extending an email conversation that went to the forum and then was brought here. Thanks for the description updates.

The minimum length calculation absolutely should be per read. I am pretty surprised to learn that's not how it has been which may explain some anomalous split_libraries output from length polymorphic loci such as ITS (bias against longer fragments).

walterst commented 9 years ago

Hello Jai,

I think the original design of calculating the sequence fraction length only once because reads should all have the same number of nucleotides when just looking at a single-end read.

With stitched reads, the lengths can vary (although one would hope it wouldn't vary wildly), so it could be worth calculating on a per-read basis.

jairideout commented 9 years ago

Thanks @walterst and @alk224! @gregcaporaso do you agree with this change? (calculating the read length on a per-read basis)

gregcaporaso commented 9 years ago

Yep, I agree - the reason for it not being on a per-read basis was b/c this filtering was written before we were joined paired end reads.

jairideout commented 9 years ago

Sounds good!

alk224 commented 9 years ago

Perfect

ETaSky commented 7 years ago

Hello, I was wondering if the behaviour of -p has been adjusted according to these comments in QIIME-1.9.1? I am having some weird results (filtering out a lot of sequences of being too short). Thanks! Jincheng

alk224 commented 7 years ago

It has not. Someone in the QIIME development team can speak to how to address this issue in QIIME 2 which will be released soon.

Impulsively, it sounds like your read quality might not be great across the length of the read. You should run fastqc on your raw fastq data to check it. If necessary, you can use fastx_trimmer (see fastx toolkit website for command usage) to remove junk data at the 3' end.

-- Andrew Krohn PhD Candidate, NAU Biological Sciences Core Facility Manager, EnGGen

On Mon, May 22, 2017 at 1:04 PM, ETaSky notifications@github.com wrote:

Hello, I was wondering if the behaviour of -p has been adjusted according to these comments in QIIME-1.9.1? I am having some weird results (filtering out a lot of sequences of being too short). Thanks! Jincheng

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/qiime/issues/2038#issuecomment-303205240, or mute the thread https://github.com/notifications/unsubscribe-auth/AEDNZnEN2TRwni_ZbkTXDwIdE4TwBg-Dks5r8epKgaJpZM4EzT5W .

jairideout commented 7 years ago

@ETaSky the behavior in QIIME 1.9.1 is what is described here -- I created the issue to clarify how the parameters are actually working (and how they interact with each other).

QIIME 2 is available as alpha software, see this page for determining when to start using QIIME 2. Check out the docs for getting started with QIIME 2, installing, tutorials, etc. Filtering similar to split_libraries_fastq.py is available in the q2-quality-filter plugin -- there's some usage examples in this section of the Moving Pictures tutorial. If you need help with any of the QIIME 2 software, please post to the QIIME 2 forum.

If you need further support with the quality filtering in split_libraries_fastq.py, can you please post your question on the QIIME 1 forum?

ETaSky commented 7 years ago

Thanks, @alk224 and @jairideout! Regarding the read quality, I actually have checked a test sample, after the merging, 99% of the reads have Q20 or higher. So it is not about quality. The problem I identified is the sequence length. A large portion of the reads are around 168 bps (V4 region, which should be 253), only a few portion are around right length. But since 253 bp may be picked by QIIME to gauge the sequence length, a lot of reads are being filtered out. Thanks!

alk224 commented 7 years ago

QIIME will choose 253 only if the first read in the multiplexed fastq is this long.

Curious about the shorter reads. This may have happened if you had small PCR artifact during amplification. Otherwise maybe the merge settings? Personally I like fastq-join but I think they all work about the same. You might try a different merge program and see if the results are equivalent.

-- Andrew Krohn PhD Candidate, NAU Biological Sciences Core Facility Manager, EnGGen

On Mon, May 22, 2017 at 2:56 PM, ETaSky notifications@github.com wrote:

Thanks, @alk224 https://github.com/alk224 and @jairideout https://github.com/jairideout! Regarding the read quality, I actually have checked a test sample, after the merging, 99% of the reads have Q20 or higher. So it is not about quality. The problem I identified is the sequence length. A large portion of the reads are around 168 bps (V4 region, which should be 253), only a few portion are around right length. But since 253 bp may be picked by QIIME to gauge the sequence length, a lot of reads are being filtered out. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/qiime/issues/2038#issuecomment-303230873, or mute the thread https://github.com/notifications/unsubscribe-auth/AEDNZubAWXdrOlKhWlk3UsiIcfHkqEQYks5r8gSFgaJpZM4EzT5W .

walterst commented 7 years ago

Are these reads stitched? If so, do you get the same number of discarded reads if you use the unstitched reads 1 as input instead?

On Tue, May 23, 2017 at 12:01 AM, alk224 notifications@github.com wrote:

QIIME will choose 253 only if the first read in the multiplexed fastq is this long.

Curious about the shorter reads. This may have happened if you had small PCR artifact during amplification. Otherwise maybe the merge settings? Personally I like fastq-join but I think they all work about the same. You might try a different merge program and see if the results are equivalent.

-- Andrew Krohn PhD Candidate, NAU Biological Sciences Core Facility Manager, EnGGen

On Mon, May 22, 2017 at 2:56 PM, ETaSky notifications@github.com wrote:

Thanks, @alk224 https://github.com/alk224 and @jairideout https://github.com/jairideout! Regarding the read quality, I actually have checked a test sample, after the merging, 99% of the reads have Q20 or higher. So it is not about quality. The problem I identified is the sequence length. A large portion of the reads are around 168 bps (V4 region, which should be 253), only a few portion are around right length. But since 253 bp may be picked by QIIME to gauge the sequence length, a lot of reads are being filtered out. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/qiime/issues/2038#issuecomment-303230873, or mute the thread https://github.com/notifications/unsubscribe-auth/ AEDNZubAWXdrOlKhWlk3UsiIcfHkqEQYks5r8gSFgaJpZM4EzT5W .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/qiime/issues/2038#issuecomment-303231887, or mute the thread https://github.com/notifications/unsubscribe-auth/ACbkNLN_VIfessL67O-3bRGdOljrKC4wks5r8gWggaJpZM4EzT5W .

-- Postdoctoral Researcher Ley Lab Department of Microbiome Science Max Planck Institute for Developmental Biology 72076 Tübingen, Germany For more information visit the Ley Lab Website: www.leylab.com

ETaSky commented 7 years ago

@alk224 Thanks! Yes, the first few sequences in the file seem to have a length of 253 bp. I suspect I had small PCR artefacts. I was not the one who did the library preparation, just got assigned to do the data analysis. I used fastq-join in QIIME. But I tried BBMerge and FLASH, the results are similar.

@walterst The Paired-ends have already been merged.

alk224 commented 7 years ago

Echo Tony's suggestion to check read 1 without merging the reads. That is excellent advice.

-- Andrew Krohn PhD Candidate, NAU Biological Sciences Core Facility Manager, EnGGen

On Tue, May 23, 2017 at 12:06 PM, ETaSky notifications@github.com wrote:

@alk224 https://github.com/alk224 Thanks! Yes, the first few sequences in the file seem to have a length of 253 bp. I suspect I had small PCR artefacts. I was not the one who did the library preparation, just got assigned to do the data analysis. I used fastq-join in QIIME. But I tried BBMerge and FLASH, the results are similar.

@walterst https://github.com/walterst The Paired-ends have already been merged.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/qiime/issues/2038#issuecomment-303500805, or mute the thread https://github.com/notifications/unsubscribe-auth/AEDNZiu5FgdPsHOdWp-yj0g1zNW9T7GGks5r8y5JgaJpZM4EzT5W .

ETaSky commented 7 years ago

@alk224 Thanks! I checked by using just Read 1. But the results are really hard to tell. This dataset has around 4 million reads, and about 1 million were filtered out for being too short, which is about 800K reads less than filtering on the merged reads. But to be honest, I don't quite follow why dosing this test would tell anything.

alk224 commented 7 years ago

There is some issue with your merge. Possibly due to quality, I would guess in read 2. This thread isn't really for helping you through it so this will be my last post, but you should check the read quality with fastqc. If your read 2 is rather trashy, you might not want to use it. Also run fastqc on the merged reads and see where the quality starts to tank. Use fastx_trimmer to lop your reads before the quality dives to retain the most reads.

Check out Schirmer et al 2015 Nucleic Acids Research for some data on the order of read filtering and merging and possible denoising.

-- Andrew Krohn PhD Candidate, NAU Biological Sciences Core Facility Manager, EnGGen

On Tue, May 23, 2017 at 1:59 PM, ETaSky notifications@github.com wrote:

@alk224 https://github.com/alk224 Thanks! I checked by using just Read

But the results are really hard to tell. This dataset has around 4 million reads, and about 1 million were filtered out for being too short, which is about 800K reads less than filtering on the merged reads. But to be honest, I don't quite follow why dosing this test would tell anything.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/qiime/issues/2038#issuecomment-303530602, or mute the thread https://github.com/notifications/unsubscribe-auth/AEDNZpB-rGRQAo-62rw687TdurwxKxQtks5r80jPgaJpZM4EzT5W .

ETaSky commented 7 years ago

@alk224 Thanks! And sorry for off the topic unintentionally. Please delete my irrelevant post, if necessary.

biocore / qiime

split_libraries_fastq.py -q, -r, and -p option documentation unclear #2038