Proovread output, percentage of corrected bases

TobiasJu commented 6 years ago

Hello Thomas,

is there a way to see the percentage of corrected bases in the .untrimmed.fq file? Or how do i know which bases are corrected from proovread?

I read your explanation about the Output and searched the Issues, like this one. Is there an easier way, than to map all the raw PacBio reads to the corrected (untrimmed) reads and count the difference?

thanks

thackl commented 6 years ago

Hi Tobias,

there are two quite simple ways to find out which bases/regions have been corrected: 1. Uncorrected stuff that hasn't even seen a single Illumina mapped onto it is lower-case. Everything with at least a coverage of 1 in short reads is uppercase. So just comparing uppercase_bp/total_bp gives you a quick estimate sorry, my mistake - this doesn't true in current versions

By quality: The fastq quality line primarily relates to the coverage the respective base had in the Illumina data. It depends a little on you consider "corrected", but essentially everything with a phred score higher than what you usually get for pacbio has been corrected. Have a look at https://en.wikipedia.org/wiki/Phred_quality_score for more info on phred scores. PacBios usually have phred scores <10 (<90 accuracy ). proovread assignes phred conservatively, and roughly like this: 7 for coverage == 1, 10 for cov==2 (90% accuracy), ..., 20 for cov==8 (>99%), 30 for cov==18 (99.9%), ... There are a few other things that impact quality, i.e. SNP sites where coverage for different alleles is lower than the surrounding average, but coverage is the primary driver.

The only issues to keep in mind with these metrics of % correced is that you can't really directly relate the number of bases in the raw reads to the corrected reads, because correction of pacbio reads often means deleting/inserting bases, and thus changing the overall length of the read. Simple example, if you were to correct every single base in a raw read with 1000 bp, the read would on average shrink by ~%5 in length, because PacBio has 10% insertions and 5% deletions, so the corrected reads would only be 950bp in length. Hence, a naive estimate of corrected_bp/raw_bp would give you a %-corrected of 95% (950bp/1000bp), which would be quite wrong in this example.

wangdepin commented 6 years ago

I also want to know the position of corrected fragments in reads and the percentage of corrected bases in correction_PacBio.trimmed.fa. In my understand about correction_PacBio.trimmed.fa file, for example, the header "SUBSTR:0,2134" include the position(0,2134) and the length(2034) of corrected part. Is it true? If false, how to get the position and percentage of corrected part? Besides, correction_PacBio.trimmed.fa I got didn't contain the lower-case bases, just uppercase_bp. Thank you so much!

thackl commented 6 years ago

Sorry, I was wrong about the lower-case option - I corrected it in the comment above.

Regarding "corrected bases in .trimmed.fa." - how exactly are you defining "corrected base" - only bases that have been modified, or everything that had Illumina reads mapped to it. If the latter, then 100% of the bases in .trimmed.fa are corrected.

Regarding the position of the fragments: I'm assuming you are interested in the position relative to the uncorrected read - that, unfortunately, is not as trivial as one might think. Mostly, because the during the correction bases not only get modified, they also get inserted and deleted, hence relative positions of individual bases slightly change between the uncorrected and corrected read. I've tried to explain it here in other threads before: https://github.com/BioInf-Wuerzburg/proovread/issues/71 and https://github.com/BioInf-Wuerzburg/proovread/issues/115. In general, you best option for getting the location of corrected fragment in the uncorrected reads, ist to align the corrected fragments back to the uncorrected read. Let me know if you need more information than that.

wangdepin commented 6 years ago

Thank you for your reply! I undeerstand mostly. I define "corrected base" as bases that have been modified compared with uncorrected bases. My intent is to get the all position of correted fragment in RAW reads,ie the location of erronrous fragment. Regarding "corrected bases in .trimmed.fa",I have already known the meanings of "SUBSTR:0,548"indicates the coordinates of the first cutting operation in perl-style substr notation: start (0-offset), length from #115. However, I have a little question that why the start position of piece ist in turn, just like "c1012_f1p4_3879.1 SUBSTR:2078,1717 and c1012_f1p4_3879.2 SUBSTR:24,1715" . In my opnion, "SUBSTR:24,1715" should be in "c1012_f1p4_3879.1" dut to starting at 24, while "SUBSTR:2078,1717" in "c1012_f1p4_3879.2" due to starting at 2078. Is it Ture? thank you so much!

thackl commented 6 years ago

If I remember correctly, the pieces are numbered starting from the longest. In the above case, the second piece is longer.

BioInf-Wuerzburg / proovread

Proovread output, percentage of corrected bases #134