Closed TobiasJu closed 5 years ago
Hi Tobias,
there are two quite simple ways to find out which bases/regions have been corrected:
1. Uncorrected stuff that hasn't even seen a single Illumina mapped onto it is lower-case. Everything with at least a coverage of 1 in short reads is uppercase. So just comparing uppercase_bp/total_bp gives you a quick estimate sorry, my mistake - this doesn't true in current versions
The only issues to keep in mind with these metrics of % correced is that you can't really directly relate the number of bases in the raw reads to the corrected reads, because correction of pacbio reads often means deleting/inserting bases, and thus changing the overall length of the read. Simple example, if you were to correct every single base in a raw read with 1000 bp, the read would on average shrink by ~%5 in length, because PacBio has 10% insertions and 5% deletions, so the corrected reads would only be 950bp in length. Hence, a naive estimate of corrected_bp/raw_bp would give you a %-corrected of 95% (950bp/1000bp), which would be quite wrong in this example.
I also want to know the position of corrected fragments in reads and the percentage of corrected bases in correction_PacBio.trimmed.fa. In my understand about correction_PacBio.trimmed.fa file, for example, the header "SUBSTR:0,2134" include the position(0,2134) and the length(2034) of corrected part. Is it true? If false, how to get the position and percentage of corrected part? Besides, correction_PacBio.trimmed.fa I got didn't contain the lower-case bases, just uppercase_bp. Thank you so much!
Sorry, I was wrong about the lower-case option - I corrected it in the comment above.
Regarding "corrected bases in .trimmed.fa." - how exactly are you defining "corrected base" - only bases that have been modified, or everything that had Illumina reads mapped to it. If the latter, then 100% of the bases in .trimmed.fa are corrected.
Regarding the position of the fragments: I'm assuming you are interested in the position relative to the uncorrected read - that, unfortunately, is not as trivial as one might think. Mostly, because the during the correction bases not only get modified, they also get inserted and deleted, hence relative positions of individual bases slightly change between the uncorrected and corrected read. I've tried to explain it here in other threads before: https://github.com/BioInf-Wuerzburg/proovread/issues/71 and https://github.com/BioInf-Wuerzburg/proovread/issues/115. In general, you best option for getting the location of corrected fragment in the uncorrected reads, ist to align the corrected fragments back to the uncorrected read. Let me know if you need more information than that.
Thank you for your reply! I undeerstand mostly. I define "corrected base" as bases that have been modified compared with uncorrected bases. My intent is to get the all position of correted fragment in RAW reads,ie the location of erronrous fragment. Regarding "corrected bases in .trimmed.fa",I have already known the meanings of "SUBSTR:0,548"indicates the coordinates of the first cutting operation in perl-style substr notation: start (0-offset), length from #115. However, I have a little question that why the start position of piece ist in turn, just like "c1012_f1p4_3879.1 SUBSTR:2078,1717 and c1012_f1p4_3879.2 SUBSTR:24,1715" . In my opnion, "SUBSTR:24,1715" should be in "c1012_f1p4_3879.1" dut to starting at 24, while "SUBSTR:2078,1717" in "c1012_f1p4_3879.2" due to starting at 2078. Is it Ture? thank you so much!
If I remember correctly, the pieces are numbered starting from the longest. In the above case, the second piece is longer.
Hello Thomas,
is there a way to see the percentage of corrected bases in the .untrimmed.fq file? Or how do i know which bases are corrected from proovread?
I read your explanation about the Output and searched the Issues, like this one. Is there an easier way, than to map all the raw PacBio reads to the corrected (untrimmed) reads and count the difference?
thanks