andrewjpage / krocus

Predict MLST directly from uncorrected long reads
GNU General Public License v3.0
25 stars 6 forks source link

Question - how to correctly interpret the output file? #11

Closed BCArg closed 4 years ago

BCArg commented 5 years ago

Hi Andrew,

from what I understood in the methods section of your paper, krocus uses an iterative process to identify the alleles from k-mers from the raw reads. I am now struggling a bit to understand what each line in the output file means.

While following the generation of the output file with tail -f <out.txt> I noticed that all of the alleles had a partial k-mer coverage (marked with a star*) by the beggining of the analysis and also that the percentage of k-mers covered by reads was always going up. Seeing this as an iterative process, where more and more k-mers from the reads are matched against the allele k-mers this makes sense. I noticed this for my first analyses, so that I was always looking at the tail of the output file in order to check which allele was identified.

Then today I noticed that for a sample that we sequenced

  1. the initial pattern was observed i.e. increasing percentage of k-mers covered by the reads (column 2) and partial match for all the alleles (marked with a star*)
  2. At one point in the output file I got the correct ST identified. Below is an example of one line, say, in the middle of my output file 678 100.00 recA(7) fumC(6) gyrB(5) mdh(9) purA(7) adk(6) icd(136)

However, by the end of the file, I noticed that another allele of purA was not identified with 100% k-mer coverage, shown below; ND 100.00 recA(7) fumC(6) gyrB(5) mdh(9) purA(489) adk(6) icd(136)

I then changed the kmer parameter, varying from 7 to 17, but the pattern described above (no match/ match/ no match) was observed for all kmer tried.

I was now wondering what would be the correct way of interpreting the output files.

  1. What would be the correct way of interpreting the output file? Can I say that my ST is indeed 678 even though an 'ND' is shown at the last lines of the file?
  2. What would be the reason for the 'transition' correct ST (middle of the file) -> ND (end of the file)

Thanks in advance for reading this long question

andrewjpage commented 5 years ago

Thank you for your question. From your description it sounds like reads further down the file are causing this behaviour. The best strategy would be to map all your reads to a close reference, then look for purA and eyeball the pileup (particularly at the start of the gene) to see if you have a minority variant in there, or some other structural variant.