WGS-TB / SplitStrains


Interpreting the output #14

Open jemunro opened 2 years ago

jemunro commented 2 years ago

Hello,

Are you able to give some guidance on how to interpret the output? For example:

INFO:splitStrain.py has started.
INFO:sample name: SAMEA1100847.ERR2509676.recal.bam
INFO:reference name: Chromosome, reference length: 4411532
INFO:regionStart: 100, regionEnd: 4000000
INFO:depth threshold percent: 75
INFO:entropy threshold: 0.0
INFO:using gff: tuberculosis.filtered-intervals.gff
INFO:Likelihood Ratio Statistic: -2*log(LR) = 12495, treshold: 1920
INFO:using the model:GMM
file    alpha   min_LR_thresh   LR_statistic    log-p-value     p-value proportions
SAMEA1100847.ERR2509676.recal.bam       0.05    1920    12495   -14.367 0.000   0.83 0.17

How should I interpret this? I note that the reported p-value is 0. Does this mean that multiple strains are confidently detected?
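To make the question concrete, here is a minimal sketch of one way to read the summary row above. It is not taken from SplitStrains itself. The helper name `parse_summary` is hypothetical, the columns are assumed to be whitespace-separated with `proportions` holding all trailing values, and the base of the logarithm in `log-p-value` is unknown, so both the natural and the base-10 reading are computed.

```python
import math

# Header and row copied verbatim from the output above.
HEADER = "file    alpha   min_LR_thresh   LR_statistic    log-p-value     p-value proportions"
ROW = "SAMEA1100847.ERR2509676.recal.bam       0.05    1920    12495   -14.367 0.000   0.83 0.17"


def parse_summary(header: str, row: str) -> dict:
    """Hypothetical helper: map column names to values.

    Assumes whitespace-separated columns and that the last column
    ('proportions') collects every remaining value on the row.
    """
    names = header.split()
    values = row.split()
    n_fixed = len(names) - 1  # every column except 'proportions'
    record = dict(zip(names[:n_fixed], values[:n_fixed]))
    for key in names[1:n_fixed]:  # all fixed columns except 'file' are numeric
        record[key] = float(record[key])
    record[names[-1]] = [float(v) for v in values[n_fixed:]]
    return record


summary = parse_summary(HEADER, ROW)

# Does the statistic exceed the threshold printed alongside it?
exceeds = summary["LR_statistic"] > summary["min_LR_thresh"]

# The base of 'log-p-value' is an assumption, so try both readings.
p_if_natural = math.exp(summary["log-p-value"])
p_if_base10 = 10 ** summary["log-p-value"]

print("LR_statistic exceeds min_LR_thresh:", exceeds)
print("p-value if natural log:", p_if_natural)
print("p-value if base-10 log:", p_if_base10)
print("proportions:", summary["proportions"])
```

Under either reading of the logarithm, the recovered p-value is small but nonzero, so the `0.000` in the `p-value` column looks like rounding to three decimals rather than an exact zero.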

In the manuscript 10.1099/mgen.0.000607 it is mentioned that the ROC curves are generated using the likelihood ratio. Is that equivalent to the LR_statistic above? Is there a recommended threshold for the LR_statistic to discriminate between pure and mixed infections?

Thank you.