Can the GC score calculation be improved?

chris-rands commented 4 years ago

Phigaro currently uses the GC score in a way that prophages are more likely to be GC-rich (i.e. genes have high GC content) than GC-poor. Although the benchmarking suggests this is a good approach it is puzzling biologically because we don't expect phages/prophages to have high GC content generally (indeed apparently this is not even the case in the training data). Instead, we might expect that the GC content of a prophage is different from that of the surrounding bacterial region. In this case, looking at the difference between the bacterial and prophage GC content (i.e. the GC skew) would be expected to be more useful than the absolute GC content.

A possible explanation for this puzzle is that, I think, for genes without significant pVOG hits or genes with no pVOG hits at all the GC content is set to 0. I'm basing this understanding on the hmmer_res_to_gc function here in data.py. This might be downwardly biasing the GC content for non-phage genes (i.e. genes without significant pVOG hits) by artificially assigning these genes a score of 0 rather than their actual GC content. This might explain why Phigaro tends to think prophages have high GC content. Note that even for genes without a pVOG hit the GC content is still calculated by Prodigal but these values are never used I think because the GC content is currently only read from the HMMER3 output.

If my understanding is correct (it might be wrong of course!) then my suggestion is for genes without significant pVOG hits the actual GC content of the gene is used rather than a 0 value. Of course this will impact in at least a minor way the resulting phage predictions. Also, this might point towards using a GC skew rather than the absolute GC content, but that remains to be seen. As I said before, I am not confident about this, so please do correct me if I have misunderstood, thank you

chris-rands commented 4 years ago

Nice to see the new modes in the new releases, well done! Did my points above make sense? Did you consider using the GC content from the Prodigal output when it can't be read from the HMMER3 output?

PollyTikhonova commented 4 years ago

Hello, Firstly, I wanted to say that yes, you understand correctly that we use "filtered" GC content (only from pVOG genes) and yes, this is the actual reason of the phenomenon that the final GC score at prophage regions is higher. (Sorry, I did not answered you earlier, I thought Elizaveta wrote you about that)

Secondly, now, Phigaro has the "abs" mode that considers GC skew (in your meaning) instead of the abcolute values of GC content recovered from HMMER. But still, we use only "filtered" scores (that we get from hmmer) just because we mainly need them to sharpen and clarify the boundaries of the prophage regions. So, I thought that considering all GC scores could mess up the principal peaks which come from the phage score, that's why I used only filtered scores.

bobeobibo / phigaro

Can the GC score calculation be improved? #12