adamhockenberry / bacphlip

A python library for predicting phage lifestyle based on genome sequence
MIT License
35 stars 7 forks source link

Error when predicting phage contigs #8

Open KennthShang opened 2 years ago

KennthShang commented 2 years ago

Hi, thanks for providing such a good tool.

I noticed that in the Guidelines, bacphlip can predict only complete phage. However, as you know, in the metagenomic data, the assembled contigs maybe not such complete. Is there any possible way to run the tool on short contigs?

When I am running my data, I found there are some frames (some intermediate files I suppose) are empty files, and it will cause a problem when running hmmsearch.

Error: Sequence file temp_100_400.fna.BACPHLIP_DIR/r1.1.6frame is empty or misformatted

Is there a potential way to solve these problems? Thanks a lot.

adamhockenberry commented 2 years ago

Sorry for the long delay, and glad the tool is (mostly!) working for you. So I could likely provide a way to terminate the program more cleanly in these cases / skip the file, but I'm not sure if this is actually ideal behavior to implement.

For starters, if the file is truly empty then the contig must be veryyy short. Those ".6frame" translation files should find a lot of proteins even in a fairly short sequence, so if they are coming up empty I'd be very hesitant to trust any results from the program. My top suggestion for you just to get the program to run would be to simply remove (or move) any files from your directory that are below some length threshold.

All that said, even for longer contigs, the main issue with using this tool (or any similar tool) on incomplete data is that the "absence of evidence is not evidence of absence". At its core, this program looks for specific proteins and assigns genomes according to whether or not it finds them. The default assumption is that all phages analyzed are assumed to be lytic/virulent, but if certain lysogeny-associated proteins are found then the program predicts lysogenic/temperate. If a contig is 50% complete, the tool has no way of knowing whether the lysogeny-associated proteins-of-interest might actually be in the remaining 50% that we can't see, so any prediction should have very low confidence.

Of course, as completeness gets closer to 100% the odds of this scenario become less-and-less likely so I don't think the tool has no use at all on incomplete genomes, just that there is a big caveat that the lysogenic protein machinery might be hidden in the missing pieces of the genome so the more complete the better. And below some "completeness" threshold (not sure what this number is but 50% might be a good starting point), I would not recommend using BACPHLIP for classification at all. But my preference would probably be to set a completeness threshold closer to 80-90% to produce reliable results.