Open sgsutcliffe opened 3 years ago
I was thinking about this issue. Would it be worthwhile to use the entire genome as the host coverage predictor, rather than the scaffolds, when you have contiguous genomes? I know coverage can vary over the genome/scaffolds, but it would help with my previous problem. When possible, I think using your current approach is better though. :)
Hi,
I'm not sure how I missed this post. I just now saw it and I'm sorry about that!
Yes, it sounds like the host had 0 data points, so PropagAtE was trying to compare the prophage to nothing. I should add an exception to avoid the error. For your second post you make a good argument, but PropagAtE was created for metagenomic data in which the entire host genome cannot be accurately identified. Even when binning a MAG, there will likely be multiple scaffolds that are contamination and may alter the coverage results. It is more accurate to consider only the parent scaffold of the prophage. I will consider adding an option to specify an entire MAG, though I wouldn't count on that being implemented.
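For anyone hitting the same error, a guard along these lines might avoid the crash. This is only a rough sketch, not PropagAtE's actual code: `cov_h` and `avg_h` are the names from the traceback in this issue, while `cov_p` and `host_prophage_ratio` are hypothetical names made up for illustration.

```python
import statistics


def host_prophage_ratio(cov_p, cov_h):
    """Return the prophage:host mean coverage ratio, or None if it cannot be computed.

    cov_p: per-base coverage values for the prophage region (hypothetical name)
    cov_h: per-base coverage values for the flanking host region
    """
    # A prophage that spans its whole scaffold leaves no host positions,
    # so statistics.mean(cov_h) would raise StatisticsError.
    if not cov_p or not cov_h:
        return None

    avg_h = statistics.mean(cov_h)
    # Avoid dividing by zero when the host region has no mapped reads.
    if avg_h == 0:
        return None

    return statistics.mean(cov_p) / avg_h
```

Returning None lets the caller skip or flag the prophage rather than stop the whole run; raising a more descriptive exception would be an equally reasonable choice.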
Thanks for the reply!
Since making the post I've come around to the idea that relying on prophages without flanking host regions is bad, or at least risky. This paper highlights the issue of 'mis-binning' prophages when bacterial MAGs are closely related: https://doi.org/10.1038/s41587-020-0718-6
I agree. Great group to rely on who wrote that paper. I have a manuscript coming out in ~1 week regarding viral binning, including a couple points on binning prophages. Self promotion :) but also it may be of interest when available.
Honestly, really looking forward to reading it! I am always on the lookout for ideas for improving the binning of prophages in MAGs.
I will be submitting a paper soon where I went to a lot of effort to retrieve prophages from bins, most of them without host-flanking regions. Hopefully I can convince you and Simon Roux that my assumptions are correct. It was quite a bit of work, and even then there is always the haunting risk of false positives. I did my best to confirm they were true active prophages with viral sequencing, but I found only 72% of active prophages were found in viral samples (what I consider likely true active prophages). Even then I think it was due to the benefit of having longitudinal sequencing. Moving forward I think your approach is better!
Thanks again for putting out PropagAtE!
Steven Sutcliffe Maurice Lab McGill
On Dec 2, 2021, at 4:02 PM, Kris Kieft @.***> wrote:
At the statistical stage I am getting the error:
"statistics.StatisticsError: mean requires at least one data point"
It comes from line 758: avg_h = statistics.mean(cov_h)
I think it's because I am working with a dataset you probably hadn't expected. I've concatenated multiple prophages, and this leads to prophages that span the entire scaffold. So I guess that would have host-coverage of 0?
Is this the issue?
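A minimal reproduction of what I think is happening, as a sketch only. `split_coverage` and the coverage values are made up for illustration and are not taken from PropagAtE; the only assumption is that host coverage is whatever is left of the scaffold outside the prophage coordinates.

```python
import statistics


def split_coverage(scaffold_cov, start, end):
    """Split per-base scaffold coverage into prophage and host parts.

    Coordinates are 0-based with an exclusive end (an assumption for this sketch).
    """
    cov_p = scaffold_cov[start:end]
    cov_h = scaffold_cov[:start] + scaffold_cov[end:]
    return cov_p, cov_h


# Toy scaffold where the prophage covers every position.
scaffold_cov = [10, 12, 11, 9, 10]
cov_p, cov_h = split_coverage(scaffold_cov, 0, len(scaffold_cov))

raised = False
try:
    avg_h = statistics.mean(cov_h)  # cov_h is empty here
except statistics.StatisticsError:
    raised = True

print("host positions:", len(cov_h), "| StatisticsError raised:", raised)
```

With any flanking host sequence at all (for example start=1), `cov_h` is non-empty and the mean is computed normally.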