AnantharamanLab / PropagAtE

Prophage Activity Estimator
GNU General Public License v3.0
Processing PropagAtE output #11

Open asierFernandezP opened 1 year ago

asierFernandezP commented 1 year ago

Hi,

I am currently applying PropagAtE (default parameters) to gut metagenomic data and I am struggling to understand which filters to apply to the output .tsv files. I see that in the output .tsv files PropagAtE predicts a value (dormant/active) even for prophages with very low breadth of coverage. Should I use a certain cut-off based on the 'prophage_cov_breadth' column?

In supplementary Table S3B of the PropagAtE paper, I see that the values you report for this column are really high for the CRC or HeQ datasets, but they go down for other datasets. Still, you considered them as present in your analyses. Am I misunderstanding the meaning of this column? Would you recommend any kind of post-filtering of prophages after running PropagAtE with default parameters? I ask because a high number of potential prophage sequences are run against the sequencing reads of each metagenomic sample, so I expect only a few of them to be present in each.

Thank you!

KrisKieft commented 7 months ago

There is no direct answer for this. The reason you're seeing varying breadth of coverage is mainly the size of the input dataset. Datasets with more reads will naturally get higher breadth of coverage on average per sequence. The cutoff you choose may have to be modified depending on the size of your input dataset as well as what potential range of error you're willing to accept. The higher you set the cutoff, the more confidence there is in a prediction. The default is set rather low to be inclusive, since many phages end up being in low abundance.
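
For anyone who wants to try a stricter cutoff after the fact, here is a minimal sketch of post-filtering an output .tsv on the 'prophage_cov_breadth' column. It is not part of PropagAtE. The function names, the `prophage` column in the toy table, the example threshold of 0.5, and the assumption that breadth is stored as a fraction between 0 and 1 are all hypothetical, so check them against your own output files before relying on this.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated table into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_breadth(rows, min_breadth, column="prophage_cov_breadth"):
    """Keep only rows whose breadth of coverage is at least min_breadth.

    Rows whose breadth value cannot be parsed as a number are dropped.
    A missing column raises KeyError so a wrong column name is not silently ignored.
    """
    kept = []
    for row in rows:
        try:
            breadth = float(row[column])
        except ValueError:
            continue  # non-numeric entry, e.g. "NA"
        if breadth >= min_breadth:
            kept.append(row)
    return kept


# Toy table with made-up values; only the 'prophage_cov_breadth' column name
# comes from this thread, everything else is invented for illustration.
example = (
    "prophage\tprophage_cov_breadth\n"
    "phage_A\t0.95\n"
    "phage_B\t0.10\n"
    "phage_C\tNA\n"
)

rows = read_tsv(io.StringIO(example))
kept = filter_by_breadth(rows, min_breadth=0.5)  # 0.5 is an arbitrary example cutoff
print([row["prophage"] for row in kept])
```

To run this on a real file, open it with `open(path, newline="")` and pass the handle to `read_tsv`. Since the appropriate cutoff depends on the number of reads in the dataset, it may make sense to pick `min_breadth` per dataset rather than reuse one value everywhere.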