PacificBiosciences / pb-metagenomics-tools

Tools and pipelines tailored to using PacBio HiFi Reads for metagenomics
BSD 3-Clause Clear License

Clarification on the "JGI.filtered.depth.txt Output File" #81

Closed CaroleBelliardo closed 1 month ago

CaroleBelliardo commented 2 months ago

Hello Dportik,

I am currently using the HiFi-MAG-Pipeline from pb-metagenomics-tools and have several questions about the "JGI.filtered.depth.txt" output file.

What are the exact filtering parameters used to generate this file? Knowing the criteria applied during filtering would help me interpret the results correctly.

I am not sure I understand the last column of the file. Could you explain what this column represents and how its values are derived?

The filename suggests a connection with JGI (Joint Genome Institute). Could you explain how JGI relates to the processing of this file?

Thank you very much for your help and for this great work.

Carole

CaroleBelliardo commented 2 months ago

I want to ask a second question: do the proteins in the DAStool directory correspond to all the predictions made on all the contigs given from input or just some?

dportik commented 2 months ago

Hi @CaroleBelliardo, the script that produces the SAMPLE.JGI.depth.txt file comes from metabat2 (jgi_summarize_bam_contig_depths, read more here). The last column represents the variance of the depth of coverage across the contig listed in that row. This file summarizes the depth of coverage for all contigs and is a required input file for metabat2.
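For anyone who wants to inspect the depth file directly, here is a minimal Python sketch of how it could be read. It assumes the usual tab-separated layout written by jgi_summarize_bam_contig_depths (contigName, contigLen, totalAvgDepth, then one mean-depth column and one variance column per BAM file). The function name `read_depth_table` and the example values are made up for illustration and are not part of the pipeline.

```python
import csv
import io


def read_depth_table(handle):
    """Yield one dict per contig from a tab-separated depth table.

    Assumed layout: contigName, contigLen, totalAvgDepth, followed by
    (mean depth, variance) column pairs, one pair per BAM file.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = {
            "contig": row[0],
            "length": float(row[1]),
            "total_avg_depth": float(row[2]),
            "samples": {},
        }
        # Columns 3 onward come in pairs: <bam> and <bam>-var.
        for i in range(3, len(header) - 1, 2):
            record["samples"][header[i]] = {
                "mean_depth": float(row[i]),
                "variance": float(row[i + 1]),
            }
        yield record


# Invented example with a single BAM file; the last column is the variance.
example = (
    "contigName\tcontigLen\ttotalAvgDepth\tSAMPLE.bam\tSAMPLE.bam-var\n"
    "contig_1\t52000\t14.2\t14.2\t9.8\n"
)
records = list(read_depth_table(io.StringIO(example)))
```

With a single sample, the last column of each row ends up under `variance` for that sample.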

The SAMPLE.JGI.filtered.depth.txt file is the same file, but with any single-contig, complete bins removed. These complete bins are detected as part of the first step of the completeness-aware strategy. Because those complete bins are not included in the metabat2 step, having them in the depth file used for metabat2 causes an error. If there were no complete bins detected in the first stage, the two files are identical.
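To illustrate the idea, the filtering described above can be sketched in a few lines of Python. This is not the pipeline's actual script, only an approximation of the behavior: rows whose contig name belongs to a single-contig complete bin are dropped, and everything else is passed through unchanged. The function name `filter_depth_lines` and the contig names are hypothetical.

```python
def filter_depth_lines(lines, excluded_contigs):
    """Yield the header and every depth row whose contig is not excluded.

    `lines` is an iterable of tab-separated lines (header first);
    `excluded_contigs` is a set of contig names to drop.
    """
    iterator = iter(lines)
    yield next(iterator)  # keep the header line as-is
    for line in iterator:
        contig = line.split("\t", 1)[0]
        if contig not in excluded_contigs:
            yield line


# Invented example: contig_2 stands in for a single-contig complete bin.
depth_lines = [
    "contigName\tcontigLen\ttotalAvgDepth\tSAMPLE.bam\tSAMPLE.bam-var\n",
    "contig_1\t52000\t14.2\t14.2\t9.8\n",
    "contig_2\t3100000\t30.5\t30.5\t12.1\n",
    "contig_3\t8000\t5.0\t5.0\t2.3\n",
]
filtered = list(filter_depth_lines(depth_lines, {"contig_2"}))
unchanged = list(filter_depth_lines(depth_lines, set()))
```

When the set of excluded contigs is empty, the output equals the input, which matches the case where no complete bins were detected.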

> I want to ask a second question: do the proteins in the DAStool directory correspond to all the predictions made on all the contigs given from input or just some?

It should be predictions for all contigs included in the bins from semibin2 and metabat2. It will not include predictions on other contigs that were not included in these bins.