bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

Please include strain heterogeneity score from CheckM #286

Open franciscozorrilla opened 4 years ago

franciscozorrilla commented 4 years ago

Hi, I think it is a shame that the strain heterogeneity scores from the temporary CheckM results are not carried over into the final statistics file for the refined/reassembled bins. Perhaps they can be found somewhere in the temporary/intermediate result files? I had to delete most of these due to size and file-number limitations. In any case, to avoid storage issues or re-running CheckM, it could be a good idea to include the CheckM strain heterogeneity score in the final statistics files for the refinement/reassembly modules.
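For anyone who still has a CheckM result table on disk, here is a minimal sketch of pulling the score out of it. It assumes a tab-separated table with the column headers `Bin Id`, `Completeness`, `Contamination` and `Strain heterogeneity` (the headers I would expect from CheckM's tabular output; please check them against your own file). The function name and the example rows are made up for illustration and are not part of metaWRAP.

```python
import csv
import io


def read_checkm_table(handle):
    """Map each bin ID to (completeness, contamination, strain heterogeneity).

    Assumes a tab-separated table whose header row contains the four
    column names used below; any other columns are ignored.
    """
    stats = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        stats[row["Bin Id"]] = (
            float(row["Completeness"]),
            float(row["Contamination"]),
            float(row["Strain heterogeneity"]),
        )
    return stats


# Made-up example rows, not real results.
example = io.StringIO(
    "Bin Id\tCompleteness\tContamination\tStrain heterogeneity\n"
    "bin.1\t98.50\t1.00\t100.00\n"
    "bin.2\t85.20\t7.40\t12.50\n"
)
table = read_checkm_table(example)
print(table["bin.1"])
```

With a real file, the same function can be given an open file handle instead of the `io.StringIO` object.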

Best, FZ

ursky commented 4 years ago

The heterogeneity score misleads many people, and most users will never need it, so I opted to exclude it to avoid confusion. In my experience, many biologists see it and assume that the heterogeneity score is another type of contamination (in other words, that the full contamination = contamination + heterogeneity), when in reality it's just a metric that describes the type of contamination you have. So you can have a complete MAG with 1% contamination but 100% heterogeneity, which can be confusing unless you know where the numbers come from.

franciscozorrilla commented 4 years ago

Valid points. I had a hard time figuring out what heterogeneity meant when I first started using CheckM, and indeed the big MAG papers only consider completeness and contamination scores when filtering for "good" MAGs. However, I am now interested in identifying which bugs have high strain heterogeneity in multi-species communities, and perhaps MAGs with high contamination that mostly comes from strain heterogeneity could still be useful as a sort of strain pan-genome. Any thoughts on this?
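To make the idea concrete, here is a rough sketch of the kind of filter I have in mind: keep the bins whose contamination is high and whose heterogeneity score is also high. The function name, the two cutoffs (5% and 80%) and the example values are all hypothetical, chosen only for illustration; they are not established thresholds.

```python
def flag_strain_mixed(bins, min_contamination=5.0, min_heterogeneity=80.0):
    """Return sorted IDs of bins with high contamination AND high heterogeneity.

    bins maps bin ID -> (completeness, contamination, strain heterogeneity),
    all in percent. Both cutoffs are illustrative assumptions.
    """
    return sorted(
        bin_id
        for bin_id, (_, contamination, heterogeneity) in bins.items()
        if contamination >= min_contamination
        and heterogeneity >= min_heterogeneity
    )


# Made-up example values, not real results.
bins = {
    "bin.1": (98.5, 1.0, 100.0),  # low contamination, so not flagged
    "bin.2": (85.2, 7.4, 12.5),   # contaminated, but not strain-like
    "bin.3": (91.0, 9.3, 88.9),   # contaminated, mostly strain-like
}
flagged = flag_strain_mixed(bins)
print(flagged)  # → ['bin.3']
```

Bins flagged this way would be the candidates for the "strain pan-genome" use case, rather than being discarded on contamination alone.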