apcamargo / magpurify2

Identify and remove contaminants from metagenome-assembled genomes
https://apcamargo.github.io/magpurify2/
GNU General Public License v3.0

Supported? #5

Closed: adityabandla closed this issue 8 months ago

adityabandla commented 9 months ago

@apcamargo Thanks for v2, is this package still actively supported?

apcamargo commented 9 months ago

I don't have any plans to go back to this project in the near future, but I can try to help you with any questions you have!

adityabandla commented 9 months ago

Thank you! I would like to give this a try, but I was unable to install it using either conda or pip.

apcamargo commented 9 months ago

Do you get any error message when you try to install it via pip? It should be available (see here).

adityabandla commented 9 months ago

Thanks, Antonio, for sharing the link. The pip error shown below came up when I tried installing magpurify2 with Python v3.12, but I was able to install it successfully after downgrading to Python v3.8.

[Screenshot of the pip error, 2024-02-16]

I took some time to go through the tool and the README. It's a great tool! I do have a few questions; please let me know if you would like me to open separate issues for each.

  1. It would be great to be able to run each module separately when doing the final contaminant filtering. I see that this is on your to-do list. Any idea if it can be implemented?
  2. For coverage-based filtering, I see that you calculate relative contig coverage by dividing the absolute coverage by the sum of the coverages of all contigs in a sample. However, I presume this misses the unassembled portion of the reads and hence does not give an accurate approximation of sequencing depth differences?
  3. How is the contig contamination probability estimated from the contig scores from each module? I see that you use an XGBoost model to make these predictions, but I am missing the background here.
  4. Is the contig contamination probability the probability that a contig is actually contamination? i.e., do higher values mean a contig is more likely to be contamination?
  5. I was unable to download the database using the commands given in the repo. I found 10.5281/zenodo.3817702 in your paper. Is this the same database?
  6. How were the models under the models directory built?

Cheers, Adi

apcamargo commented 9 months ago

> It would be great to be able to run each module separately when doing the final contaminant filtering. I see that this is on your to-do list. Any idea if it can be implemented?

Not sure if I understand this one. As far as I remember, each module is run separately and will compute separate scores for each contig.

> For coverage-based filtering, I see that you calculate relative contig coverage by dividing the absolute coverage by the sum of the coverages of all contigs in a sample. However, I presume this misses the unassembled portion of the reads and hence does not give an accurate approximation of sequencing depth differences?

This is true, but shouldn't interfere with the module. When binning (or trying to find binning problems, in this case), you're interested in finding covariations in contig coverages. Differences in the absolute coverage shouldn't really matter. For instance, VAMB performs binning using RPKM as a metric of abundance, effectively removing the effect of the sequencing depth.
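To make the covariation point concrete, here's a quick sketch (illustrative only, with made-up numbers; not the module's actual code) of how relative coverage is computed and why per-sample depth differences cancel out:

```python
import numpy as np

# Toy coverage matrix: rows are contigs, columns are samples. Values are
# absolute per-contig coverages (illustrative numbers only).
coverage = np.array([
    [10.0, 20.0,  5.0],
    [12.0, 24.0,  6.0],   # covaries with contig 0: likely the same genome
    [50.0,  2.0, 90.0],   # divergent profile: a candidate contaminant
])

# Relative coverage: divide by the per-sample total over assembled contigs.
# Unassembled reads are not counted, so this only approximates true relative
# abundance, but the ratios between contigs within a sample are preserved.
relative = coverage / coverage.sum(axis=0, keepdims=True)

# Doubling a sample's sequencing depth scales a whole column by a constant,
# which cancels out here: the covariation pattern is unchanged.
print(relative.round(3))
```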

> How is the contig contamination probability estimated from the contig scores from each module? I see that you use an XGBoost model to make these predictions, but I am missing the background here.

Each module will compute a score (or multiple scores) that measures how much each contig "belongs" to the core of the MAG. You can read about them in the documentation (which is incomplete, I'm sorry about that...). The purpose of the XGBoost model is to aggregate those scores into a single score that provides a better estimate than the individual scores.

I'd avoid calling those scores "probabilities" because they aren't actual probabilities, they are just abstract measures of how confident the model is in the classification. I write a bit about this here.
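For intuition, here's a toy sketch of the general idea (synthetic data and made-up feature names; not the actual magpurify2 model or its training code): per-contig module scores go in as features, and the boosted trees emit one aggregated score per contig.

```python
import numpy as np
import xgboost as xgb

# Toy data: one row per contig, one column per module score
# (e.g., composition, coverage, taxonomy). Purely illustrative.
rng = np.random.default_rng(42)
module_scores = rng.random((500, 3))
# Toy labels: 1 = contaminant, 0 = core. In magpurify2 these would come
# from simulated MAGs where the contaminant contigs are known.
labels = (module_scores.mean(axis=1) < 0.35).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(module_scores, labels)

# predict_proba returns the model's raw confidence, not a calibrated
# probability (hence "score" rather than "probability" above). Higher
# values point toward "contaminant" in this toy setup.
contaminant_score = model.predict_proba(module_scores)[:, 1]
print(contaminant_score[:5].round(3))
```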

> Is the contig contamination probability the probability that a contig is actually contamination? i.e., do higher values mean a contig is more likely to be contamination?

From a brief look at the code, it looks like the score of the XGBoost model increases when the likelihood of the contig being a contaminant increases.

> I was unable to download the database using the commands given in the repo. I found 10.5281/zenodo.3817702 in your paper. Is this the same database?

Yes!

> How were the models under the models directory built?

I trained the XGBoost models on simulated MAGs with varying levels of contamination. The contaminant contigs also varied in their level of taxonomic proximity to the "core" of the MAG.
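Schematically, the simulation could look something like this (my own reconstruction of the description above, not the actual magpurify2 training code):

```python
import random

def simulate_contaminated_mag(core_contigs, foreign_contigs, contamination):
    """Mix a fraction of foreign contigs into a core genome's contigs.

    Returns (contig, label) pairs, where label 1 marks a contaminant.
    These labels are what the model is trained against.
    """
    n_contaminants = int(len(core_contigs) * contamination)
    contaminants = random.sample(foreign_contigs, n_contaminants)
    mag = [(c, 0) for c in core_contigs] + [(c, 1) for c in contaminants]
    random.shuffle(mag)
    return mag

# Drawing foreign contigs from genomes at varying taxonomic distances
# (same genus, same family, different phylum, ...) varies how hard the
# contaminants are to detect.
mag = simulate_contaminated_mag([f"core_{i}" for i in range(20)],
                                [f"foreign_{i}" for i in range(50)], 0.2)
print(mag[:5])
```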

adityabandla commented 9 months ago

Thanks, Antonio, for being super helpful!

> Not sure if I understand this one. As far as I remember, each module is run separately and will compute separate scores for each contig.

Sorry for not phrasing my first question correctly. Indeed, the modules can be run separately; however, when filtering the MAGs, module outputs cannot be used independently. That is, if I want to run only coverage-based filtering, the tool currently looks for composition scores as well in the fast_mode and for all scores in the default mode.

> This is true, but shouldn't interfere with the module. When binning (or trying to find binning problems, in this case), you're interested in finding covariations in contig coverages. Differences in the absolute coverage shouldn't really matter. For instance, VAMB performs binning using RPKM as a metric of abundance, effectively removing the effect of the sequencing depth.

Thanks, this makes sense. I went through the documentation but could not find how the scores are computed for this module, and hence what exactly the derived scores represent.

Based on my reading of the documentation, the directions of the TNF and GC scores are opposite. For TNF, the score represents the membership level (a distance, I presume?) in the core cluster; i.e., contigs closer to the core cluster in the lower-dimensional space will have a smaller score, and vice versa. For GC, the documentation states that contigs with divergent GC content will have smaller scores.

> From a brief look at the code, it looks like the score of the XGBoost model increases when the likelihood of the contig being a contaminant increases.

If this is the case, the default value seems quite stringent and hence could throw out a lot of contigs? Would something like 0.7 be a good place to start?

Thank you once again for patiently answering my questions in detail. I am happy to contribute to the documentation if required.

Cheers, Adi

adityabandla commented 9 months ago

I studied the code a little bit more, and I see that contig scores for composition and coverage are computed using either log ratios or cluster memberships. Cluster memberships indicate the probability of a contig belonging to the core cluster; hence, higher is better, and the same goes for log ratios. In this sense, yes, the directions of all scores are the same: divergent contigs get smaller scores.

These module scores, however, contrast with the contaminant scores, which you say increase with the likelihood of a contig being a contaminant?

Is my understanding here correct?

apcamargo commented 9 months ago

> Sorry for not phrasing my first question correctly. Indeed, the modules can be run separately; however, when filtering the MAGs, module outputs cannot be used independently. That is, if I want to run only coverage-based filtering, the tool currently looks for composition scores as well in the fast_mode and for all scores in the default mode.

Ahh, ok. I think you could use the scores individually if you want to. You just need to tune the cutoff for each one of them and keep in mind that they are inversely proportional to the probability of being a contaminant (as you pointed out).
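If you want to try that, a minimal sketch could look like this (file and column names are my assumptions, not the tool's actual output schema):

```python
import pandas as pd

# Hypothetical per-module output; adjust the path and column names to
# whatever your run actually produces.
scores = pd.read_csv("coverage_scores.tsv", sep="\t")  # columns: contig, score

# Module scores are "higher = more core-like", the opposite direction of
# the final contaminant score, so we keep contigs at or above the cutoff.
cutoff = 0.5  # tune per module, e.g., by checking the result with CheckM
keep = set(scores.loc[scores["score"] >= cutoff, "contig"])
print(f"Keeping {len(keep)} of {len(scores)} contigs")
```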

> If this is the case, the default value seems quite stringent and hence could throw out a lot of contigs? Would something like 0.7 be a good place to start?

This is precisely the point where I was when I paused the development. My idea was to have a dynamic cutoff where the stringency would be low for highly contaminated MAGs (to try to remove more contigs) and high for MAGs with low contamination (to try to remove very few contigs). I think you can play around with varying the cutoff and running CheckM to tune the stringency to your liking. 0.7 seems like a good place to start.
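To sketch what I mean by a dynamic cutoff (this was never implemented; the linear mapping and the endpoint values below are my assumptions, not a finished design):

```python
def dynamic_cutoff(estimated_contamination, clean_cutoff=0.9, dirty_cutoff=0.5):
    """Map an estimated contamination fraction (0-1) to a score cutoff.

    Contigs scoring above the cutoff are removed, so a high cutoff is
    conservative (removes very little) and a low cutoff is aggressive.
    """
    frac = min(max(estimated_contamination, 0.0), 1.0)
    return clean_cutoff + (dirty_cutoff - clean_cutoff) * frac

print(dynamic_cutoff(0.05))  # ~0.88: conservative for a nearly clean MAG
print(dynamic_cutoff(0.50))  # 0.70: more aggressive for a contaminated MAG
```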

> These module scores, however, contrast with the contaminant scores, which you say increase with the likelihood of a contig being a contaminant?

This seems to be the case, as the code filters out anything above the threshold. But it's pretty easy to test too. Just run the pipeline for a MAG and take a look at the score distribution. Most of the contigs shouldn't be contaminants, so, if we are right about this, their scores should be low.
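Something like this quick check would do it (output file and column names are assumptions; adjust them to your run's actual output):

```python
import pandas as pd

# Hypothetical final-score output for one MAG.
df = pd.read_csv("contaminant_scores.tsv", sep="\t")  # columns: contig, score

# If higher scores really mean "contaminant", most contigs (which should
# be genuine members of the MAG) will cluster at the low end.
print(df["score"].describe())
print(f'{(df["score"] > 0.75).mean():.1%} of contigs score above 0.75')
```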

adityabandla commented 8 months ago

Thanks @apcamargo! This was very helpful