borenstein-lab / MUSiCC

MUSiCC: A marker genes based framework for metagenomic normalization and accurate profiling of gene abundances in the microbiome
BSD 3-Clause "New" or "Revised" License
15 stars 5 forks source link

Negative Values for some KOs after normalization? #12

Closed hhollandmoritz closed 3 years ago

hhollandmoritz commented 4 years ago

Hello,

Thanks for creating this tool! We have come across a situation where we get negative counts for some of the KOs after normalization and learn_model correction. Many of these genes appear to be house-keeping genes (e.g. ribosomal proteins) so our current hypothesis is that they are single copy genes that aren't classified by MUSiCC as single-copy. Before we move forward, I wanted to check to make sure that you haven't seen negative counts show up for some other reason and that we aren't making some error in usage.

The data are from soil and rhizosphere samples so we are using learn_model rather than the HMP model.

Here is the data that we used to run the program: https://drive.google.com/file/d/1rp5hnWqF-AsXa1WA-L09-D2lemkvpG-w/view?usp=sharing

This is the command we used to run it:

run_musicc.py gene_abundance.csv -if csv -of csv -o gene_abundance_normalized.csv -n -c learn_model -perf -v

Thanks for your help!

engal commented 4 years ago

Hi,

Just wanted to let you know we're looking into this issue. We haven't come across this issue in the past, so I don't have any immediate advice for you.

I'll let you know if we do discover the underlying cause for this issue.

hhollandmoritz commented 4 years ago

Sounds great, thanks for the help! I should mention, that it's only the learn_model step that generates the negative values. The initial normalization works fine.

Also, in case it's relevant, the first 12 samples are bulk soil samples and the second 12 are rhizosphere samples. We thought at first that maybe learn_model was failing because the samples from different environments needed to be treated differently. Since rhizospheres tend to have more fast-responding species than bulk soil (and so potentially have different base-line copy number variation). Unfortunately, when we separate the rhizosphere from bulk soil samples, we still see negative values.

The other thought we had was that perhaps there is something odd about the annotations or mapping for gene counts. (Though I wouldn't say I really understand MUSiCC's algorithm well enough to be confident in how annotation methods might or might not have downstream impacts). The annotations were made using the eggnog emapper.py script and the read counts were generated with salmon.

Let me know if there's any other information that might be helpful.

engal commented 4 years ago

I have an update, and a possible solution.

Based on testing with your data (mainly focusing on your first sample at the moment), it looks like there are some genes expected to be universal single-copy genes that are extreme outliers (e.g. 0 abundance/absent in the sample, or very low abundance). This seems to be impacting model learning, causing the model to predict very slight negative correction factors, which will then result in negative abundances. However, by removing all 0 abundance KOs (which will include those outlier universal single-copy genes) from a sample, it looks like the learned model no longer makes negative value predictions.

I don't know the underlying reason for why some of your samples have such extreme low abundance for what we expect to be normally universal single-copy genes, but I think the simplest solution for the moment is to split your samples into separate files, remove all rows for 0 abundance KOs, run MUSiCC on each abundance file separately, then recombine.

Regarding your concern about samples from different environments being an issue, I should note that MUSiCC learns a different model for each sample, it never takes into account multiple samples when correcting a single sample. The two types of correction (intra vs. inter) refer to, respectively, correcting abundances due to potential sequencing/abundance calculation biases within the sample and correcting abundances based on universal single-copy gene abundance so that corrected abundances are comparable between samples.

Hope that helps!

hhollandmoritz commented 4 years ago

This helps a ton! Thank you so much!