Running MUSiCC with custom gene IDs

zkstewart commented 7 years ago

Hi,

I was wondering whether it is possible to run MUSiCC using custom gene IDs (i.e., not KOs). The metagenome I am working with had its gene models predicted using an ab initio method, so not all of the genes have corresponding KO assignments.

To make MUSiCC run with custom gene IDs, would it be a matter of replacing these gene IDs with their best-matching KO assignments wherever this is possible? Would duplicates need to have their read counts collapsed into a single entry, and would it matter if not all gene IDs have a KO label?

I could probably figure this out through trial and error, and that is why I deleted my original comment on the other issue. I've asked the question here again at your request. Apologies if this is detailed elsewhere.

Thanks, Zac.

omanor commented 7 years ago

MUSiCC uses a set of KOs that are both universal (i.e., appear in almost all bacteria) and almost always encoded by a single copy. It therefore has to find these KOs in the samples to be able to normalize them. If you have other gene IDs, you can try to map them to the list of USiCGs found here (https://github.com/borenstein-lab/MUSiCC/blob/master/musicc/data/uscg_76_kegg_min_2313.lst) and then try running MUSiCC.
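A minimal Python sketch of the mapping step described above, for illustration only. It assumes that counts for genes sharing a KO should be summed and that genes without a KO are dropped; neither assumption is confirmed in this thread. The function names, the toy gene IDs, and the two-entry marker list are hypothetical, and none of this is part of MUSiCC itself.

```python
from collections import defaultdict


def collapse_to_kos(gene_counts, gene_to_ko):
    """Sum per-sample counts of all genes that share a KO.

    Genes with no KO assignment are skipped (an assumption, see lead-in).
    """
    ko_counts = defaultdict(lambda: defaultdict(float))
    for gene, sample_counts in gene_counts.items():
        ko = gene_to_ko.get(gene)
        if ko is None:
            continue  # no KO label for this gene
        for sample, count in sample_counts.items():
            ko_counts[ko][sample] += count
    return {ko: dict(samples) for ko, samples in ko_counts.items()}


def markers_found(ko_counts, marker_ids):
    """Return the marker KOs that are actually present in the collapsed table."""
    return sorted(set(ko_counts) & set(marker_ids))


# Toy data: two custom genes map to the same KO, one has no KO at all.
gene_counts = {
    "gene_001": {"S1": 10, "S2": 4},
    "gene_002": {"S1": 5, "S2": 6},
    "gene_003": {"S1": 7, "S2": 1},
}
gene_to_ko = {"gene_001": "K01866", "gene_002": "K01866"}

ko_counts = collapse_to_kos(gene_counts, gene_to_ko)
found = markers_found(ko_counts, ["K01866", "K01867"])
print(ko_counts)
print(found)
```

Checking how many of the 76 marker KOs survive the mapping is probably worth doing before running MUSiCC, since the normalization depends on finding them in the samples.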

hhollandmoritz commented 3 years ago

Hello,

I'm trying to do something similar to what Zac wanted to do. What is the difference between the semi-USCG list and the USCG list in the data folder? In particular, I'm not sure whether the semi-USCG list is required for MUSiCC to perform normalization properly, or whether it is simply a nice bit of extra information/capability in the program.

Thanks! Hannah

engal commented 3 years ago

Hi Hannah,

The semi-USCG list (genes that are single copy in many genomes, but not quite as prevalent as the genes in the USCG list) is used for validation purposes. When you learn a model with MUSiCC, one of the statistics it can report is performance on semi-USCG prediction. This gives you a benchmark for the correction on a set of genes that you expect to be close to single copy, but that were not used during model training.
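To illustrate the idea of a held-out benchmark, here is a small Python sketch. It is not the statistic MUSiCC reports; it only assumes that, after normalization, genes expected to be near single copy should have abundances close to 1, and it measures how far a held-out set deviates from that. The function name and the gene IDs are hypothetical.

```python
from statistics import mean


def held_out_deviation(normalized, held_out_ids):
    """Mean absolute deviation from 1.0 over the held-out genes that are present."""
    values = [normalized[g] for g in held_out_ids if g in normalized]
    if not values:
        raise ValueError("none of the held-out genes are present in the sample")
    return mean(abs(v - 1.0) for v in values)


# Toy normalized sample: two held-out genes are present, one is missing.
normalized_sample = {"SEMI_1": 1.05, "SEMI_2": 0.95, "OTHER": 3.0}
deviation = held_out_deviation(normalized_sample, ["SEMI_1", "SEMI_2", "SEMI_3"])
print(round(deviation, 3))
```

A smaller value suggests the correction generalizes to genes it was not fitted on, which is the role the semi-USCG list plays here.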

Hope that helps!

hhollandmoritz commented 3 years ago

It does, thanks!

aimirza commented 3 years ago

> MUSiCC uses a set of KOs that are both universal (i.e., appear in almost all bacteria) and are almost always encoded by a single copy. Therefore, it has to find these KOs in the samples to be able to normalize the samples. If you have other gene IDs, you can try to map them to the list of USiCGs found here (https://github.com/borenstein-lab/MUSiCC/blob/master/musicc/data/uscg_76_kegg_min_2313.lst) and try to run MUSiCC

I wanted to normalize my Enzyme Commission (EC) profiles generated with humann3 (regroup function) and map them to the CAZy database. I linked KOs to ECs using the R function KEGGREST::keggLink("enzyme", as.character(uscg_76_kegg)), where uscg_76_kegg contains the 76 USiCGs, and was only able to map 30 of the 76. The results are below. I then divided each sample by the median of these 30. Does this look right?

  ko:K00133      ko:K00789      ko:K00927      ko:K00939      ko:K01689      ko:K01803      ko:K01866 
 "ec:1.2.1.11"   "ec:2.5.1.6"   "ec:2.7.2.3"   "ec:2.7.4.3"  "ec:4.2.1.11"   "ec:5.3.1.1"   "ec:6.1.1.1" 
     ko:K01867      ko:K01868      ko:K01869      ko:K01870      ko:K01872      ko:K01873      ko:K01874 
  "ec:6.1.1.2"   "ec:6.1.1.3"   "ec:6.1.1.4"   "ec:6.1.1.5"   "ec:6.1.1.7"   "ec:6.1.1.9"  "ec:6.1.1.10" 
     ko:K01875      ko:K01876      ko:K01881      ko:K01883      ko:K01887      ko:K01889      ko:K01890 
 "ec:6.1.1.11"  "ec:6.1.1.12"  "ec:6.1.1.15"  "ec:6.1.1.16"  "ec:6.1.1.19"  "ec:6.1.1.20"  "ec:6.1.1.20" 
     ko:K01892      ko:K01937      ko:K02528      ko:K03040      ko:K03106      ko:K03438      ko:K03470 
 "ec:6.1.1.21"   "ec:6.3.4.2" "ec:2.1.1.182"   "ec:2.7.7.6"   "ec:3.6.5.4" "ec:2.1.1.199"  "ec:3.1.26.4" 
     ko:K09903      ko:K10773 
 "ec:2.7.4.22" "ec:4.2.99.18"
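For reference, the median scaling described above can be sketched in Python as follows. This is only an illustration of dividing a sample by the median abundance of its marker ECs, not MUSiCC's own code; the function name, the toy abundances, and the bare EC-number format are assumptions. Because K01889 and K01890 both map to ec:6.1.1.20 in the output above, the sketch de-duplicates the marker list before taking the median.

```python
from statistics import median


def normalize_by_marker_median(profile, marker_ids):
    """Divide every feature in one sample by the median abundance of its markers."""
    markers = set(marker_ids)  # two KOs can map to the same EC
    present = [profile[m] for m in markers if m in profile]
    if not present:
        raise ValueError("no marker features found in this sample")
    scale = median(present)
    if scale <= 0:
        raise ValueError("median marker abundance must be positive")
    return {feature: value / scale for feature, value in profile.items()}


# Toy sample: three distinct marker ECs plus one non-marker EC.
marker_ecs = ["6.1.1.20", "6.1.1.20", "6.1.1.1", "2.7.4.3"]
sample = {"6.1.1.20": 40.0, "6.1.1.1": 20.0, "2.7.4.3": 18.0, "3.2.1.4": 5.0}
normalized = normalize_by_marker_median(sample, marker_ecs)
print(normalized)
```

The median is taken per sample, so each sample gets its own scaling factor.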

engal commented 3 years ago

Those assignments look correct. You might even be able to run MUSiCC on your EC data directly if you replace the uscg and semi-uscg lists in the data folder: copy the default lists somewhere to save them, replace the contents of the uscg list with this list of ECs, and replace the semi-uscg list with an empty file. I haven't fully tested this functionality, though.
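The backup-and-replace steps could be scripted roughly like this in Python. It is an untested-against-MUSiCC sketch: it assumes the list files hold one ID per line, and the file names in the demo are placeholders created in a temporary directory, not the real files in the data folder. The function name is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def swap_marker_lists(uscg_path, semi_uscg_path, backup_dir, marker_ids):
    """Back up both default lists, then write the new markers and empty the semi list."""
    uscg_path, semi_uscg_path = Path(uscg_path), Path(semi_uscg_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    for path in (uscg_path, semi_uscg_path):
        shutil.copy2(path, backup_dir / path.name)  # keep the defaults
    unique_ids = list(dict.fromkeys(marker_ids))  # drop repeats, keep order
    # Assumption: one ID per line, as a plain-text list.
    uscg_path.write_text("\n".join(unique_ids) + "\n")
    semi_uscg_path.write_text("")  # empty file, as suggested above
    return unique_ids


# Demo on placeholder files so nothing in a real installation is touched.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    uscg_file = data_dir / "uscg_placeholder.lst"
    semi_file = data_dir / "semi_uscg_placeholder.lst"
    uscg_file.write_text("K00133\n")
    semi_file.write_text("K00001\n")

    written = swap_marker_lists(
        uscg_file, semi_file, Path(tmp) / "backup",
        ["6.1.1.1", "6.1.1.20", "6.1.1.20"],
    )
    new_uscg_text = uscg_file.read_text()
    new_semi_text = semi_file.read_text()
    backup_text = (Path(tmp) / "backup" / uscg_file.name).read_text()

print(written)
```

To restore the defaults afterwards, copy the files in the backup directory back over the modified ones.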
