Open zkstewart opened 7 years ago
MUSiCC uses a set of KOs that are both universal (i.e., present in almost all bacteria) and almost always encoded in a single copy. It therefore has to find these KOs in your samples in order to normalize them. If you have other gene IDs, you can try mapping them to the list of USiCGs found here (https://github.com/borenstein-lab/MUSiCC/blob/master/musicc/data/uscg_76_kegg_min_2313.lst) and then running MUSiCC.
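As a rough way to see how much of the USiCG list your own annotations cover before running MUSiCC, something like the sketch below may help. It is not part of MUSiCC: the function names, the gene IDs, and the `K99999` placeholder are made up for illustration, and it assumes the `.lst` file holds one KO per line.

```python
def read_id_list(path):
    """Read one ID per line, skipping blank lines (assumed layout of the .lst file)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def uscg_coverage(gene_to_ko, uscg_kos):
    """Return (found, missing) USiCG KOs for a custom gene-ID -> KO mapping."""
    uscg = set(uscg_kos)
    # A USiCG counts as found if at least one custom gene ID maps to it.
    found = {ko for ko in gene_to_ko.values() if ko in uscg}
    return sorted(found), sorted(uscg - found)


# Hypothetical mapping from ab initio gene IDs to best-matching KOs.
gene_to_ko = {
    "gene_0001": "K01866",
    "gene_0002": "K01803",
    "gene_0003": "K99999",  # placeholder for a KO that is not a USiCG
}
# A three-entry stand-in for the real 76-KO list.
uscg_kos = ["K01803", "K01866", "K01867"]

found, missing = uscg_coverage(gene_to_ko, uscg_kos)
print(f"{len(found)} of {len(uscg_kos)} USiCGs covered; missing: {missing}")
```

If many USiCGs end up in `missing`, the normalization has fewer markers to work with, so it is worth checking this first.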
Hello,
I'm trying to do something similar to what Zac wanted to do. What is the difference between the semi-USCG list and the USCG list in the data folder? In particular, I'm not sure whether the semi-USCG list is required for MUSiCC to perform normalization properly, or whether it is simply a nice bit of extra information/capability in the program.
Thanks! Hannah
Hi Hannah,
The semi-USCG list (genes that are single copy in many genomes, but not quite as prevalent as the genes in the USCG list) is used for validation purposes. When you learn a model with MUSiCC, one of the statistics it can report is performance on semi-USCG prediction. This gives you a benchmark for the correction on a set of genes that you expect to be close to single copy but that were not used during model training.
Hope that helps!
It does, thanks!
I wanted to normalize my Enzyme Commission (EC) profiles generated with HUMAnN 3 (regroup function) and map them to the CAZy database. I linked KOs to ECs and was only able to map 30 of the 76 USiCGs, using the R function KEGGREST::keggLink("enzyme", as.character(uscg_76_kegg)), where uscg_76_kegg contains the 76 USiCGs. The results are below. I divided each sample by the median of these 30. Does this look right?
ko:K00133 ko:K00789 ko:K00927 ko:K00939 ko:K01689 ko:K01803 ko:K01866
"ec:1.2.1.11" "ec:2.5.1.6" "ec:2.7.2.3" "ec:2.7.4.3" "ec:4.2.1.11" "ec:5.3.1.1" "ec:6.1.1.1"
ko:K01867 ko:K01868 ko:K01869 ko:K01870 ko:K01872 ko:K01873 ko:K01874
"ec:6.1.1.2" "ec:6.1.1.3" "ec:6.1.1.4" "ec:6.1.1.5" "ec:6.1.1.7" "ec:6.1.1.9" "ec:6.1.1.10"
ko:K01875 ko:K01876 ko:K01881 ko:K01883 ko:K01887 ko:K01889 ko:K01890
"ec:6.1.1.11" "ec:6.1.1.12" "ec:6.1.1.15" "ec:6.1.1.16" "ec:6.1.1.19" "ec:6.1.1.20" "ec:6.1.1.20"
ko:K01892 ko:K01937 ko:K02528 ko:K03040 ko:K03106 ko:K03438 ko:K03470
"ec:6.1.1.21" "ec:6.3.4.2" "ec:2.1.1.182" "ec:2.7.7.6" "ec:3.6.5.4" "ec:2.1.1.199" "ec:3.1.26.4"
ko:K09903 ko:K10773
"ec:2.7.4.22" "ec:4.2.99.18"
Those assignments look correct. You might even be able to run MUSiCC using your EC data if you replace the uscg and semi-uscg lists in the data folder with a list of these ECs (e.g. copy those default lists somewhere to save them, replace the uscg contents with this list of ECs, and replace the semi-uscg list with an empty file), though I haven't fully tested this functionality.
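For reference, the median scaling described in the question (dividing each sample by the median abundance of the mapped marker ECs) could be sketched as follows. This is a plain illustration of that one step, not MUSiCC's own code; the function name and the abundance values are invented. In the listing above, K01889 and K01890 both map to ec:6.1.1.20, so the 30 KOs give 29 distinct ECs, and the sketch de-duplicates the marker list for that reason.

```python
from statistics import median


def normalize_by_marker_median(abundances, marker_ids):
    """Divide every feature in one sample by the median abundance of the markers."""
    # De-duplicate markers while keeping their order (two KOs can share one EC).
    markers = list(dict.fromkeys(marker_ids))
    # Assumption: a marker that is absent from the sample counts as abundance 0.
    marker_values = [abundances.get(marker, 0.0) for marker in markers]
    if not marker_values:
        raise ValueError("no marker IDs supplied")
    scale = median(marker_values)
    if scale == 0:
        raise ValueError("median marker abundance is zero; cannot normalize")
    return {feature: value / scale for feature, value in abundances.items()}


# Made-up abundances for one sample: three marker ECs and one EC of interest.
sample = {"6.1.1.1": 10.0, "6.1.1.2": 12.0, "6.1.1.3": 8.0, "3.2.1.4": 25.0}
markers = ["6.1.1.1", "6.1.1.2", "6.1.1.3"]

normalized = normalize_by_marker_median(sample, markers)
print(normalized["3.2.1.4"])  # abundance relative to the median marker EC
```

After scaling, a value can be read as an approximate average copy number per genome, to the extent that the marker ECs really are single copy in the community.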
Hi,
I was wondering whether it is possible to run MUSiCC using custom gene IDs (i.e., not KOs). The metagenome I am working with had gene models predicted using an ab initio method, and thus not all the genes have corresponding KO assignments.
To make MUSiCC run with custom gene IDs, would it be a matter of replacing these gene IDs with their best-matching KO assignments wherever this is possible? Would duplicates need to have their read counts collapsed into a single entry, and would it matter if not all gene IDs have a KO label?
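To make the question concrete, the collapsing step I have in mind would look roughly like the sketch below: read counts of all genes assigned to the same KO are summed, and genes without a KO are set aside. Whether MUSiCC needs this is exactly what I am asking, so treat it as an assumption; the function name, gene IDs, and counts are made up.

```python
from collections import defaultdict


def collapse_to_ko(gene_counts, gene_to_ko):
    """Sum read counts per KO; return (ko_counts, unmapped_gene_counts)."""
    ko_counts = defaultdict(float)
    unmapped = {}
    for gene, count in gene_counts.items():
        ko = gene_to_ko.get(gene)
        if ko is None:
            # No KO assignment: keep the gene separately rather than dropping it silently.
            unmapped[gene] = count
        else:
            ko_counts[ko] += count
    return dict(ko_counts), unmapped


# Hypothetical counts: two gene models hit the same KO, one has no KO at all.
gene_counts = {"gene_0001": 5, "gene_0002": 7, "gene_0003": 3}
gene_to_ko = {"gene_0001": "K01803", "gene_0002": "K01803"}

ko_counts, unmapped = collapse_to_ko(gene_counts, gene_to_ko)
print(ko_counts, unmapped)
```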
I could probably figure this out through trial and error, and that is why I deleted my original comment on the other issue. I've asked the question here again at your request. Apologies if this is detailed elsewhere.
Thanks, Zac.