Closed jolespin closed 1 year ago
Thank you for creating this issue. We currently field issues through our bioBakery Discourse Support Forum. If you would please post the issue to Discourse, we would be happy to sync up with you to get it resolved.
Hello @jolespin, sorry, we currently do not have a tutorial documenting how to add markers to MetaPhlAn v4. If you could describe what you need to do, maybe I can answer some questions to help get you started. Just let me know!
Thanks! Lauren
Hi Lauren, Thanks for responding to this! Ideally I'd like to create a marine-specific metaphlan4 db but to start I will probably just add my sequences to the existing database to make it easier.
What I have is the following:
Here's an example of the sequence header structure I have:
>COI_2359223918bb7ae64fe3bf5702d8865d|k__Eukaryota;p__Arthropoda;c__Hexanauplia;o__Poecilostomatoida;f__Chondracanthidae;g__Acanthochondria;s__Acanthochondria_rectangularis
aactctttatttacttagaggtatttgatcaggaataatcggtaggaggctaagagtcttaattcgtttagaattaactcaagggggagcatttttaggtaatgaccaactttataatgttgtagttactgctcatgcttttgtaataattttttttatagttatacctattttaattggtggttttgggaactgattagtgcctttaataattggggctccagatatagccttccctcgattgaataatataagtttctgatttttaattccttctttatttatattagtgtctagaataataacagagagaggagcaggaacaggatgaaccgtgtaccctcctctcagaagaaatgtaagacacgccggatcttctgtagatctggtaattttttctttacacttagcaggggtttcttcaattttaggggctttaaattttatttctaccattgttaatttacggactcttggtttattcctggatcgaactccattattttgttgagcagttttagtaacagcagtattattattattatctttacctgttttagccggggctattacaatattattaacggatcgaaatttgaatacttctttttatgacccaaggggtggaggagat
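As a minimal sketch of how a header in this format could be split into a marker name and a MetaPhlAn-style taxonomy string: the function name `parse_header` is hypothetical, and the only assumption is that the header is `>marker_id|lineage` with a semicolon-delimited lineage, which is then rewritten with the pipe delimiter used by the taxonomy strings in the database.

```python
def parse_header(header):
    """Split '>marker_id|k__...;p__...;...;s__...' into (marker_id, taxon)."""
    # Split only on the first pipe so the lineage stays in one piece
    marker_id, lineage = header.lstrip(">").strip().split("|", 1)
    # The database taxonomy strings are pipe-delimited, not semicolon-delimited
    taxon = "|".join(lineage.split(";"))
    return marker_id, taxon


header = (
    ">COI_2359223918bb7ae64fe3bf5702d8865d|k__Eukaryota;p__Arthropoda;"
    "c__Hexanauplia;o__Poecilostomatoida;f__Chondracanthidae;"
    "g__Acanthochondria;s__Acanthochondria_rectangularis"
)
marker_id, taxon = parse_header(header)
print(marker_id)
print(taxon)
```

Note that this lineage stops at the species level, so it has seven levels and no `t__` entry.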
Here's the tutorial that is available:
import bz2, pickle
with bz2.open("mpa_vOct22_CHOCOPhlAnSGB_202212.pkl", "r") as f:
    db = pickle.load(f)
# Add the taxonomy of the new genomes
db['taxonomy']['7-levels taxonomy with clade names of genome1'] = ('7-levels NCBI taxonomy id of genome1', length of genome1) # ('2|1224|28216|80840|80864|2828371|2259674|', 3775055)
db['taxonomy']['7-levels taxonomy with clade names of genome2'] = ('7-levels NCBI taxonomy id of genome2', length of genome2)
# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
    'clade': the clade that the marker belongs to,
    'ext': {the GCA of the first external genome where the marker appears,
            the GCA of the second external genome where the marker appears,
           },
    'len': length of the marker,
    'taxon': the taxon of the marker
}
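The template above could be filled in along these lines. This is only a sketch under stated assumptions, not a confirmed procedure: `add_marker` and `save_db` are hypothetical helpers, the taxonomy-id string and genome length are placeholders rather than real values, the marker sequence is truncated to a toy length, `'ext'` is stored as a list because that is how the existing entry in this thread looks, and taking the clade from the last taxonomy level is a guess for a lineage without a `t__` level. A toy dict stands in for the real pickle so the example runs without the download.

```python
import bz2
import pickle


def add_marker(db, marker_name, taxon, taxids, genome_length, marker_length, ext=()):
    """Register one taxon and one marker in an already-loaded database dict."""
    db["taxonomy"][taxon] = (taxids, genome_length)
    db["markers"][marker_name] = {
        "clade": taxon.split("|")[-1],  # assumption: terminal taxonomy level
        "ext": list(ext),               # accessions of external genomes, if any
        "len": marker_length,
        "taxon": taxon,
    }
    return db


def save_db(db, path):
    """Write the modified dict back out as a bz2-compressed pickle."""
    with bz2.open(path, "wb") as handle:
        pickle.dump(db, handle, pickle.HIGHEST_PROTOCOL)


# Toy stand-in for the dict loaded from the real .pkl file
db = {"taxonomy": {}, "markers": {}}

taxon = (
    "k__Eukaryota|p__Arthropoda|c__Hexanauplia|o__Poecilostomatoida|"
    "f__Chondracanthidae|g__Acanthochondria|s__Acanthochondria_rectangularis"
)
sequence = "aactctttatttacttagag"    # toy: first 20 bases of the COI example only
placeholder_taxids = "0|0|0|0|0|0|0|"  # hypothetical, NOT real NCBI taxonomy ids
placeholder_genome_length = 1_000_000  # hypothetical value

add_marker(
    db,
    marker_name="COI_2359223918bb7ae64fe3bf5702d8865d",
    taxon=taxon,
    taxids=placeholder_taxids,
    genome_length=placeholder_genome_length,
    marker_length=len(sequence),
)
print(db["markers"]["COI_2359223918bb7ae64fe3bf5702d8865d"])
# save_db(db, "custom_db.pkl")  # hypothetical output path
```

This covers only the pickle side. The marker sequences themselves would presumably also have to be made available to the aligner, which is not shown here.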
I have a few specific questions:
k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Comamonadaceae|g__Calidifontimicrobium|s__Calidifontimicrobium_sediminis|t__SGB32561
What is the t__SGB level of taxonomy? Is this a unique genome assembly ID? Is this essential?
In [6]: db["markers"]['UniRef90_UPI000E65AEFC|1__27|SGB32561']
Out[6]:
{'clade': 't__SGB32561',
 'ext': [],
 'len': 4050,
 'taxon': 'k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Comamonadaceae|g__Calidifontimicrobium|s__Calidifontimicrobium_sediminis|t__SGB32561'}
UniRef90_UPI000E65AEFC|1__27|SGB32561
In [13]: list(db["merged_taxon"].keys())[0]
Out[13]:
('k__Bacteria|p__Thermotogae|c__Thermotogae|o__Thermotogales|f__Fervidobacteriaceae|g__Thermosipho|s__Thermosipho_affectus|t__SGB24702',
'2|200918|188708|2419|1643950|2420|660294|')
Hi @ljmciver just checking in about this. Thanks!
Hi @jolespin, thanks for the ping! Sorry, I am not sure of the answers to some of these questions. Yes, the length is important. Yes, I would follow the exact formatting of the ids as much as possible. I believe the clade is the terminal taxon. The final "t__" level is the SGB (species-level genome bin) identifier in MetaPhlAn v4; in v3 it was the strain. Hopefully those are enough answers to get you started! If you get stuck on anything, feel free to ping again and I can sync up with the main MetaPhlAn developers to answer any questions I can't.
Thanks! Lauren
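The points in this reply (length matters, ids should follow the existing format, the clade is believed to be the terminal taxon) can be turned into a quick sanity check on a marker entry before it is added. This is a sketch only: `check_marker_entry` and `LEVEL_PREFIXES` are hypothetical names, and the rules are inferred from the single entry shown in this thread rather than from any documented specification.

```python
LEVEL_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__", "t__"]


def check_marker_entry(entry):
    """Return a list of problems found in one db['markers'] value (empty if none)."""
    problems = []
    levels = entry["taxon"].split("|")
    # Each level should carry the prefix expected at its position
    for level, prefix in zip(levels, LEVEL_PREFIXES):
        if not level.startswith(prefix):
            problems.append(f"level {level!r} does not start with {prefix!r}")
    if len(levels) > len(LEVEL_PREFIXES):
        problems.append("more taxonomy levels than expected")
    # Assumption from the thread: the clade is the terminal taxonomy level
    if entry["clade"] != levels[-1]:
        problems.append("clade is not the terminal taxonomy level")
    if not (isinstance(entry["len"], int) and entry["len"] > 0):
        problems.append("len is not a positive integer")
    return problems


# The existing entry quoted earlier in the thread
existing = {
    "clade": "t__SGB32561",
    "ext": [],
    "len": 4050,
    "taxon": (
        "k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|"
        "f__Comamonadaceae|g__Calidifontimicrobium|"
        "s__Calidifontimicrobium_sediminis|t__SGB32561"
    ),
}
print(check_marker_entry(existing))
```

An entry whose clade does not match its taxonomy string, or whose length is missing, would come back with a non-empty list of problems.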
I'm looking to create a custom MetaPhlAn4 database using a target marker set, but I’m not seeing any tutorials on how to do this.
I see a tutorial on customizing the database but not for creating new ones:
https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4#customizing-the-database
Is this documented anywhere? If not, could you do a brief tutorial on how to do this?