biobakery / MetaPhlAn

MetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data
http://segatalab.cibio.unitn.it/tools/metaphlan/index.html
MIT License

[Request] Tutorial on generating a custom database for MetaPhlAn 4 from scratch (not updating the existing database) #203

Closed: jolespin closed this issue 1 year ago

jolespin commented 1 year ago

I'm looking to create a custom MetaPhlAn 4 database from a target marker set, but I'm not seeing any tutorials on how to do this.

I see a tutorial on customizing the existing database, but not on creating a new one:

https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4#customizing-the-database

Is this documented anywhere? If not, could you do a brief tutorial on how to do this?

github-actions[bot] commented 1 year ago

Thank you for creating this issue. We currently field issues through our bioBakery Discourse Support Forum. If you would please post the issue to discourse we would be happy to sync up with you to get it resolved.

ljmciver commented 1 year ago

Hello @jolespin, sorry, we currently do not have a tutorial documenting how to add markers to MetaPhlAn v4. If you could describe what you need to do, maybe I can answer some questions to help get you started. Just let me know!

Thanks! Lauren

jolespin commented 1 year ago

Hi Lauren, thanks for responding to this! Ideally I'd like to create a marine-specific MetaPhlAn 4 database, but to start I will probably just add my sequences to the existing database to make it easier.

What I have is the following:

Here's an example of the sequence header structure I have:

>COI_2359223918bb7ae64fe3bf5702d8865d|k__Eukaryota;p__Arthropoda;c__Hexanauplia;o__Poecilostomatoida;f__Chondracanthidae;g__Acanthochondria;s__Acanthochondria_rectangularis
aactctttatttacttagaggtatttgatcaggaataatcggtaggaggctaagagtcttaattcgtttagaattaactcaagggggagcatttttaggtaatgaccaactttataatgttgtagttactgctcatgcttttgtaataattttttttatagttatacctattttaattggtggttttgggaactgattagtgcctttaataattggggctccagatatagccttccctcgattgaataatataagtttctgatttttaattccttctttatttatattagtgtctagaataataacagagagaggagcaggaacaggatgaaccgtgtaccctcctctcagaagaaatgtaagacacgccggatcttctgtagatctggtaattttttctttacacttagcaggggtttcttcaattttaggggctttaaattttatttctaccattgttaatttacggactcttggtttattcctggatcgaactccattattttgttgagcagttttagtaacagcagtattattattattatctttacctgttttagccggggctattacaatattattaacggatcgaaatttgaatacttctttttatgacccaaggggtggaggagat
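
To make the header structure concrete, here is a minimal Python sketch that splits a header of this form into the marker identifier and its ranked lineage. The function name `parse_header` is hypothetical, and the sketch assumes every header uses `|` between the identifier and the lineage and `;` between ranks, as in the example.

```python
def parse_header(header: str) -> tuple[str, dict[str, str]]:
    """Split '>marker_id|k__...;p__...;s__...' into (marker_id, {rank_prefix: name})."""
    # Everything before the first '|' is the marker identifier; the rest is the lineage.
    marker_id, lineage = header.lstrip(">").strip().split("|", 1)
    ranks = {}
    for level in lineage.split(";"):
        # Each level looks like 'k__Eukaryota': a one-letter rank prefix, '__', then the name.
        prefix, name = level.split("__", 1)
        ranks[prefix] = name
    return marker_id, ranks


header = (
    ">COI_2359223918bb7ae64fe3bf5702d8865d|k__Eukaryota;p__Arthropoda;"
    "c__Hexanauplia;o__Poecilostomatoida;f__Chondracanthidae;"
    "g__Acanthochondria;s__Acanthochondria_rectangularis"
)
marker_id, ranks = parse_header(header)
print(marker_id)
print(ranks["s"])
```

Note that these headers separate ranks with `;`, so they would likely need converting before use as database keys if the database expects a different delimiter.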

Here's the tutorial that is available:

import bz2
import pickle

# Load the existing database (a bz2-compressed pickle)
with bz2.open("mpa_vOct22_CHOCOPhlAnSGB_202212.pkl", "rb") as f:
    db = pickle.load(f)

# Add the taxonomy of the new genomes.
# Value format: (7-level NCBI taxonomy IDs, genome length),
# e.g. ('2|1224|28216|80840|80864|2828371|2259674|', 3775055)
# The strings in angle brackets and the 0 values are placeholders to replace.
db['taxonomy']['<7-level taxonomy with clade names of genome1>'] = ('<7-level NCBI taxonomy IDs of genome1>', 0)  # 0 -> length of genome1
db['taxonomy']['<7-level taxonomy with clade names of genome2>'] = ('<7-level NCBI taxonomy IDs of genome2>', 0)  # 0 -> length of genome2

# Add the information for the new marker, in the same form as the existing markers
new_marker_name = '<name of the new marker>'
db['markers'][new_marker_name] = {
    'clade': '<clade that the marker belongs to>',
    'ext': {
        '<GCA of the first external genome where the marker appears>',
        '<GCA of the second external genome where the marker appears>',
    },
    'len': 0,  # 0 -> length of the marker
    'taxon': '<taxon of the marker>',
}
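
The snippet quoted above stops after editing the in-memory dictionary. Here is a minimal sketch of writing the result back out, assuming the database is a bz2-compressed pickle as the load call implies. The toy `db` contents and the output filename `mpa_custom.pkl` are hypothetical stand-ins so the example runs on its own.

```python
import bz2
import os
import pickle
import tempfile

# Toy stand-in for the loaded database; the real one holds many more entries.
db = {"taxonomy": {}, "markers": {}}
db["markers"]["example_marker"] = {
    "clade": "example_clade",
    "ext": set(),
    "len": 0,
    "taxon": "example_taxon",
}

# Hypothetical output path; written to a temporary directory here.
out_path = os.path.join(tempfile.mkdtemp(), "mpa_custom.pkl")

# Write the modified dictionary as a bz2-compressed pickle.
with bz2.open(out_path, "wb") as f:
    pickle.dump(db, f, pickle.HIGHEST_PROTOCOL)

# Read it back to confirm the round trip preserves the structure.
with bz2.open(out_path, "rb") as f:
    reloaded = pickle.load(f)
```

This only covers the metadata file. The marker sequences themselves would presumably also have to be added to the Bowtie2 index that MetaPhlAn aligns reads against, which is not shown here.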

I have a few specific questions:

jolespin commented 1 year ago

Hi @ljmciver just checking in about this. Thanks!

ljmciver commented 1 year ago

Hi @jolespin, thanks for the ping! Sorry, I am not sure of the answers to some of the questions. Yes, the length is important. Yes, I would follow the exact formatting of the IDs as much as possible. I believe the clade is the terminal taxon. The final "t__" level is the SGB (species-level genome bin) identifier in MetaPhlAn v4; in v3 it was the strain. Hopefully those are enough answers to get you started! If you get stuck on anything, feel free to ping again and I can sync up with the main MetaPhlAn developers to answer any questions I can't.

Thanks! Lauren
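
As a rough illustration of the answers above, the following Python sketch builds a marker entry in which `len` is the sequence length and `clade` is the last (terminal) level of the lineage. The function name `marker_entry` is hypothetical. Using `|` as the level separator and storing the full lineage under `taxon` are assumptions, not confirmed in this thread. The example lineage stops at `s__` and no `t__` SGB level is invented; whether MetaPhlAn 4 accepts such an entry is also not confirmed here.

```python
def marker_entry(lineage: str, sequence: str) -> dict:
    """Build a marker dictionary from a ';'- or '|'-delimited lineage and its sequence."""
    # Normalise the delimiter to '|' (an assumption about the database's key format).
    levels = lineage.replace(";", "|").split("|")
    return {
        "clade": levels[-1],       # terminal taxon, per the answer above
        "ext": set(),              # no external genomes known for this marker
        "len": len(sequence),      # marker length in bases
        "taxon": "|".join(levels), # full lineage (assumed meaning of 'taxon')
    }


entry = marker_entry(
    "k__Eukaryota;p__Arthropoda;s__Acanthochondria_rectangularis",
    "aactctttatttacttagag",
)
print(entry["clade"], entry["len"])
```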