PacificBiosciences / pb-metagenomics-tools

Tools and pipelines tailored to using PacBio HiFi Reads for metagenomics
BSD 3-Clause Clear License

Add internal sequences to taxonomy? #75

Closed cjprybol closed 7 months ago

cjprybol commented 7 months ago

Name the workflow

Taxonomic-Profiling-Minimap-Megan and Taxonomic-Profiling-Diamond-Megan

Describe the question and context

Is it possible to add internal sequences to the taxonomy databases? It's easy enough to add internal sequences to the nt and nr databases before converting them into minimap and diamond databases for the respective workflows, but it's not clear how to provide the taxonomy information for those internal sequences, since they are not in the public NCBI and GTDB taxonomy databases.

Improvements to documentation

Please ensure your question is not already addressed in the tutorials and documentation. If it is not, please suggest where additional documentation could be provided to address the question.

I looked through the documentation and couldn't find this. I expected to find it in configuration, but I noticed just above that there is a mention:

You can always use a customized nt database, for example a subset of the NCBI nt database.

but nothing on how to expand the nt database.

Thanks!

cjprybol commented 7 months ago

Possibly found the answer: https://megan.cs.uni-tuebingen.de/t/adding-custom-entries-to-mapping-database/1611

cjprybol commented 7 months ago

I can't find the files referenced in that thread, which makes me think that they're integrated into the megan map sqlite databases? If that's true and y'all have an easy way to let users specify additional sequences and taxonomy info as part of the pre-run configuration, that would be amazing. If I'm missing something and there is an easier way, please let me know!
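
One way to check that guess is to list the tables and columns inside the mapping file. This is only a minimal sketch: it assumes the mapping file really is a plain SQLite database, `list_tables` is a made-up helper name, and the path in the usage comment is a placeholder. It opens the file read-only and makes no assumption about which tables MEGAN actually uses.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: [column names]} for a SQLite file (hypothetical helper)."""
    # Build a file: URI so the database can be opened read-only and never modified.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info returns one row per column; index 1 is the column name.
            cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [col[1] for col in cols]
        return schema
    finally:
        conn.close()


# Usage (placeholder path, replace with the mapping file you downloaded):
# for table, columns in list_tables("path/to/megan-map.db").items():
#     print(table, columns)
```

If the accession-to-taxon mappings show up as tables, that would at least confirm where the files from the forum thread went, though it says nothing about whether MEGAN supports editing them.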

dportik commented 7 months ago

Hi @cjprybol, looks like you found my thread on the MEGAN forum! I briefly tried to create a custom mapping file, but quickly gave up. There is no easy way to integrate custom genomes/taxa into the standard NCBI databases with MEGAN.

You might consider looking into this for sourmash instead: https://sourmash.readthedocs.io/en/latest/classifying-signatures.html#id4
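
For the sourmash route, the custom taxonomy is supplied as a lineage spreadsheet rather than by patching a mapping database. Below is a rough sketch of writing one for internal genomes. The column names (`ident` plus the seven ranks) are my assumption of the layout described in the sourmash docs, so check them against the linked page. `write_lineages` and the identifier `internal_isolate_001` are hypothetical, and the lineage is just an illustrative E. coli example.

```python
import csv

# Assumed column layout for a sourmash lineage spreadsheet; verify against the docs.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def write_lineages(records, out_path):
    """Write {identifier: {rank: name}} as a lineage CSV (hypothetical helper)."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["ident"] + RANKS)
        writer.writeheader()
        for ident, lineage in records.items():
            # Reject misspelled ranks early instead of silently dropping them.
            unknown = set(lineage) - set(RANKS)
            if unknown:
                raise ValueError(f"{ident}: unexpected ranks {sorted(unknown)}")
            row = {"ident": ident}
            # Ranks that are not known for an internal genome are left empty.
            row.update({rank: lineage.get(rank, "") for rank in RANKS})
            writer.writerow(row)


# Illustrative record only; the identifier is made up.
example_records = {
    "internal_isolate_001": {
        "superkingdom": "Bacteria",
        "phylum": "Proteobacteria",
        "class": "Gammaproteobacteria",
        "order": "Enterobacterales",
        "family": "Enterobacteriaceae",
        "genus": "Escherichia",
        "species": "Escherichia coli",
    }
}

# Usage: write_lineages(example_records, "internal_lineages.csv")
```

The identifiers would need to match the names of the signatures sketched from the internal genomes, otherwise the lineage rows will not be picked up.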

cjprybol commented 7 months ago

Thank you for the response and recommendation!