dib-lab / sourmash_plugin_pangenomics

tools for sourmash-based pangenome analyses
BSD 3-Clause "New" or "Revised" License

consider outputting extra info in pangenome databases #14

Open ctb opened 3 months ago

ctb commented 3 months ago

we have two problems with our current pangenomics databases -

first, they present as "regular" sourmash sketches, with abundances. This could lead to misuse/mistakes.

second, it is annoying to track extra information (e.g. lineage counts as in https://github.com/dib-lab/sourmash_plugin_pangenomics/issues/13) in a separate file.

there is an analogous issue over in sourmash, https://github.com/sourmash-bio/sourmash/issues/2216, that talks about including taxonomy files in zip databases: the idea is that we can provide various standard lineage files in the actual .zip file databases, and then switch between them using CLI options (--gtdb and --ncbi, etc.)

so one idea here would be to produce the pangenome zip file full of sketches, and then add an extra file or two that indicate it's a pangenome database. This wouldn't necessarily prevent misuse (item 1 above) unless we adopted more metadata-in-zip-files in sourmash generally, but it would help a great deal with carting around extra files (item 2), and the extra files would potentially help with debugging too.
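a minimal sketch of the "extra file in the zip" idea, using only the Python standard library. The file name `PANGENOME-INFO.json`, the helper functions, and the layout of the metadata dict are all hypothetical; none of this is an existing sourmash or plugin convention, and whether sourmash ignores the extra file when loading the zip would need to be checked.

```python
import json
import zipfile

# Hypothetical marker/metadata file name; not an existing sourmash convention.
PANGENOME_INFO = "PANGENOME-INFO.json"


def add_pangenome_info(zip_path, info):
    """Append a JSON metadata file to an existing zip database."""
    with zipfile.ZipFile(zip_path, "a") as zf:
        if PANGENOME_INFO in zf.namelist():
            raise ValueError(f"{zip_path} already contains {PANGENOME_INFO}")
        zf.writestr(PANGENOME_INFO, json.dumps(info, indent=2))


def read_pangenome_info(zip_path):
    """Return the metadata dict, or None if the marker file is absent."""
    with zipfile.ZipFile(zip_path) as zf:
        if PANGENOME_INFO not in zf.namelist():
            return None
        return json.loads(zf.read(PANGENOME_INFO))
```

a tool could call `read_pangenome_info()` before using a database: `None` means it looks like a regular zip collection, and a dict means it is a pangenome database carrying its extra info (e.g. lineage counts) along with the sketches.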

it is kinda interesting to think about how to add more metadata in general; this is the closest thing we have over in sourmash-land: https://github.com/sourmash-bio/sourmash/issues/2180
