bokulich-lab / q2-moshpit

MOdular SHotgun metagenome Pipelines with Integrated provenance Tracking
BSD 3-Clause "New" or "Revised" License
3 stars 12 forks source link

ENH: allow binning across all samples #152

Open misialq opened 5 months ago

misialq commented 5 months ago

Is your feature request related to a problem? Please describe. No.

Describe the solution you'd like Currently, the bin-contigs-metabat action performs binning in a per-sample fashion. It would be beneficial to bin across all samples, as it is indicated in the metabat's documentation. We should probably make this a default behaviour, configurable through a parameter flag (--p-all-samples).

Additional context See this paper for more context. Links to some tutorials are also available here.

Depends on https://github.com/bokulich-lab/q2-assembly/issues/82.

misialq commented 4 months ago

To make this happen we would need to assert that contig names are unique across all samples. The easiest way to do it would be right after assembly: we could just rename all the contigs that came from each assembler to resemble something like: <sample_id>_<assembler_name>_<original_contig_id> - inserting the assembler name in the middle would make it easier to ensure that we can identify the sample/contig ID from the header by splitting it on that name. Alternatively, it could just be some arbitrary string (that would make it a bit easier and more universal). On the other hand, this means that users who already assembled using MOSHPIT (and indexed the contigs + mapped etc) would need to either re-assemble or export the data, rename and re-import. Not great. I do think, though, we really should do it now rather than later - there are already two places I found where contig uniqueness is required (here and in the MAG normalization which I'm working on now)... Do you have any thoughts on this @nbokulich, @gregcaporaso? Let me know in case you need more details.

nbokulich commented 4 months ago

I worry that __ could still lead to accidental clashes, e.g., if merging across runs with non-unique sample IDs. But maybe that's an unlikely edge case that isn't currently supported anyway? It also introduces an arbitrary naming convention, which I don't love, esp. because from your description it sounds like we do not actually need that information, we just need a unique ID.

Perhaps the original contig ID could already be unique? We could assign all contigs a uuid4, as we did for the bins. This is also backwards-incompatible, but that is an issue that we can solve in other ways... make an action to rename any sequences with uuids? This seems like a good thing to have anyway, as we have already encountered the case where a user wants to import their assembled contigs (or mags) and continue with moshpit, so this could make it easier to import and use partially processed data.

misialq commented 4 months ago

Contig IDs are not unique across samples, since assembly is done on a per-sample basis. I also don't really like this naming convention proposed above... I was not crazy about a uuid either, though, because maybe sometimes it's not bad to have a more human-readable contig id in case one wants to "manually" check something (and those ids get propagated very far downstream). If we go for a uuid maybe we could use a different uuid type for the very reason that we use uuid4 for MAGs already - it would be easier to keep track of things if we could use a different system for contigs. My counter-proposal would be to use shortuuid - it should be unique enough to distinguish between contigs from many, many samples but short and readable enough to make it easier on the users. I also do like the idea of an additional action which would help with renaming!

github-actions[bot] commented 4 months ago

This PR/issue depends on: