Open misialq opened 5 months ago
To make this happen we would need to assert that contig names are unique across all samples. The easiest way to do it would be right after assembly: we could just rename all the contigs that came from each assembler to resemble something like: <sample_id>_<assembler_name>_<original_contig_id>
- inserting the assembler name in the middle would make it easier to ensure that we can identify the sample/contig ID from the header by splitting it on that name. Alternatively, it could just be some arbitrary string (that would make it a bit easier and more universal). On the other hand, this means that users who already assembled using MOSHPIT (and indexed the contigs + mapped etc) would need to either re-assemble or export the data, rename and re-import. Not great. I do think, though, we really should do it now rather than later - there are already two places I found where contig uniqueness is required (here and in the MAG normalization which I'm working on now)... Do you have any thoughts on this @nbokulich, @gregcaporaso? Let me know in case you need more details.
I worry that
Perhaps the original contig ID could already be unique? We could assign all contigs a uuid4, as we did for the bins. This is also backwards-incompatible, but that is an issue that we can solve in other ways... make an action to rename any sequences with uuids? This seems like a good thing to have anyway, as we have already encountered the case where a user wants to import their assembled contigs (or mags) and continue with moshpit, so this could make it easier to import and use partially processed data.
Contig IDs are not unique across samples, since assembly is done on a per-sample basis. I also don't really like this naming convention proposed above... I was not crazy about a uuid either, though, because maybe sometimes it's not bad to have a more human-readable contig id in case one wants to "manually" check something (and those ids get propagated very far downstream). If we go for a uuid maybe we could use a different uuid type for the very reason that we use uuid4 for MAGs already - it would be easier to keep track of things if we could use a different system for contigs. My counter-proposal would be to use shortuuid - it should be unique enough to distinguish between contigs from many, many samples but short and readable enough to make it easier on the users. I also do like the idea of an additional action which would help with renaming!
This PR/issue depends on:
Is your feature request related to a problem? Please describe. No.
Describe the solution you'd like Currently, the
bin-contigs-metabat
action performs binning in a per-sample fashion. It would be beneficial to bin across all samples, as it is indicated in the metabat's documentation. We should probably make this a default behaviour, configurable through a parameter flag (--p-all-samples
).Additional context See this paper for more context. Links to some tutorials are also available here.
Depends on https://github.com/bokulich-lab/q2-assembly/issues/82.