Open gregcaporaso opened 1 year ago
@ebolyen pointed out that we can add a method on the format that gives us tuples or a dict mapping sample ids to relative filepaths in the Archive, so that will achieve the same goal without having to create new manifest formats. So we should do that for the next release (2023.11), and take a pass to use these throughout in places where we're parsing filenames to access the sample ids.
Sounds good - thanks @gregcaporaso and @ebolyen! 🙏
Since this issue was opened in the context of the ContigSequencesDirFmt
, let's update all code that is using that format (e.g., here) as needed to use the lookup that @colinvwood added in #57, and then close this one out.
@colinvwood, could you put this on your list? It's a nice-to-have for the 2023.11 release, but not essential.
@gregcaporaso what about other formats that could use this approach, e.g. the PR @misialq linked above--should we open separate issues? that is in fact its own issue not a PR, whoops
After looking at the places in q2-assembly where we could use this new approach, I think that things will be much more refactorable once we have corresponding functionality for the mag directory format(s).
Ok, makes sense. Do we have an issue for updating the MAG directory format(s)? Could you create one if not?
Changes in q2-assembly are (at least partially) blocked by bokulich-lab/q2-assembly#66
@colinvwood and @Oddant1 noticed while working on https://github.com/bokulich-lab/q2-moshpit/issues/63 and https://github.com/bokulich-lab/q2-assembly/pull/46 that many of the formats in q2-types-genomics require some knowledge of the underlying format to identify the sample ids. This is requiring code like:
to crop up fairly frequently (that example is from https://github.com/bokulich-lab/q2-assembly/pull/46). If a developer misses this, or doesn't handle the stripping of
_contigs
correctly (e.g., replacing all occurrences, rather than just stripping the last occurrence) downstream results could have misnamed sample ids (e.g.,sample-1
could easily becomesample-1_contigs
, orsample2_contigs
[a terrible sample name] could becomesample2
).We're thinking that, after the alpha release, it will make sense to define new formats that contain manifests which explicitly map sample ids to filenames, and then add transformers from these existing formats to the new manifest formats. That way we encapsulate all knowledge about how the filenames map to sample ids in those transformers, and actions that need these data can use the manifest formats and have explicit access to sample ids. This would be similar to how we handle the "demux formats" in q2-types (see an example of how this is used here).
We would want to keep the existing formats as-is so that existing artifacts (generated pre- or post-alpha) would continue to work.