add formats that make sample (or other) ids explicit

gregcaporaso commented 1 year ago

@colinvwood and @Oddant1 noticed while working on https://github.com/bokulich-lab/q2-moshpit/issues/63 and https://github.com/bokulich-lab/q2-assembly/pull/46 that many of the formats in q2-types-genomics require some knowledge of the underlying format to identify the sample ids. This is requiring code like:

result = ContigSequencesDirFmt()

for sample_fp in samples:
    # These paths are defined in the ContigSequencesDirFmt class as
    # {sample_id}_contigs.(fa | fasta). This should get the id from a
    # name like that
    sample_id = sample_fp.name.rsplit("_contigs", 1)[0]

to crop up fairly frequently (that example is from https://github.com/bokulich-lab/q2-assembly/pull/46). If a developer misses this, or doesn't handle the stripping of _contigs correctly (e.g., replacing all occurrences, rather than just stripping the last occurrence) downstream results could have misnamed sample ids (e.g., sample-1 could easily become sample-1_contigs, or sample2_contigs [a terrible sample name] could become sample2).

We're thinking that, after the alpha release, it will make sense to define new formats that contain manifests which explicitly map sample ids to filenames, and then add transformers from these existing formats to the new manifest formats. That way we encapsulate all knowledge about how the filenames map to sample ids in those transformers, and actions that need these data can use the manifest formats and have explicit access to sample ids. This would be similar to how we handle the "demux formats" in q2-types (see an example of how this is used here).

We would want to keep the existing formats as-is so that existing artifacts (generated pre- or post-alpha) would continue to work.

gregcaporaso commented 1 year ago

@ebolyen pointed out that we can add a method on the format that gives us tuples or a dict mapping sample ids to relative filepaths in the Archive, so that will achieve the same goal without having to create new manifest formats. So we should do that for the next release (2023.11), and take a pass to use these throughout in places where we're parsing filenames to access the sample ids.

misialq commented 1 year ago

Sounds good - thanks @gregcaporaso and @ebolyen! 🙏

gregcaporaso commented 11 months ago

Since this issue was opened in the context of the ContigSequencesDirFmt, let's update all code that is using that format (e.g., here) as needed to use the lookup that @colinvwood added in #57, and then close this one out.

@colinvwood, could you put this on your list? It's a nice-to-have for the 2023.11 release, but not essential.

colinvwood commented 11 months ago

@gregcaporaso what about other formats that could use this approach, ~~e.g. the PR @misialq linked above--should we open separate issues?~~ that is in fact its own issue not a PR, whoops

colinvwood commented 11 months ago

After looking at the places in q2-assembly where we could use this new approach, I think that things will be much more refactorable once we have corresponding functionality for the mag directory format(s).

gregcaporaso commented 11 months ago

Ok, makes sense. Do we have an issue for updating the MAG directory format(s)? Could you create one if not?

colinvwood commented 11 months ago

Changes in q2-assembly are (at least partially) blocked by bokulich-lab/q2-assembly#66

bokulich-lab / q2-types-genomics

add formats that make sample (or other) ids explicit #56