Open gregcaporaso opened 7 years ago
I agree with item 1 but have some "devil's advocate" questions regarding item 2.
In some ways, the source directory could be useful as a sort of "junk drawer" for the mock community, and contributors could include other information that doesn't fit elsewhere. For example, a list of GenBank accession numbers for whole genome sequences (which might not be appropriate in the "expected taxonomy" directories, which are specific to reference databases that provide taxonomy information). Of course, we have control over this, so the files would never be "junk", just a collection of useful files that do not fit in the other directories (which are more regulated).
Naming conventions in source could also have some flexibility. For example, expected-sequences.fasta can be rather vague; instead, full-length-16S-expected-sequences.fasta or V4-domain-expected-sequences.fasta could be more informative.
What do you think?
I think that all makes sense, I'm good with it.
What should we do for shotgun metagenome datasets? I think I support keeping the <similarity-threshold>-otus requirement across the board for simplicity's sake, and such datasets could be labeled 100-otus. But would it be better to enforce this rule only for marker-gene datasets, and use different rules for metagenome datasets?
I think your suggestions would work well.
Thanks! I will make that rule standard then when I update the integrity checks.
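For illustration, the directory-naming rule could be checked along these lines. This is only a minimal sketch, not the project's actual integrity check: the function name `is_valid_otu_dir_name` is hypothetical, and it assumes the similarity threshold is a whole-number percentage from 1 to 100 (so 100-otus for shotgun metagenome datasets passes).

```python
import re

# Assumption: the threshold is an integer percent identity between 1 and 100,
# written without leading zeros, followed by the literal suffix "-otus".
OTU_DIR_PATTERN = re.compile(r"(100|[1-9][0-9]?)-otus")


def is_valid_otu_dir_name(name: str) -> bool:
    """Return True if `name` has the form <similarity-threshold>-otus."""
    # fullmatch requires the whole string to match, so trailing text is rejected.
    return OTU_DIR_PATTERN.fullmatch(name) is not None
```

Under these assumptions, names such as 97-otus and 100-otus are accepted, while otus, 101-otus, or 97-OTUs are rejected.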
<similarity-threshold>-otus
expected-sequences.fasta, and no other fasta files should be present in those directories
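A rough sketch of how the "no other fasta files" rule might be tested for a single directory follows. The helper name `unexpected_fasta_files` and the set of extensions treated as FASTA are assumptions made for this example, not part of the existing integrity checks.

```python
from pathlib import Path

# Assumption: these extensions are what counts as a "fasta file" here.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}
EXPECTED_NAME = "expected-sequences.fasta"


def unexpected_fasta_files(directory):
    """Return sorted names of FASTA files in `directory` other than
    expected-sequences.fasta (non-recursive)."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file()
        and path.suffix.lower() in FASTA_SUFFIXES
        and path.name != EXPECTED_NAME
    )
```

An empty list would mean the directory satisfies the rule; any returned names could be reported as violations. Non-FASTA files are ignored by this sketch.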