new data integrity checks

gregcaporaso commented 7 years ago

[ ] naming of OTU directories should be of the format: <similarity-threshold>-otus
[ ] expected sequences files should be named expected-sequences.fasta, and no other fasta files should be present in those directories
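
As a rough illustration of the first item, a directory-name check could look like the sketch below. The function name `is_valid_otu_dir_name`, the regular expression, and the decision to accept decimal thresholds in the range (0, 100] are assumptions made for this example, not the project's actual integrity-check code.

```python
import re

# Assumed pattern: a percent-identity threshold (integer or decimal)
# followed by the literal suffix "-otus", e.g. "97-otus" or "99.5-otus".
OTU_DIR_PATTERN = re.compile(r"(\d{1,3}(?:\.\d+)?)-otus")


def is_valid_otu_dir_name(name: str) -> bool:
    """Return True if `name` follows the <similarity-threshold>-otus format."""
    match = OTU_DIR_PATTERN.fullmatch(name)
    if match is None:
        return False
    # Assumption: thresholds are percentages greater than 0 and at most 100.
    return 0 < float(match.group(1)) <= 100
```

Under these assumptions, `97-otus` and `100-otus` pass, while names such as `otus-97` or `97_otus` are rejected.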

nbokulich commented 7 years ago

I agree with item 1 but have some "devil's advocate" questions regarding item 2.

In some ways, the source directory could be useful as a sort of "junk drawer" for the mock community, and contributors could include other information that doesn't fit elsewhere. For example, a list of GenBank accession numbers for whole genome sequences (which might not be appropriate in the "expected taxonomy" directories that are specific for reference databases that provide taxonomy information). Of course, we have control over this so the files would never be "junk", just a collection of useful files that do not fit in the other directories (which are more regulated).

Naming conventions in source could also have some flexibility. For example, expected-sequences.fasta can be rather vague; instead, full-length-16S-expected-sequences.fasta or V4-domain-expected-sequences.fasta could be more informative.
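
One way such a relaxed naming rule might be checked is sketched below. The rule that every FASTA file name must end in `expected-sequences.fasta`, and the helper name `unexpected_fasta_files`, are assumptions for illustration only; non-FASTA files (such as an accession list) are left alone.

```python
from pathlib import PurePath

# Assumed convention: a descriptive prefix is allowed, but FASTA file names
# must still end with this suffix.
EXPECTED_SUFFIX = "expected-sequences.fasta"


def unexpected_fasta_files(filenames):
    """Return the FASTA file names that do not end with the assumed suffix."""
    return [
        name
        for name in filenames
        if PurePath(name).suffix == ".fasta"
        and not PurePath(name).name.endswith(EXPECTED_SUFFIX)
    ]
```

Called on the file names found in a source directory, this returns an empty list when every FASTA file follows the assumed convention.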

What do you think?

gregcaporaso commented 7 years ago

I think that all makes sense, I'm good with it.

nbokulich commented 7 years ago

What should we do for shotgun metagenome datasets? I think I support keeping the <similarity-threshold>-otus requirement across the board for simplicity's sake, and such datasets could be labeled 100-otus. But would it be better to enforce this rule only for marker-gene datasets, and use different rules for metagenome datasets?

gregcaporaso commented 7 years ago

I think your suggestions would work well.

nbokulich commented 7 years ago

Thanks! I will make that rule standard then when I update the integrity checks.

caporaso-lab / mockrobiota

new data integrity checks #54