Closed spholmes closed 8 years ago
Are you referring to how to alter the F1000 workflow? That seems like a good idea.
Yes, but more generally for tutorials etc.: best practices have been delineated.
./data/ contains unwriteable data; ./output/ is rewriteable.
Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/
I actually modified your standard DADA2 tutorial for SISMID for exactly this reason. Also because it helps the students understand they're writing the filtered sequences to disk as new files.
For the dada2 package it might make sense to have the sequence filtering+trimming wrapper function throw an error if you're attempting to overwrite the input sequence file. I don't mind writing to the same directory, as long as there's a check in place to protect against overwriting my original data (though a separate directory is probably even better). Since the file has to be opened in "w" mode for the first chunk and in "a" mode from there on (because we're streaming chunks to control peak memory), an up-front error check against writing over the input seems all the more necessary. Otherwise one might accidentally provide the same directory as both "input/" and "output/" and end up overwriting their files anyway.
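To make the failure mode concrete, here is a minimal, language-agnostic sketch in Python of the guard described above. It is not the dada2 implementation: `same_file`, `filter_stream`, `keep`, and `chunk_lines` are hypothetical names chosen for illustration. The point is that the identity check has to run before the first chunk is written, because opening the output in "w" mode truncates it immediately.

```python
import os
from itertools import islice


def same_file(path_a, path_b):
    """Return True if both paths refer to the same file on disk."""
    if os.path.exists(path_a) and os.path.exists(path_b):
        # Catches symlinks and relative/absolute spellings of one file.
        return os.path.samefile(path_a, path_b)
    # The output usually does not exist yet, so compare resolved paths.
    return os.path.realpath(path_a) == os.path.realpath(path_b)


def filter_stream(in_path, out_path, keep, chunk_lines=4000):
    """Copy the lines for which keep(line) is true, one chunk at a time.

    Returns the number of lines written.
    """
    # Up-front guard: refuse before anything is opened for writing.
    if same_file(in_path, out_path):
        raise ValueError("output path must differ from input path: " + in_path)

    mode = "w"  # first chunk creates or truncates the output
    kept = 0
    with open(in_path) as src:
        while True:
            # Read at most chunk_lines lines to bound peak memory.
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            passed = [line for line in chunk if keep(line)]
            with open(out_path, mode) as dst:
                dst.writelines(passed)
            mode = "a"  # every later chunk appends
            kept += len(passed)
    return kept
```

Without the guard, passing the same path twice would truncate the input on the first "w" open while it is still being read, which is exactly the accident the check is meant to rule out.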
The tutorial was rewritten to create and use a filtered/ subdirectory for the filtered fasta files, so they no longer share a directory with the raw files: 06398db9b522e3b9647b534cdf87bc4d314b0085
Filter functions now check for and forbid identical input/output file paths: 7dcb444b8681370089c68d48d14dc925bb57d04c
Normally, sequence data should be fastq files that live in a data/ directory, and this directory should not be writeable by the dada2 functions. A separate directory, output/, should be the one used for intermediate data. This avoids overwriting the raw files and is considered better practice. Can we talk about this?
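The layout proposed here can be sketched outside of dada2 as follows. This is an illustration in Python under stated assumptions, not something the package does: `protect_raw_files` and `output_path_for` are hypothetical helpers, and clearing the write bits assumes POSIX-style permissions. It guards against accidents only, since the file owner can restore the bits and a superuser can ignore them.

```python
import os
import stat


def protect_raw_files(data_dir):
    """Clear the write bits on every regular file in data_dir.

    Returns the sorted list of file names that were processed.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    protected = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, mode & ~write_bits)
            protected.append(name)
    return protected


def output_path_for(raw_path, output_dir):
    """Map data/<name> to output/<name>, creating output_dir if needed."""
    os.makedirs(output_dir, exist_ok=True)
    return os.path.join(output_dir, os.path.basename(raw_path))
```

With this arrangement a workflow would call `protect_raw_files("data")` once and route every intermediate file through `output_path_for(raw_path, "output")`, so a raw file and its filtered counterpart can never share a path.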