benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Directory data and rewrite #94

Closed spholmes closed 8 years ago

spholmes commented 8 years ago

Normally, sequence data should be fastq files that live in a data/ directory, and the dada2 functions should not be able to write to that directory. A separate directory, output/, should be the one used for the intermediate data. This avoids overwriting the raw files and is considered better practice. Can we talk about this?
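A minimal sketch of that layout, written in Python purely for illustration (it is not dada2 code, and the function names `make_read_only` and `prepare_project` are hypothetical). It assumes a project folder that holds the raw files under data/, strips write permission from those files, and creates a writable output/ directory next to it:

```python
import os
import stat


def make_read_only(directory):
    """Remove write permission from every file under `directory`."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, mode & ~write_bits)


def prepare_project(project_dir):
    """Create data/ and output/ under project_dir; protect the raw files."""
    data_dir = os.path.join(project_dir, "data")      # raw input, never rewritten
    output_dir = os.path.join(project_dir, "output")  # intermediate results go here
    os.makedirs(data_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)
    make_read_only(data_dir)
    return data_dir, output_dir
```

Permission bits are only a soft safeguard (a privileged user can still overwrite the files, and behavior differs across operating systems), so this complements rather than replaces a check inside the filtering functions.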

benjjneb commented 8 years ago

Are you referring to how to alter the F1000 workflow? That seems like a good idea.

spholmes commented 8 years ago

Yes, but also more generally for tutorials etc. Best practices have been delineated:

./data/ contains the raw data and is not writable; ./output/ is rewritable.



joey711 commented 8 years ago

I actually modified your standard DADA2 tutorial for SISMID for exactly this reason. Also because it helps the students understand they're writing the filtered sequences to disk as new files.

For the dada2 package, it might make sense to have the sequence filtering+trimming wrapper function throw an error if you attempt to overwrite the input sequence file. I don't mind writing to the same directory, as long as there's a check in place to protect against overwriting my original data (though a separate directory is probably even better). Since the file-write mode has to be "w" for the first chunk and "a" from there on (because we're streaming chunks to control peak memory), an extra up-front error check against writing over the input seems all the more necessary. Otherwise one might accidentally provide the same "input/" and "output/" directory and end up overwriting their files anyway.
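To make the concern concrete, here is a rough sketch in Python (an illustration only, not the dada2 implementation; `same_file`, `filter_in_chunks`, and the `keep` predicate are hypothetical names). It assumes plain-text, 4-line fastq records, refuses to run when input and output resolve to the same path, and otherwise opens the output with "w" for the first chunk and "a" for every later one:

```python
import os


def same_file(path_a, path_b):
    """True if both paths resolve to the same location on disk."""
    return os.path.realpath(path_a) == os.path.realpath(path_b)


def read_chunk(src, chunk_records):
    """Read up to chunk_records 4-line records from an open file."""
    chunk = []
    for _ in range(chunk_records):
        record = [src.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        chunk.append(record)
    return chunk


def filter_in_chunks(in_path, out_path, keep, chunk_records=1000):
    """Stream records from in_path and write those passing keep() to out_path."""
    # Up-front guard: opening out_path with "w" would truncate the input.
    if same_file(in_path, out_path):
        raise ValueError("output path must differ from input path: " + in_path)
    mode = "w"  # first chunk creates/truncates the output file
    kept = 0
    with open(in_path) as src:
        while True:
            chunk = read_chunk(src, chunk_records)
            if not chunk and mode == "a":
                break
            with open(out_path, mode) as dst:
                for record in chunk:
                    if keep(record):
                        dst.writelines(record)
                        kept += 1
            mode = "a"  # every later chunk appends
            if len(chunk) < chunk_records:
                break
    return kept
```

Without the guard, the very first "w" open would empty the input before any record had been read. Comparing resolved paths catches different spellings of the same path and symlinks, but not hard links, so a separate output directory remains the safer convention.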

benjjneb commented 8 years ago

The tutorial was rewritten to create and use a filtered/ subdirectory for the filtered fastq files, so they no longer share a directory with the raw files: 06398db9b522e3b9647b534cdf87bc4d314b0085

benjjneb commented 8 years ago

Filter functions now check for and forbid identical input/output file paths: 7dcb444b8681370089c68d48d14dc925bb57d04c