claraqin opened this issue (closed 3 years ago):
During today's meeting, we decided that it would make sense to have the `downloadSequenceMetadata()` function write the location of the most recent sequence metadata file to `params.R`. This location would then be read by the subsequent vignettes (Process 16S/ITS Sequences, and Add Environmental Variables), and it would be modifiable by the user.
This brings up two questions for me (and maybe @lstanish and @zoey-rw have ideas):
1. Should the sequence metadata be written back to `./data/sequence_metadata/` after being run through `qcMetadata`, so that when it is read from file into subsequent vignettes, it does not have to have `qcMetadata` run on it again?
2. Should each set of outputs in `./data/outputs` be associated with the folder containing the fastq files that it processed? Does this aid in reproducibility, or does it not really matter since the sequence tables contain dnaSampleIDs?

Following up on the points above, what do you think of this mock-up for restructuring the directories? Things written in black represent parts of the current structure, and things written in red represent changes to that structure.
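Since the mock-up image itself isn't reproduced here, the following is a rough text sketch of one possible batch-oriented layout, inferred from the directory and file names mentioned in this thread (the actual mock-up may differ):

```text
data/
├── sequence_metadata/        # downloaded metadata, identified by timestamp
├── raw_sequence/             # raw fastq files (possibly read-only)
└── outputs/
    ├── <batchID>/            # one subdirectory per processing batch
    │   ├── params_<batchID>.R
    │   └── ...               # processed sequence tables, QC'd metadata
    └── <batchID>/
```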
The reliance on "processing batches" would also motivate functions for batch creation and management, such as `newBatch()`, `listBatches()`, and `setBatch()`.
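At the time of this comment these functions were only proposals; a minimal sketch of what they might look like (function names from the discussion, bodies purely illustrative):

```r
# Illustrative sketch of the proposed batch-management helpers.
# The real implementations in the batch_structure branch may differ.

.batch_env <- new.env()  # holds the active batch ID for the session

newBatch <- function(batch_id = format(Sys.time(), "%Y%m%d_%H%M%S"),
                     outputs_dir = "data/outputs") {
  batch_dir <- file.path(outputs_dir, batch_id)
  dir.create(batch_dir, recursive = TRUE, showWarnings = FALSE)
  # each batch gets its own copy of the processing parameters
  file.create(file.path(batch_dir, paste0("params_", batch_id, ".R")))
  setBatch(batch_id, outputs_dir)
  invisible(batch_id)
}

listBatches <- function(outputs_dir = "data/outputs") {
  dirs <- list.dirs(outputs_dir, full.names = FALSE, recursive = FALSE)
  dirs[nzchar(dirs)]  # drop any empty top-level entry
}

setBatch <- function(batch_id, outputs_dir = "data/outputs") {
  stopifnot(dir.exists(file.path(outputs_dir, batch_id)))
  assign("current_batch", batch_id, envir = .batch_env)
  invisible(batch_id)
}
```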
This restructuring would accomplish the following:
- Moves the destination of processing outputs from `data/raw_sequence/` to subdirectories in `outputs/`. This is useful for situations where the raw file permissions might be read-only, and is also more intuitive.
- Gives each processing batch its own copy of `params.R`, tentatively named `params_[batchID].R`. (The local copy of `params.R` would also contain the file path to the sequence metadata and other input files associated with the processing outputs.)
- Records the association between inputs and outputs, either in `params.R` or in some other log file.

I want to be cautious and not create "feature bloat" with this, though. This issue was opened because I noticed that the lack of any infrastructure to associate sequence metadata with its outputs could lead to some reproducibility headaches, and because I think the metadata-handling capabilities are one of our package's main value-adds. To me, this alone justifies a batch system. But then I started thinking of other factors that could get in the way of reproducibility, and which might benefit from being identified as different batches. Where do we stop?
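As a concrete illustration of the batch-specific `params.R` idea, a hypothetical `params_[batchID].R` might look like the following (all file names and variable names here are invented for illustration, not taken from the package):

```r
# Hypothetical contents of outputs/<batchID>/params_<batchID>.R
# (illustrative only; actual parameter names are a design decision)

BATCH_ID          <- "20210301_120000"
SEQUENCE_METADATA <- "data/sequence_metadata/mmg_metadata_20210228.csv"  # input metadata for this batch
RAW_SEQUENCE_DIR  <- "data/raw_sequence/"                                # fastq inputs
OUTPUT_DIR        <- file.path("data/outputs", BATCH_ID)                 # processed outputs
```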
Thanks in advance for your input, everyone!
Clara
(Mostly a note to self) This R package might provide some of the capabilities mentioned in my last message, so that whatever feature we create here doesn't have to take everything into account: https://github.com/cboettig/neonstore
@claraqin I think the directory restructuring makes sense, and I like the idea of creating batch IDs for unique processing batches. The format for naming processing batches isn't explicitly stated here; are you thinking a time stamp for the batch ID?

Regarding how to handle metadata pre- and post-QC checks, we still want the un-QC'd metadata to be downloaded, just in case a user wants to re-run the QC script using different parameters, or wants to see what records were lost. So I think it makes sense to expect a workflow in which both the original and QC'd metadata are saved locally.

In terms of folder naming, it might make sense to rename the 'sequence_metadata' folder 'raw_metadata' instead, which would be consistent with 'raw_sequences'. That would also fit with your suggested workflow of having QC'd metadata go into the 'outputs' folder with the processed sequence files for that batch ID. Then downstream processing code would just have to look in the outputs/ folder for that batch ID, which would streamline the workflow. My 2 cents!
@lstanish sorry for the slow response to this! I think your suggestions make a lot of sense. The one thing I'm hesitant about is the timing of when the processing batches are created vs. when the metadata is QC'd. I was imagining that when a processing batch is created, it must specify a pre-existing metadata object, so it would make sense to have a folder that contains both raw and QC'd versions of the metadata.
For the batch ID, I had been making it so that the user would have to specify an ID themselves, but I like the idea of having it be the time stamp, at least by default!
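A time-stamped default batch ID is a one-liner in base R; a sketch (the exact format string is a design choice, not something settled in this thread):

```r
# Default batch ID from a timestamp (illustrative sketch)
makeBatchID <- function(time = Sys.time()) {
  format(time, "%Y%m%d_%H%M%S")
}

makeBatchID(as.POSIXct("2021-03-01 12:30:00", tz = "UTC"))
# "20210301_123000"
```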
Addressed with merge of the `batch_structure` branch.
Many of the function revisions currently being developed require the user to specify a metadata file that will be referenced to determine which files to process for a given step, e.g. `matchFastqToMetadata()`. (This is an alternative to parsing the filenames for this information, as the original functions did.) While this new convention is more robust to differences in file naming convention, it also means that there ought to be an easy way to remember which metadata files are associated with which set(s) of fastq files. Currently, metadata files which are downloaded to the `sequence_metadata` subdirectory are simply identified by a timestamp.

@zoey-rw suggested having one "complete" copy of the metadata, containing the metadata for all soil microbe marker gene sequence records. This would be compatible with any fastq files that we attempt to match to it, and would remain in a static location (perhaps defined by `params.R`) so it can be easily accessed.
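The kind of matching described above, keying on `dnaSampleID` rather than parsing filenames, can be sketched as follows. This is not the actual `matchFastqToMetadata()` implementation, just an illustration of the approach:

```r
# Sketch: match fastq files to metadata rows by dnaSampleID rather than
# by parsing sample info out of the filenames.
# (Illustrative only; not the package's matchFastqToMetadata().)
matchFastqToMetadata_sketch <- function(fastq_files, metadata) {
  # for each fastq file, find the first metadata row whose dnaSampleID
  # appears in the file's basename (NA if no match)
  idx <- vapply(fastq_files, function(f) {
    hits <- which(vapply(metadata$dnaSampleID, grepl, logical(1),
                         x = basename(f), fixed = TRUE))
    if (length(hits)) hits[1] else NA_integer_
  }, integer(1))
  data.frame(fastq = fastq_files, metadata[idx, , drop = FALSE],
             row.names = NULL)
}
```

Because the lookup depends only on `dnaSampleID`, a single "complete" metadata table works for any set of fastq files thrown at it, which is the appeal of @zoey-rw's suggestion.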