claraqin / neonMicrobe

Processing NEON soil microbe marker gene sequence data into ASV tables.
GNU Lesser General Public License v3.0

Better associations between metadata and sequence data #37

Closed: claraqin closed this issue 3 years ago

claraqin commented 3 years ago

Many of the function revisions currently being developed require the user to specify a metadata file that will be referenced to determine which files to process for a given step, e.g. matchFastqToMetadata(). (This is an alternative to parsing the filenames for this information, as the original functions did.) While this new convention is more robust to differences in file naming convention, it also means that there ought to be an easy way to remember which metadata files are associated with which set(s) of fastq files. Currently, metadata files which are downloaded to the sequence_metadata subdirectory are simply identified by a timestamp.
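The convention described above might look something like the following sketch. This is illustrative only: the file name and the exact argument order of matchFastqToMetadata() are assumptions, not the function's documented signature.

```r
library(neonMicrobe)

# Metadata files are currently identified only by a download timestamp,
# e.g. (hypothetical name):
meta <- read.csv("data/sequence_metadata/mmg_soilMetadata_16S_2021-01-15.csv")

# Match a set of fastq files against that metadata, rather than parsing
# sample information out of the filenames themselves:
fastq_files <- list.files("data/raw_sequences", pattern = "\\.fastq",
                          full.names = TRUE)
matched <- matchFastqToMetadata(fastq_files, meta)
```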

@zoey-rw suggested having one "complete" copy of the metadata, containing the metadata for all soil microbe marker gene sequence records. This would be compatible with any fastq files that we attempt to match to it, and would remain in a static location (perhaps defined by params.R) so it can be easily accessed.

claraqin commented 3 years ago

During today's meeting, we decided that it would make sense to have the downloadSequenceMetadata function write to params.R the location of the most recent sequence metadata file. This would then be read in the subsequent vignettes (Process 16S/ITS Sequences, and Add Environmental Variables), and it would be modifiable by the user.

This brings up two questions for me (and maybe @lstanish and @zoey-rw have ideas):

  1. This is a relatively minor workflow decision, but what are your thoughts on having the metadata saved to ./data/sequence_metadata/ after being run through qcMetadata, so that when it is read from file into subsequent vignettes, it does not have to have qcMetadata run on it again?
  2. What are your thoughts on having a copy of the particular sequence metadata file saved to the ./data/outputs folder containing the fastq files that it processed? Does this aid in reproducibility, or does it not really matter since the sequence tables contain dnaSampleIDs?
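One way the decision above could work in practice is sketched below. The helper name recordMetadataPath() and the variable name LATEST_METADATA_PATH are hypothetical; this is a sketch of the idea, not the package's implementation.

```r
# Illustrative sketch: have downloadSequenceMetadata() record the location
# of the most recent metadata file in params.R, where subsequent vignettes
# can read (and the user can modify) it.
recordMetadataPath <- function(metadata_path, params_file = "params.R") {
  lines <- if (file.exists(params_file)) readLines(params_file) else character(0)
  # Drop any previously recorded path, then append the new one
  lines <- lines[!grepl("^LATEST_METADATA_PATH", lines)]
  lines <- c(lines, sprintf('LATEST_METADATA_PATH <- "%s"', metadata_path))
  writeLines(lines, params_file)
}
```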
claraqin commented 3 years ago

Following up on the points above, what do you think of this mock-up for restructuring the directories? Things written in black represent parts of the current structure, and things written in red represent changes to that structure.

(attached image IMG_1140: mock-up of the proposed directory restructuring)

The reliance on "processing batches" would also motivate functions for batch creation and management, such as newBatch(), listBatches(), and setBatch().
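A rough sketch of what those batch-management helpers could look like is below. The function names come from the discussion above, but the signatures, the data/batches/ location, and the neonMicrobe.batch option name are all assumptions for illustration.

```r
# Hypothetical batch-management helpers (sketch only)
newBatch <- function(metadata_file,
                     batch_id = format(Sys.time(), "%Y%m%d%H%M%S")) {
  batch_dir <- file.path("data", "batches", batch_id)
  dir.create(batch_dir, recursive = TRUE)
  # Keep a copy of the metadata this batch was created from
  file.copy(metadata_file, file.path(batch_dir, basename(metadata_file)))
  options(neonMicrobe.batch = batch_id)
  invisible(batch_id)
}

listBatches <- function() {
  list.dirs(file.path("data", "batches"),
            recursive = FALSE, full.names = FALSE)
}

setBatch <- function(batch_id) {
  options(neonMicrobe.batch = batch_id)
}
```

Defaulting the batch ID to a timestamp keeps batch creation zero-configuration while still letting the user pass a memorable name.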

This restructuring would accomplish the following:

I want to be cautious and not create "feature bloat" with this, though. This issue was opened because I noticed that the lack of any infrastructure to associate sequence metadata with its outputs could lead to some reproducibility headaches, and because I think the metadata-handling capabilities are one of our package's main value-adds. To me, this alone justifies a batch system. But then I started thinking of other factors that could get in the way of reproducibility, and which might benefit from being identified as different batches. Where do we stop?

  1. Differences between sequence metadata files
  2. Changes to taxonomic reference files
  3. Changes to other sample data inputs, like NEON soil data
  4. Changes to NEON data structures
  5. Changes in processing parameters, like maxEE or truncLen
  6. Updates to the R package or its dependencies

Thanks in advance for your input, everyone!

Clara

claraqin commented 3 years ago

(Mostly a note to self) This R package might provide some of the capabilities mentioned in my last message, so that whatever feature we create here doesn't have to take everything into account: https://github.com/cboettig/neonstore

lstanish commented 3 years ago

@claraqin I think the directory restructuring makes sense, and I like the idea of creating batch IDs for unique processing batches. The format for naming processing batches isn't explicitly stated here; are you thinking of a timestamp for the batch ID?

Regarding how to handle metadata pre- and post-QC checks: we still want the un-QC'ed metadata to be downloaded, in case a user wants to re-run the QC script with different parameters or wants to see which records were lost. So I think it makes sense to expect a workflow in which both the original and QC'ed metadata are saved locally.

In terms of folder naming, it might make sense to rename the 'sequence_metadata' folder to 'raw_metadata', which would be consistent with 'raw_sequences'. That would also fit your suggested workflow of having QC'ed metadata go into the 'outputs' folder with the processed sequence files for that batch ID. Then downstream processing code would only have to look in the outputs/ folder for that batch ID, which would streamline the workflow. My 2 cents!
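One reading of the layout suggested above, with assumed folder names (only 'raw_sequences' and the renamed 'raw_metadata' come from the discussion; the rest is illustrative):

```
data/
├── raw_metadata/        # un-QC'ed metadata, as downloaded (renamed from sequence_metadata/)
├── raw_sequences/       # fastq files, as downloaded
└── outputs/
    └── <batch_id>/      # one folder per processing batch
        ├── qc_metadata/ # QC'ed copy of the metadata used for this batch
        └── ...          # processed sequence files for this batch
```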

claraqin commented 3 years ago

@lstanish sorry for the slow response to this! I think your suggestions make a lot of sense. The one thing I'm hesitant about is the timing of when the processing batches are created vs. when the metadata is QC'd. I was imagining that when a processing batch is created, it must specify a pre-existing metadata object, so it would make sense to have a folder that contains both raw and QC'd versions of the metadata.

For the batch ID, I had been making it so that the user would have to specify an ID themselves, but I like the idea of having it be the time stamp, at least by default!

claraqin commented 3 years ago

Addressed with the merge of the batch_structure branch.