claraqin / neonMicrobe

Processing NEON soil microbe marker gene sequence data into ASV tables.
GNU Lesser General Public License v3.0

Metadata/downloading QC checklist #29

Open claraqin opened 3 years ago

claraqin commented 3 years ago

We need to make the following changes to the workflow, particularly in the Download NEON Data vignette, to prevent QC-related issues from complicating processes downstream.

  1. Metadata file(s) should be saved by default
  2. Check for pre-existing downloads
  3. Check for duplicate sample IDs (due to either re-sampling or labeling errors; in either case, choose which file to retain)
  4. Remove QC-flagged data
  5. Separate metadata into 16S and ITS. This could occur at the phyloseq step, or before downloading raw sequences
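
As a rough illustration of item 3, the duplicate check could look something like the sketch below. It is written in plain Python rather than R only to keep it self-contained. The column name `dnaSampleID` and the "keep the first row" retention rule are assumptions made for the example, not decisions from this thread.

```python
from collections import defaultdict


def find_duplicate_samples(records, id_field="dnaSampleID"):
    """Group metadata rows by sample ID and return only the IDs seen more than once.

    `id_field` is an assumed column name; swap in whichever ID column applies.
    """
    groups = defaultdict(list)
    for row in records:
        groups[row[id_field]].append(row)
    return {sid: rows for sid, rows in groups.items() if len(rows) > 1}


def keep_first(records, id_field="dnaSampleID"):
    """Retain the first row seen for each sample ID.

    This is a placeholder rule; the actual choice of which file to retain
    is still open in the checklist.
    """
    seen = set()
    kept = []
    for row in records:
        if row[id_field] not in seen:
            seen.add(row[id_field])
            kept.append(row)
    return kept
```

Reporting the duplicates separately from resolving them lets a user inspect the conflicting rows before any file is dropped.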

In addition, @lstanish suggests that it could be good to reorganize the columns in the metadata table so the most important columns come first. What are some columns to put first in the metadata table?

claraqin commented 3 years ago

@lstanish has completed 1 and 5 in the above checklist.

Just a thought about the column order in the metadata table: because the metadata consists of several stackByTable CSVs joined together, the columns are primarily organized by which CSV they came from. For example, the first several columns all have to do with the raw data files, and the next several have to do with sequencing. We could revise the order in which we join the CSVs so that the columns correspond to the order in which processing took place in the lab, i.e., raw data files, then DNA extraction, then PCR amplification, and finally marker gene sequencing. I'm more agnostic as to the order of columns within these broader groupings.

lstanish commented 3 years ago

@claraqin qcMetadata function ready for testing! Function is in the code folder. Current functionality:

To add:

Other functionality that would be good to add:

Other tests to run:

zoey-rw commented 3 years ago

@claraqin @lstanish Tested this function and pushed a small change: the output is now a dataframe, which can be used the same way as the input dataframe.

As you referenced, the QC function cannot handle a test dataset that was generated using targetGene="all". Perhaps in that case, the QC function could use a loop to essentially run twice, creating an ITS output and a 16S output, and combining them back into an "all" dataframe (or outputting both separately in a list format).
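
A minimal sketch of that loop, in plain Python rather than R, assuming the metadata rows carry a `targetGene` field whose values are exactly "16S" and "ITS" (the real labels may differ). Here `qc_func` is a hypothetical stand-in for the per-gene QC step, not the actual qcMetadata signature.

```python
def qc_by_target_gene(records, qc_func, genes=("16S", "ITS")):
    """Run a per-gene QC function once per target gene, then recombine.

    Returns both shapes discussed above: a dict of per-gene outputs and a
    single combined list. `qc_func` is a hypothetical callable that takes
    a list of rows for one gene and returns the rows that pass QC.
    """
    per_gene = {}
    for gene in genes:
        # Assumed field name and labels; adjust to the real metadata.
        subset = [r for r in records if r.get("targetGene") == gene]
        per_gene[gene] = qc_func(subset)
    # Concatenate in the order of `genes` to rebuild the "all" table.
    combined = [row for gene in genes for row in per_gene[gene]]
    return per_gene, combined
```

Returning both the per-gene outputs and the combined table leaves the choice between the two output formats to the caller.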

lstanish commented 3 years ago

@zoey-rw Thanks for making that update to output a dataframe as well as a hard-copy file! It's good to know that's a useful output. I am curious how this function will behave if you use the params file to output the QCed data. Did you happen to test that?

Regarding making the function useful for targetGene='all', is this a useful feature? I'm wondering because the data need to be parsed by targetGene for dada2 and all of the downstream analyses keep the 16S and ITS data separate. It's definitely possible and wouldn't be hard to allow the function to QC 16S and ITS in the same function call, just wondering whether that's something that users will want to do.

lstanish commented 3 years ago

@claraqin Added a user option to remove records containing a NEON data flag (any of the qaqcStatus fields, and dataQF).
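
For illustration only, the flag filter could be sketched as follows in plain Python (not the R implementation in the repo). It assumes that a flagged record is one where any field whose name contains "qaqcStatus" equals "Fail", or where `dataQF` is non-empty. Both the fail value and that reading of `dataQF` are assumptions.

```python
def is_flagged(row, fail_value="Fail"):
    """Return True if any qaqcStatus-type field fails or dataQF is set.

    The "Fail" value and the non-empty-dataQF rule are assumptions made
    for this sketch.
    """
    for field, value in row.items():
        if "qaqcstatus" in field.lower() and value == fail_value:
            return True
    # Treat missing, None, or whitespace-only dataQF as "no flag".
    return bool((row.get("dataQF") or "").strip())


def remove_flagged(records, remove=True):
    """Drop flagged rows when `remove` is True; otherwise return all rows."""
    if not remove:
        return list(records)
    return [r for r in records if not is_flagged(r)]
```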

lstanish commented 3 years ago

@claraqin @zoey-rw Made one minor update to the error message if outDir="" and pushed the update. Any luck testing any of the un-checked items above?