Streamlining QAQC Workflows

jaxinewolfe commented 4 years ago

Here are two workflows to consider for the data portal submissions QAQC process. Workflow A is currently implemented; however, I am presenting Workflow B as a potential variation. For example, there may be certain tests (existing now or developed in the future) for which Workflow B would prove more efficient. Feel free to propose additional methods as well!

Workflow A:

1. Cycle through each protocol sheet
2. Perform QAQC
3. Aggregate results from each iteration
4. Output results

Workflow B:

1. Harvest relevant data/info from each sheet
2. Compile into one unique data frame, vector, or list
3. Perform QAQC
4. Output results
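
To make the difference concrete, here is a minimal sketch of both workflows. It is written in plain Python purely for illustration (the portal itself is an R/Shiny app), and the sheet names, column names, site code, and the `check_row` test are all hypothetical stand-ins rather than anything from the actual code base.

```python
# Hypothetical stand-in for one submission: sheet name -> list of row dicts.
submission = {
    "sample_metadata": [
        {"site_code": "SITE-A", "latitude": 37.2},
        {"site_code": "", "latitude": 95.0},
    ],
    "seagrass_cover": [
        {"site_code": "SITE-A", "latitude": 37.2},
    ],
}


def check_row(sheet, index, row):
    """Hypothetical QAQC test: return a list of flags for one row."""
    flags = []
    if not row.get("site_code"):
        flags.append((sheet, index, "missing site_code"))
    lat = row.get("latitude")
    if lat is None or not -90 <= lat <= 90:
        flags.append((sheet, index, "latitude out of range"))
    return flags


def workflow_a(sheets):
    """Cycle through each sheet, run QAQC on it, then aggregate the results."""
    results = []
    for name, rows in sheets.items():
        sheet_flags = []
        for i, row in enumerate(rows):
            sheet_flags.extend(check_row(name, i, row))
        results.extend(sheet_flags)  # aggregate the results of this iteration
    return results


def workflow_b(sheets):
    """Harvest rows from every sheet into one table, then run QAQC once."""
    combined = [
        (name, i, row)
        for name, rows in sheets.items()
        for i, row in enumerate(rows)
    ]
    results = []
    for name, i, row in combined:
        results.extend(check_row(name, i, row))
    return results


flags_a = workflow_a(submission)
flags_b = workflow_b(submission)
print(flags_a)
```

For a row-level test like this one, both workflows produce the same flags; the difference is only in when the sheets are combined.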

jslefche commented 4 years ago

I have been following Workflow B personally. Combining everything into a single data.frame avoids having to re-address the same issues in data shared across separate sheets (e.g., metadata such as lat/longs). You can then cycle over columns to implement specific QAQC procedures (e.g., convert % cover to Braun-Blanquet bins for seagrass cover). Michael might have more thoughts on its feasibility within the current structure.
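
As a rough illustration of the column-wise step, the sketch below bins percent cover into Braun-Blanquet-style classes. It is plain Python rather than R, and the breakpoints (5, 25, 50, 75) are an assumption based on a commonly cited form of the scale, not values taken from this thread; the scale actually used for seagrass cover may differ. The function and column names are hypothetical.

```python
from bisect import bisect_right

# ASSUMED breakpoints (% cover) separating classes 1-5; not from the thread.
BREAKS = [5, 25, 50, 75]


def braun_blanquet_bin(percent_cover):
    """Map a percent cover value to an assumed Braun-Blanquet class (0-5).

    Returns None for missing or out-of-range values so that a separate
    QAQC test can flag them instead of silently binning them.
    """
    if percent_cover is None or not 0 <= percent_cover <= 100:
        return None
    if percent_cover == 0:
        return 0
    # Class 1: (0, 5), 2: [5, 25), 3: [25, 50), 4: [50, 75), 5: [75, 100]
    return bisect_right(BREAKS, percent_cover) + 1


# Hypothetical combined table: one row per observation, drawn from all sheets.
combined = [
    {"sheet": "seagrass_cover", "percent_cover": 3},
    {"sheet": "seagrass_cover", "percent_cover": 40},
    {"sheet": "seagrass_cover", "percent_cover": 120},
]

# Cycle over the column once for the whole combined table.
for row in combined:
    row["bb_class"] = braun_blanquet_bin(row["percent_cover"])

print([row["bb_class"] for row in combined])
```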

Jonathan S. Lefcheck, Ph.D.
Tennenbaum Coordinating Scientist
MarineGEO: https://marinegeo.si.edu/
Smithsonian Institution
Phone: +1 (443) 482-2443
www.jonlefcheck.net

From: Jaxine Wolfe
Sent: Thursday, May 28, 2020 2:55 PM
To: MarineGEO/data_portal
Subject: [MarineGEO/data_portal] Streamlining QAQC Workflows (#37)


mlonneman commented 4 years ago

The data submission process has been streamlined, and the QC process reflects Workflow A more than B. The portal handles data in largely the same structure in which it’s uploaded: an Excel spreadsheet is read in as a list of tables, each table representing an Excel sheet. At the end of the submission process, each table (sheet) is exported as a CSV. This makes it much simpler to maintain the link between the uploaded Excel file, output CSVs, and QC results, which is essential for tracking provenance and supporting data providers. Earlier versions of the app may have split or combined individual tables based on site code and date, which both complicated the code and occasionally led to issues where the metadata associated with each output table was mismatched or lost (where did the table originate, how should the CSV be named, and where should it go?).
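
A minimal sketch of that one-sheet-in, one-CSV-out structure, assuming the workbook has already been read into a mapping of sheet name to rows. It is plain Python for illustration only (the portal is an R/Shiny app), and `export_tables`, the file name, and the manifest fields are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def export_tables(tables, source_file, out_dir):
    """Write each table (sheet) to its own CSV and record where it came from."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for sheet, rows in tables.items():
        # Collect column names in first-seen order across all rows.
        columns = []
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)
        csv_path = out_dir / f"{sheet}.csv"
        with csv_path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=columns)
            writer.writeheader()
            writer.writerows(rows)
        # One manifest entry per sheet keeps upload, CSV, and QC results linked.
        manifest.append({
            "source_file": source_file,
            "sheet": sheet,
            "output_csv": csv_path.name,
            "n_rows": len(rows),
        })
    return manifest


# Hypothetical submission, already read in as a list of tables.
tables = {
    "sample_metadata": [{"sample_id": "S1", "site_code": "SITE-A"}],
    "seagrass_cover": [
        {"sample_id": "S1", "percent_cover": 40.0},
        {"sample_id": "S2", "percent_cover": 10.0},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    manifest = export_tables(tables, "submission.xlsx", tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())

print(manifest)
```

Because no table is split or merged along the way, each output CSV maps back to exactly one sheet of the uploaded file.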

An important point is that QC tests only inspect data; they do not actively curate it. We’re evaluating to what degree uploaded tables match our schemas, which we need to know before any aggregation or curation is done. Ultimately, the tests evaluate the presence or absence of columns, use of invalid values or data types, and proper use of linking ID (key) variables across tables (or will, as many tests are still under development). Any violation is flagged and linked to both the uploaded Excel file and the output CSV table. Actual curation is more effectively and reliably done in the data lake, or at minimum through an R script outside of Shiny, so that we have a record of lineage (we can track which specific tables, columns, rows, and values were updated).
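
The three kinds of test described above could look roughly like the sketch below. It is plain Python for illustration, not the portal's R code; the schema, sheet names, column names, and flag fields are all hypothetical. The function only reads the tables and returns flags; it never edits a value.

```python
import copy

# Hypothetical schema: sheet name -> expected column -> expected Python type.
SCHEMA = {
    "sample_metadata": {"sample_id": str, "site_code": str, "latitude": float},
    "seagrass_cover": {"sample_id": str, "percent_cover": float},
}


def inspect_tables(tables, source_file):
    """Return a list of flags; the input tables are never modified."""
    flags = []

    def flag(sheet, issue, row=None):
        flags.append({
            "source_file": source_file,     # the uploaded Excel file
            "output_csv": f"{sheet}.csv",   # the CSV exported for this sheet
            "row": row,
            "issue": issue,
        })

    for sheet, rows in tables.items():
        expected = SCHEMA.get(sheet)
        if expected is None:
            flag(sheet, "no schema defined for sheet")
            continue
        # Test 1: presence or absence of columns.
        present = set().union(*(row.keys() for row in rows))
        for col in sorted(set(expected) - present):
            flag(sheet, f"missing column: {col}")
        for col in sorted(present - set(expected)):
            flag(sheet, f"unexpected column: {col}")
        # Test 2: invalid data types.
        for i, row in enumerate(rows):
            for col, col_type in expected.items():
                if col in row and not isinstance(row[col], col_type):
                    flag(sheet, f"invalid type in column: {col}", row=i)

    # Test 3: linking ID (key) values must exist in the metadata table.
    known_ids = {row.get("sample_id") for row in tables.get("sample_metadata", [])}
    for sheet, rows in tables.items():
        if sheet == "sample_metadata":
            continue
        for i, row in enumerate(rows):
            if "sample_id" in row and row["sample_id"] not in known_ids:
                flag(sheet, "sample_id not found in sample_metadata", row=i)
    return flags


uploaded = {
    "sample_metadata": [{"sample_id": "S1", "site_code": "SITE-A"}],
    "seagrass_cover": [
        {"sample_id": "S1", "percent_cover": 40.0},
        {"sample_id": "S9", "percent_cover": "high"},
    ],
}
before = copy.deepcopy(uploaded)
qc_flags = inspect_tables(uploaded, "submission.xlsx")
for item in qc_flags:
    print(item)
```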

Finally, there’s no computational benefit at this scale from aggregating data before running any tests. The sample metadata table is likely to be the only table uploaded multiple times (each submission will likely involve several unique protocols, not multiple uploads of a single protocol), and aggregating all sample metadata tables before testing site codes, coordinates, etc. will not be noticeably faster. In that sense, it seems more important to prioritize maintaining the structure of the uploaded data to reduce errors while still giving us a snapshot of the condition of the data and metadata.
