NDCLab / lab-devOps

NDCLab management and operations

Data Monitoring Pipeline Planning & Summary #198

Closed: F-said closed this issue 1 year ago

F-said commented 2 years ago

Purpose

Why bother?

Encompasses Issues

#190
#189
#186
#185
#182
#181
#180

Informal UML: 10,000_feet.drawio

Ideas

F-said commented 2 years ago

@georgebuzzell @jessb0t What are some priorities for our HPC datasets? I assume we have the following needs.

And I assume these are our wants:

Is there anything I am missing?

jessb0t commented 2 years ago

@F-said How are you defining "shareability" in this context?

F-said commented 2 years ago

@jessb0t Shareable across our HPC and, since DataLad supports it, across the internet.

georgebuzzell commented 2 years ago

@F-said Great question. Here are the priorities (please feel free to ask clarifying questions, or to challenge whether some/all of these are really priorities):

Highest priority, "must haves"

  1. It should be impossible, or very difficult, to permanently delete data by accident. Or, if deletion is relatively easy, then recovery should be easy too. Note that this is not the same as having the ability to "revert" data to prior states. That is a separate feature and a very, very nice to have, but not a need per se.
  2. Datasets should have strict control over access to identifiable data, and, as a general rule, datasets should be converted to deidentified formats as early in the processing stream as possible, and ALL lab members should be able to see deidentified data.
  3. Datasets should be created/maintained with the assumption that errors WILL OCCUR and that, more likely than not, data/datasets will need to be corrected one or more times during the lifetime of a dataset (with the lifetime spanning collection and all analyses). Datasets, their organization, naming standards, etc., should all be created in such a way that it is "easy" and never a "hassle" to correct mistakes. This includes updates/changes to data collection instruments (surveys, experiment code) during a study, as well as identifying/fixing errors in preprocessing scripts that then require preprocessed data to be reprocessed.
  4. Datasets must have at least a very basic data dictionary that lists each type of data, the relevant naming convention, and a very brief (one-line) description of what it is.
  5. Datasets must have at least a very basic "data tracker", in which each row is a participant, and columns indicate participant ID and each type of data (from the data dictionary); at a minimum, the corresponding cell in the tracker is coded to indicate whether the data is missing, and if present, whether it has been checked, and if checked, whether it has been preprocessed. It is a must have for this tracker to exist, and a must have that there is at least a basic protocol describing how to manually update it whenever data is added (or at a specified interval). A very nice to have is for this tracker to be automatically updated via scripts (e.g., the checking script can update the tracker; see the sketch after this list).
  6. Datasets should allow for modular use/reuse. That is, it should be easy for multiple analysis projects to pull from the same dataset.
  7. Datasets should follow a similar organizational structure (standard folder names, etc.), similar naming conventions, and the data itself should use the same standard format (e.g., BIDS).
  8. Standard protocols (ideally scripts) should exist that can be periodically run to check for the presence of typos, duplicates, naming errors, missing data, lack of encryption, etc. Ideally, the same script(s) will also move the data from raw to checked (see the sketch after this list). Note that it is a "must have" to at least have a protocol for how this is done. A very, very nice to have is that this is done via one or more scripts that are periodically run, manually, by a study lead. Finally, a nice to have, but not a priority at all, is having such scripts run automatically.
  9. With the exception of raw data for a data collection project, which is often manually added, all data in a dataset is deterministically manipulated/moved/checked/transformed/processed by code, and the code that was used is present.
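
For concreteness, here is a minimal sketch of what the semi-automated checking script in items 5 and 8 could look like. All file names, the folder layout, the status codes, and the data-dictionary columns (`datatype`, `pattern`) are illustrative assumptions, not agreed lab standards:

```python
# Minimal sketch: read a data dictionary, scan raw/ for each participant's
# files, and rewrite the central tracker. Everything here is hypothetical.
import csv
import re
from pathlib import Path

RAW = Path("raw")          # assumed incoming-data directory
STATUS = {"missing": "0", "present": "1", "checked": "2", "preprocessed": "3"}

def load_dictionary(path="data-dictionary.csv"):
    """Assumed columns: datatype, pattern (filename regex), description."""
    with open(path, newline="") as f:
        return {row["datatype"]: row["pattern"] for row in csv.DictReader(f)}

def update_tracker(tracker_path, participants, dictionary):
    """Code each (participant, datatype) cell as missing or present, based on
    whether a file matching the naming pattern exists in raw/."""
    rows = []
    for pid in participants:
        row = {"id": pid}
        for dtype, pattern in dictionary.items():
            hits = [p for p in RAW.glob(f"{pid}*") if re.search(pattern, p.name)]
            row[dtype] = STATUS["present"] if hits else STATUS["missing"]
        rows.append(row)
    with open(tracker_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", *dictionary])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    dictionary = load_dictionary()
    # Assumes filenames start with the participant ID followed by "_".
    participants = sorted({p.name.split("_")[0] for p in RAW.iterdir()})
    update_tracker("central-tracker.csv", participants, dictionary)
```

The same loop could later be extended to upgrade cells to "checked"/"preprocessed" and to move verified files out of raw, which is what makes the tracker and checking script natural to build together.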

"very nice to have, but not must haves"

  1. A very nice to have (but not a must) is to have datasets set up as DataLad datasets, which allow for tracking provenance, as well as ease of copying, etc. (see the sketch after this list).
  2. A very nice to have (but currently not a must): it should be fairly easy to push the deidentified parts of a dataset to a public repository somewhere.
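
As a rough illustration of the provenance point, here is a minimal sketch using DataLad's Python API (assuming the `datalad` package is installed; the dataset name, folder names, and `preprocess.py` are placeholders):

```python
# Minimal sketch: create a DataLad dataset and record a processing step.
import datalad.api as dl

dl.create(path="my-study")                            # new DataLad dataset
# ... copy raw files into my-study/sourcedata/ here ...
dl.save(dataset="my-study", message="Add raw data")   # snapshot the raw state

# `datalad run` records the exact command, so any derived file can be traced
# back to the code and inputs that produced it.
dl.run(
    "python preprocess.py sourcedata/ derivatives/",
    dataset="my-study",
    message="Preprocess raw data",
)
```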

@F-said @jessb0t comments/questions welcome

jessb0t commented 2 years ago

@F-said : 3, 4, 5, and part of 8 are squarely on my plate. As discussed, we will work together so that I create these in lockstep with your technical work.

@georgebuzzell Having played around with the data dictionary and central tracker, conceptually, for existing studies, I honestly do not think that we should take any time to create manual checks and updates to the tracker. I feel that it would take as much work to create manual processes (which are not scalable) as to create semi-automated processes (that is, a basic script plus a protocol for the study lead to run it). The reason I feel this way is that, for example, the online studies have ~30 surveys, each of which has 10+ questions. Verifying that a given survey/score is available for a given participant is visually very difficult, and a manual process would be time-consuming and error-prone.

Within the same project time-scale, I think we can get simple, semi-automated processes in place that prevent data monitoring from being a burden on study leads (and ensure higher-quality trackers). I suspect that you and I are on the same page here (namely, utilize scripting in the immediacy to create a simple system that can be refined, improved, and made even more automated over time), but I wanted to be clear that the protocols I have started developing require some level of scripting (that is, there is no "fully manual" option, because existing studies are so complex as to make such an option seemingly nonviable to me).
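
To make that concrete, here is a minimal sketch of the kind of check that is visually very difficult but trivial for a script, assuming a hypothetical wide-format survey export with one row per participant and one `*_score` column per survey:

```python
# Minimal sketch: flag every (participant, survey score) cell that is empty.
# The file name and column layout are hypothetical.
import pandas as pd

df = pd.read_csv("survey-export.csv")           # one row per participant
score_cols = [c for c in df.columns if c.endswith("_score")]

missing = (
    df.melt(id_vars="participant_id", value_vars=score_cols,
            var_name="survey", value_name="score")
      .loc[lambda d: d["score"].isna(), ["participant_id", "survey"]]
)
print(missing.to_string(index=False))           # every participant/survey gap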