PugetSoundClinic-PIT / Fare-Fair

data considerations 5/21 #16

Open samMint opened 1 year ago

Two documents about dataset procedures below. I’m hoping that we can use these to think about how we want to structure and normalize our data when it comes time to combine everything we’ve been screening. I think there are some tricky (but fun) curation considerations right around the corner. I structured my thinking around 1. good curation practice (i.e., we want to have a balance between tractability and expressivity) and 2. making sure things are well organized for later extraction and analysis.

normalization: I’m hoping that, as we update it, this can turn into a more polished rundown of our decisions surrounding standards, normalization, and cleaning. It’s a mini protocol, and we might end up with more than one as different versions of these datasets are created. This one addresses how we extract and combine the bibliographic information provided by our queried databases.
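To make the "extract and combine" step a bit more concrete, here is a minimal sketch of one way the combining could work. It is not what the normalization doc prescribes: the field names (`doi`, `title`, `year`, `source`) and the matching rule (DOI first, then normalized title plus year) are assumptions for illustration only.

```python
import re
import unicodedata


def normalize_doi(raw):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes; None if empty."""
    if not raw:
        return None
    doi = raw.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def normalize_title(raw):
    """Fold case, accents, punctuation, and whitespace so near-identical titles match."""
    text = unicodedata.normalize("NFKD", raw or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def record_key(record):
    """Hypothetical matching rule: prefer the DOI, fall back to title + year."""
    doi = normalize_doi(record.get("doi"))
    if doi:
        return ("doi", doi)
    year = str(record.get("year") or "").strip()
    return ("title_year", normalize_title(record.get("title")), year)


def combine(records):
    """Merge records sharing a key, remembering which databases returned each one."""
    merged = {}
    for rec in records:
        key = record_key(rec)
        if key not in merged:
            merged[key] = dict(rec, sources=[rec.get("source")])
            continue
        kept = merged[key]
        if rec.get("source") not in kept["sources"]:
            kept["sources"].append(rec.get("source"))
        # Fill in fields the first copy was missing (e.g. an abstract).
        for field, value in rec.items():
            if field != "source" and not kept.get(field) and value:
                kept[field] = value
    return list(merged.values())
```

One known gap in this sketch: a record with a DOI and a copy of the same record without one will not be matched, so a real protocol would probably need a second pass on title and year.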

data dictionary: I’ve started an extremely rough draft of a data dictionary. As we get closer to combining screened data, we’ll want to think about standards for variables, etc. I’ve written some comments in (or attached to) cells, but the normalization doc covers much of the same ground. I’ll update the data dictionary as we make more decisions.
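As a rough illustration of how a data dictionary could double as a check on the combined data, here is a small sketch. The variable names, types, and allowed values (`record_id`, `year`, `screening_decision`) are hypothetical placeholders, not the ones in the draft dictionary.

```python
# Hypothetical data dictionary: variable name -> expected type and constraints.
DATA_DICTIONARY = {
    "record_id": {"type": str, "required": True},
    "year": {"type": int, "required": False},
    "screening_decision": {
        "type": str,
        "required": True,
        "allowed": {"include", "exclude", "maybe"},
    },
}


def check_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems for one row (empty if clean)."""
    problems = []
    for name, spec in dictionary.items():
        value = row.get(name)
        if value is None or value == "":
            if spec.get("required"):
                problems.append(f"{name}: missing required value")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(
                f"{name}: expected {spec['type'].__name__}, got {type(value).__name__}"
            )
            continue
        allowed = spec.get("allowed")
        if allowed and value not in allowed:
            problems.append(f"{name}: {value!r} not in allowed values")
    # Flag columns that nobody has documented yet.
    for name in row:
        if name not in dictionary:
            problems.append(f"{name}: not defined in the data dictionary")
    return problems
```

Running something like this over every row before a merge would surface undocumented columns and inconsistent codes early, while the decisions are still easy to change.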

In terms of packaging (and eventually publishing) our FAIR-ified data, I’m gravitating towards following the RO-Crate specification, using this tool. I’ll discuss with you both as we keep moving forward to see if this makes sense!
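For a sense of what an RO-Crate involves, here is a minimal sketch that builds the `ro-crate-metadata.json` descriptor by hand with Python's standard library, following the RO-Crate 1.1 layout (a metadata descriptor, a root `Dataset`, and one `File` entity per data file). It does not use the tool mentioned above; the function name and all arguments are assumptions, and any dataset name, license URL, or file path passed in would be our own choice.

```python
import json


def build_crate_metadata(name, description, license_url, date_published, files):
    """Build a minimal RO-Crate 1.1 metadata document as a plain dict.

    `files` is a list of (relative_path, human_readable_label) pairs.
    """
    file_entities = [
        {"@id": path, "@type": "File", "name": label} for path, label in files
    ]
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": {"@id": license_url},
        "datePublished": date_published,
        "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
    }
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *file_entities],
    }


def write_crate_metadata(metadata, path="ro-crate-metadata.json"):
    """Write the metadata dict next to the data files it describes."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
```

The appeal for us is that the crate is just the data folder plus this one JSON-LD file, so the data dictionary and normalization protocol could be listed as files in the same package.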