dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

OHI-Science #52

Open joehand opened 8 years ago

joehand commented 8 years ago

From @karissa on March 26, 2015 21:21

https://github.com/OHI-Science uses http://iucnredlist.org/

As public data is updated from year to year, people must update their scripts, work out what changed, and sometimes write very long scripts for tasks like removing duplicates.

They have a dataset of "truth" that they update every year. There's a column recording the year each row was modified, but no record of how it was modified (was it added?). Sometimes the data is changed without the 'modified' column being updated at all.

They built a script that pulls out rows with a modified year of 2013 or 2014 and then compares them: https://github.com/OHI-Science/ohiprep/blob/master/Global/NCEAS-SpeciesDiversity_v2014/ingest_iucn.R
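To illustrate the kind of comparison that script does by hand, here is a minimal Python sketch: select the rows stamped with each year, then diff them by key. The column names (`id`, `modified`, `value`) and the inline data are hypothetical; the real IUCN tables differ.

```python
import csv
import io

# Two hypothetical yearly snapshots of the same "truth" table.
CSV_2013 = """id,modified,value
1,2013,10
2,2013,20
"""

CSV_2014 = """id,modified,value
1,2014,15
3,2014,30
"""

def rows_by_id(text):
    """Index each row of a CSV snapshot by its id column."""
    return {r["id"]: r for r in csv.DictReader(io.StringIO(text))}

old, new = rows_by_id(CSV_2013), rows_by_id(CSV_2014)

added   = sorted(new.keys() - old.keys())   # rows only in the new year
removed = sorted(old.keys() - new.keys())   # rows that disappeared
changed = sorted(k for k in old.keys() & new.keys()
                 if old[k]["value"] != new[k]["value"])

print("added:", added)      # added: ['3']
print("removed:", removed)  # removed: ['2']
print("changed:", changed)  # changed: ['1']
```

This is exactly the bookkeeping that a versioned data tool could do automatically on import, without relying on a hand-maintained 'modified' column.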

With dat + a visual diffing tool, they wouldn't have to go through all of this trouble to find what changed. They could just import the data for the new year.

@jafflerbach @jules32

Copied from original issue: maxogden/dat#290

joehand commented 8 years ago

From @karissa on April 6, 2015 20:46

I'm going to walk through a theoretical case study set in the near future, inspired by this repository: https://github.com/OHI-Science/ohiprep

from the user's perspective

"I am a data scientist who primarily uses R to analyze data. I have the data checked into git, but it can take a long time to clone, push, and pull what's stored there. My colleagues are primarily domain experts, and they also use R. I end up having to write custom export scripts to get the data clean enough for everyone else to analyze. Others detached from our day-to-day operations might not use git, but will often want reports from our analysis."

dat

"I'd like to use dat because it sounds like it'll make my life easier when data updates. It'd be nice to track where the data has changed without intensive data engineering scripts."

current operation

"Right now, we have a git repository. Pushing and pulling can sometimes take a few minutes, depending on the connection. We've tried to organize the data as best we can by high-level subject area."

exploration

$ ls -a 
.gitignore .git/ Global/ iceland/ antarctica/ 

So, what happens when we go into one of these high level subject areas?

$ cd Global
$ ls 
HS_AQ_Pressures_2014/   
HS_AQ_Pressures_HD_SB_2014/
NCEAS-Fisheries_2014a/
NCEAS-Fisheries_2014b/
etc...

Each of these directories is essentially a dataset, but there may be multiple CSVs inside a folder. Right now dat only handles one table per named 'dataset'.
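One way to bridge that mismatch is to map each CSV file in a folder to its own named dataset. The sketch below only *prints* hypothetical `dat import` commands rather than running them, and the `--dataset` flag is an assumption about the CLI, not a confirmed interface.

```python
from pathlib import Path

def plan_imports(csv_files):
    """Build hypothetical `dat import` commands, one dataset per CSV file."""
    cmds = []
    for f in csv_files:
        name = Path(f).stem  # e.g. "pressures" from "pressures.csv"
        cmds.append(f"dat import {f} --dataset {name}")
    return cmds

# Hypothetical contents of a folder like HS_AQ_Pressures_2014/:
for cmd in plan_imports(["pressures.csv", "species.csv"]):
    print(cmd)
# dat import pressures.csv --dataset pressures
# dat import species.csv --dataset species
```

Whether dataset names should come from file names, folder names, or something user-supplied is exactly the open question below.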

https://github.com/OHI-Science/ohiprep/tree/master/Global/HS_AQ_Pressures_2014

$ cd HS_AQ_Pressures_2014/  
$ dat add <some data>

How do we handle complex cases like this?

joehand commented 8 years ago

From @karissa on April 6, 2015 21:31

See issue #295 for a proposal on how to handle this case.