Watts-Lab / commonsense-platform

The common sense platform, rate your common sense.
https://commonsense.seas.upenn.edu

Data drop automation #206

Open markwhiting opened 1 week ago

markwhiting commented 1 week ago

We want to drop data at a regular interval into the commonsense-data repo so that it can serve as a continuous data release and tie into our registration paradigm.

Requirements:

  1. clean, consistent, and logical naming, with scrubbed tables
  2. no PII — this data will be public from day 1.
  3. automated verified commits from
  4. human-readable files that can easily be diffed
  5. files less than 100MB each
  6. some protocol for deciding when to split into new files, e.g. every 1000 submissions or every day, whichever comes first.
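
As a starting point for discussing requirement 6, here is a minimal sketch of one possible split rule: start a new file for each calendar day, and within a day start another file after every 1000 submissions. The `submitted_at` field, the assumption that timestamps are already in UTC, and the file naming pattern are all hypothetical and not taken from the actual schema.

```python
from datetime import datetime, timedelta, timezone
from itertools import groupby

MAX_ROWS_PER_FILE = 1000  # threshold suggested in requirement 6


def chunk_submissions(submissions):
    """Yield (filename, rows) pairs.

    A new file starts on each calendar day, and again after every
    MAX_ROWS_PER_FILE rows within the same day.
    Assumes each submission is a dict with a datetime under the
    hypothetical key "submitted_at".
    """
    ordered = sorted(submissions, key=lambda s: s["submitted_at"])
    for day, rows in groupby(ordered, key=lambda s: s["submitted_at"].date()):
        rows = list(rows)
        for part, offset in enumerate(range(0, len(rows), MAX_ROWS_PER_FILE), start=1):
            name = f"submissions-{day.isoformat()}-part{part:03d}.csv"  # hypothetical naming
            yield name, rows[offset:offset + MAX_ROWS_PER_FILE]


# Demo data: 2500 submissions on one day, 3 on the next.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
demo = [{"submitted_at": start + timedelta(seconds=i)} for i in range(2500)]
demo += [{"submitted_at": start + timedelta(days=1, seconds=i)} for i in range(3)]
files = list(chunk_submissions(demo))
```

Keeping the date and part number in the file name means already published files never change once their day is over, which should keep diffs small.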

I'm interested in discussing the details here as we start setting it up.

markwhiting commented 1 week ago

This seems largely done, but it would be nice to think more about how to make the files a bit smaller, e.g. by moving things that are repeated a lot (e.g. experiment information) into their own table that is referenced by the relevant IDs.
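
To make the idea concrete, here is a rough sketch of pulling repeated experiment information out into its own table. Only `experimentID` comes from this thread; the other field names (`experimentName`, `experimentDescription`, `answer`) are made up for illustration and would need to match the real export.

```python
def split_experiments(rows, id_field="experimentID",
                      repeated_fields=("experimentName", "experimentDescription")):
    """Split flat rows into (experiments, slim_rows).

    experiments holds one record per distinct ID with the repeated fields;
    slim_rows keeps everything else plus the ID that references it.
    Field names other than experimentID are hypothetical.
    """
    experiments = {}
    slim_rows = []
    for row in rows:
        exp_id = row[id_field]
        # Store the repeated fields once, the first time this ID is seen.
        if exp_id not in experiments:
            experiments[exp_id] = {id_field: exp_id,
                                   **{f: row[f] for f in repeated_fields}}
        slim_rows.append({k: v for k, v in row.items() if k not in repeated_fields})
    return list(experiments.values()), slim_rows


flat = [
    {"experimentID": 1, "experimentName": "pilot", "experimentDescription": "first run", "answer": "yes"},
    {"experimentID": 1, "experimentName": "pilot", "experimentDescription": "first run", "answer": "no"},
    {"experimentID": 2, "experimentName": "main", "experimentDescription": "second run", "answer": "yes"},
]
experiments, answers = split_experiments(flat)
```

In this example the three flat rows become two experiment records and three slimmer answer rows, so the saving grows with the number of rows per experiment.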

One other point: it would be great if IDs were named consistently across the entire dataset, e.g. experimentID is always called experimentID.
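
One way this could be checked automatically before each drop is a small naming lint over the column names of every table. The sketch below is a heuristic only: it treats any column whose name ends in "id" (ignoring case and separators) as an ID column, and the table and column names in the demo are invented.

```python
import re
from collections import defaultdict


def inconsistent_id_columns(tables):
    """Return ID columns that are spelled differently across tables.

    tables maps a table name to its list of column names.
    Two columns count as the same ID if they match after lowercasing
    and dropping non-alphanumeric characters (a heuristic, not a rule).
    """
    spellings = defaultdict(set)
    for columns in tables.values():
        for col in columns:
            key = re.sub(r"[^a-z0-9]", "", col.lower())
            if key.endswith("id"):
                spellings[key].add(col)
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}


# Hypothetical tables: experimentID is spelled two ways, statementId only one.
demo_tables = {
    "answers": ["experimentID", "statementId", "answer"],
    "experiments": ["experiment_id", "title"],
    "statements": ["statementId", "text"],
}
problems = inconsistent_id_columns(demo_tables)
```

An empty result would mean every ID column has a single spelling across the dataset, so the automated commit could be blocked whenever `problems` is non-empty.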