Data4Democracy / drug-spending

Project to understand pharmaceutical spending, currently focused on US government programs.

Inventory gathered data and document relationships among sets #52

Open mattgawarecki opened 7 years ago

mattgawarecki commented 7 years ago

Status

The details of this issue are currently being discussed in the comments below. This issue may contain elements where development work is helpful, but is not primarily code-driven.

Task

We should review all the data we've gathered and document how, and by which fields, the various datasets are interconnected.

What we're looking for

To ensure it fits in with all our existing documentation, the result of work on this issue should go into a Markdown file in the /docs directory of the repo. This file should list out the following:

Optionally, it would be nice to have a graphical representation of how our datasets interconnect. This can be done programmatically, through the use of a graph visualization tool, or manually.
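As a rough illustration of the programmatic route, the sketch below pairs up datasets that share a column name and prints the result as Graphviz DOT text. The dataset names, column names, and the helper functions `shared_field_edges` and `to_dot` are all hypothetical placeholders, not the project's actual files or code. A shared column name is only a hint at a possible join key, so each link would still need checking by hand.

```python
from itertools import combinations

# Hypothetical inventory: dataset name -> column names. These are
# placeholders for illustration, not the project's actual files or fields.
datasets = {
    "dataset_a": ["drug_name", "year", "total_spending"],
    "dataset_b": ["drug_name", "manufacturer"],
    "dataset_c": ["manufacturer", "year", "state"],
}


def shared_field_edges(inventory):
    """Return (left, right, shared_columns) for every pair of datasets
    that has at least one column name in common."""
    edges = []
    for left, right in combinations(sorted(inventory), 2):
        shared = sorted(set(inventory[left]) & set(inventory[right]))
        if shared:
            edges.append((left, right, shared))
    return edges


def to_dot(edges):
    """Render the edges as Graphviz DOT text for an undirected graph,
    labelling each edge with the shared column names."""
    lines = ["graph datasets {"]
    for left, right, shared in edges:
        label = ", ".join(shared)
        lines.append(f'  "{left}" -- "{right}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


edges = shared_field_edges(datasets)
print(to_dot(edges))
```

The DOT output can be pasted into any Graphviz-compatible viewer, and the same edge list could feed whichever graph visualization tool the team settles on.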

How this will help

Knowing which data sets are related makes it much easier for people to think about what insights can be gathered from them. It also identifies gaps in our understanding of the data we have and shows us what we should try to collect in the future.

jenniferthompson commented 7 years ago

Definitely needed. Should this replace, be in addition to, or be combined with our current /datadictionaries README (which needs some updating)?

skirmer commented 7 years ago

I think this file we're compiling with the one-line descriptions should link to the respective data dictionaries.

Edit: I gotta learn to read. I think this might be more complicated than the README needs to be, because of the specific info about linkages we're trying to build. But I could be wrong!

jenniferthompson commented 7 years ago

@skirmer Yep, I agree that this is trying to do more. Just trying to figure out what the role of each (if we keep both) would be!

I could see:

or possibly

I currently have no opinions on which would be better! I like streamlined, so that would mean just having the README with everything we need, but that might be trying to do too much in one spot.

skirmer commented 7 years ago

As we were discussing in the Slack channel, and just to document it here: ggraph might be a good tool for illustrating the links between our datasets visually.

sharonbrener commented 7 years ago

Hey all! I know all of the data for this project lives on data.world, and that file descriptions and labels have been added to most files, but I wasn't sure if y'all knew that you can also add column descriptions to note key/joinable fields. That seems like it would be a great option for the second bullet point mentioned, and would keep that information living alongside the descriptions that already live on DW (which seems to fulfill the first bullet point).

From a column's info overlay, you can add a description (screenshot, 2017-02-27).

We're also actively working on a new view that compiles all dictionaries for a dataset into a single view (@jenniferthompson just user-tested our prototype of this on Friday, actually 🙌).

I'd be happy to answer any questions around current functionality, share a preview of what's coming soon, or chat about any other feature requests from this team that we should consider building as part of our data dictionary initiative. As you might imagine, proper documentation is very near and dear to our hearts at data.world!

jenniferthompson commented 7 years ago

Definitely agreed - thanks for making sure we know about it, @sharonbrener!

(And I'm really excited about this coming-soon feature, guys! It looks awesome.)

sharonbrener commented 7 years ago

This issue does highlight that along with our new data dictionary view, we should prioritize a way to export data dictionaries as MD files so they can be added to the repo as well. I'll bring that note back to our team.

darya-akimova commented 6 years ago

I'm officially reviving this issue, haha. I have some free time over the holidays and I'd like to spend it on creating a useful data inventory/data dictionary for all of the datasets collected.

darya-akimova commented 6 years ago

Submitted a pull request for data dictionary files that I created for most existing datasets on data.world.

Still to do: