edgi-govdata-archiving / ECHO-Cross-Program

Jupyter Notebooks for ECHO that use data from multiple EPA programs
https://colab.research.google.com/github/edgi-govdata-archiving/ECHO-Cross-Program/blob/master/ECHO-Cross-Programs.ipynb
GNU General Public License v3.0

Cell #12 should have a better presentation of the tabular data #68

Open shansen5 opened 3 years ago

Frijol commented 2 years ago

Curious what you have in mind here. Here's what I'm noticing:

skybristol commented 2 years ago

What I'm working on in the new ETL process (pulling data from EPA's downloads, transforming it a little for use, and loading it elsewhere, e.g. Postgres) should help with this. Each of the datasets we are tapping is documented slightly differently across web pages and PDF files, and I'm working through those variations to bring back the full descriptions of field names. I'm putting this into a technical encoding called JSONSchema, which includes some extra technical details about the properties that will let us better validate the data values when we pull a fresh file. It seems like we should review all of EPA's data documentation and then decide whether we have value-added annotations we could layer on to help people make better sense of the data. A lot of things, like the various codes that tie records together, are hard to figure out without putting several pieces of information together, so we can probably shed some light on them.
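To make the idea concrete, here is a minimal sketch of how a schema that carries field descriptions could also check values in a freshly pulled file. Everything in it is an assumption for illustration: the field names, descriptions, constraints, and `@id` strings are hypothetical rather than taken from EPA documentation, and the hand-rolled `validate_record` helper only covers a tiny subset of what a real JSONSchema validator would do.

```python
# Hypothetical schema fragment. Field names, descriptions, constraints, and
# @id values are illustrative only, not copied from EPA documentation.
FACILITY_SCHEMA = {
    "type": "object",
    "required": ["REGISTRY_ID"],
    "properties": {
        "REGISTRY_ID": {
            "@id": "facility/registry_id",
            "type": "string",
            "description": "Identifier that ties a facility together across programs.",
        },
        "FAC_STATE": {
            "@id": "facility/state",
            "type": "string",
            "description": "Two-letter state abbreviation.",
            "maxLength": 2,
        },
        "FAC_LAT": {
            "@id": "facility/latitude",
            "type": "number",
            "description": "Latitude in decimal degrees.",
            "minimum": -90,
            "maximum": 90,
        },
    },
}

# Map the schema's type names onto Python types (small subset only).
_PY_TYPES = {"string": str, "number": (int, float)}


def validate_record(record, schema):
    """Return a list of problems found in one row; an empty list means it passed."""
    problems = []
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"{name}: missing required field")
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"{name}: not described in the schema")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["type"]]):
            problems.append(f"{name}: expected {spec['type']}")
            continue
        if "maxLength" in spec and len(value) > spec["maxLength"]:
            problems.append(f"{name}: longer than {spec['maxLength']} characters")
        if "minimum" in spec and value < spec["minimum"]:
            problems.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            problems.append(f"{name}: above maximum {spec['maximum']}")
    return problems


good_row = {"REGISTRY_ID": "110000000001", "FAC_STATE": "NY", "FAC_LAT": 42.9}
bad_row = {"FAC_STATE": "New York", "FAC_LAT": 142.9}
print(validate_record(good_row, FACILITY_SCHEMA))
print(validate_record(bad_row, FACILITY_SCHEMA))
```

The same `description` strings could feed column headers or tooltips in the notebook, so the table in cell #12 would not have to show bare field codes.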

Technically, each distinct logical property in our transformation of the data will have an @id value in the JSONSchema structure. This will help drive things like primary and foreign key relationships across the data in a SQL context like Postgres. I think it would be cool to layer extra annotations on top of this structure, keyed to the @id values. We might use the simplicity of yet another Google Sheet to store and manage this information and then pull it into the various presentations of the data we put online. If we end up with more complex information than fits in a few sentences, we can look at pointing to Markdown files. It would be good to keep the dots connected between the source documentation and our own additions.
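A rough sketch of the annotation layer, assuming the sheet is exported as CSV with two hypothetical columns, `id` and `annotation`. The column names, the `@id` strings, the field names, and the helper functions are all made up for illustration; the inline CSV text stands in for whatever the sheet export would actually return.

```python
import csv
import io

# Stand-in for a CSV export of the annotation sheet (hypothetical layout).
SHEET_CSV = """id,annotation
facility/registry_id,Use this value to join facility records across programs.
"""

# Hypothetical slice of the schema: source descriptions keyed by column name.
SCHEMA_PROPERTIES = {
    "REGISTRY_ID": {
        "@id": "facility/registry_id",
        "description": "Identifier assigned to the facility.",
    },
    "FAC_NAME": {
        "@id": "facility/name",
        "description": "Facility name.",
    },
}


def load_annotations(csv_text):
    """Read the sheet export into a dict mapping @id -> annotation text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: row["annotation"] for row in reader}


def describe_columns(properties, annotations):
    """Combine each source description with our annotation, when one exists."""
    described = {}
    for column, spec in properties.items():
        text = spec["description"]
        note = annotations.get(spec["@id"])
        if note:
            text = f"{text} Note: {note}"
        described[column] = text
    return described


annotations = load_annotations(SHEET_CSV)
for column, text in describe_columns(SCHEMA_PROPERTIES, annotations).items():
    print(f"{column}: {text}")
```

Keeping the source description and our note as separate strings joined only at display time means the EPA wording stays traceable to its origin, and our additions can be edited in the sheet without touching the schema.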