codeforamerica / project-ideas

A place to collect ideas for CfA health projects

Quality measures for open datasets in data portals #48

Open hampelm opened 9 years ago

hampelm commented 9 years ago

Open datasets often have serious data quality problems that only surface once you start to use them. It would be so nice to have simple data quality measures available at a per-column level that would tell you if there is missing or malformed data, huge outliers, non-unique IDs, or other potential issues. Here's a quick prototype for Chicago:

[Screenshot of the prototype, 2014-12-12]
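The per-column measures described above (missing values, malformed values, non-unique IDs, suspicious extremes) could be sketched roughly as follows. This is only a minimal illustration, not the code behind the prototype: the `column_report` helper, the sample rows, and the choice to treat "not parseable as a number" as a malformed-value signal are all assumptions made for the example.

```python
import csv
import io
from statistics import median

def column_report(rows, column):
    """Summarize basic quality signals for one column of a list of dict rows."""
    # Treat None (short rows) and whitespace-only cells as missing.
    values = [(row.get(column) or "").strip() for row in rows]
    non_empty = [v for v in values if v != ""]

    # Collect the values that parse as numbers; the rest count as non-numeric.
    numeric = []
    for v in non_empty:
        try:
            numeric.append(float(v))
        except ValueError:
            pass

    report = {
        "total": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "non_numeric": len(non_empty) - len(numeric),
    }
    if numeric:
        # A max far away from the median is a cheap hint that outliers exist.
        report.update(min=min(numeric), max=max(numeric), median=median(numeric))
    return report

# Hypothetical sample data: a duplicated id, a blank ward, a malformed fine.
SAMPLE = """id,ward,fine
1,12,50
2,,75
2,7,abc
4,3,90000
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for name in rows[0]:
    print(name, column_report(rows, name))
```

For the `id` column, `distinct` being smaller than `total` is the signal that the IDs are not unique.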

Mr0grog commented 9 years ago

Is that just a mockup or is there running code powering it?

What are some easy measures to start with?

Are there others?

Couple other crazy ones:

Mr0grog commented 9 years ago

Couple others:

Not columnar, but is there a unique column that identifies each row?
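One rough way to answer the "is there a unique column?" question, assuming the rows have already been loaded as dictionaries of strings: look for columns whose values are all present and all different. The `candidate_key_columns` name and the sample rows are hypothetical.

```python
def candidate_key_columns(rows):
    """Return the columns whose values are non-empty and unique across all rows."""
    if not rows:
        return []
    keys = []
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        has_blanks = any(v is None or v == "" for v in values)
        # A usable row identifier has no blanks and no repeated values.
        if not has_blanks and len(set(values)) == len(values):
            keys.append(column)
    return keys

# Hypothetical sample: "id" is unique, "name" repeats, "code" has a blank.
sample_rows = [
    {"id": "1", "name": "a", "code": "x"},
    {"id": "2", "name": "a", "code": ""},
    {"id": "3", "name": "b", "code": "z"},
]
print(candidate_key_columns(sample_rows))
```

A column that passes on a sample is only a candidate; it would still need to be checked against the full dataset.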

hampelm commented 9 years ago

No running code behind this; just a kinda broken HTML playground.

In general, it looks like datasets on Socrata need to be pumped into something like csvstat to get those overview stats you listed:

The SODA API can't do a distinct query yet (July 2014) so that info needs to come externally.

You can get a type from SODA but if it's a number stored as a string, that won't help. That one is on the dataset maintainer to check.

SODA can do avg, min, max, sum, and count on groups, but I don't know if the 50,000 record query limit applies to those.
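For the aggregates SODA does support, a query URL could be assembled along these lines. This is a sketch that only builds the URL and makes no request. It assumes the usual `/resource/<dataset id>.json` endpoint shape and the SoQL `$select` parameter; the dataset ID `xxxx-xxxx` and the column name `fine` are placeholders, not real identifiers.

```python
from urllib.parse import urlencode

AGGREGATES = ("min", "max", "avg", "sum", "count")

def column_stats_url(domain, dataset_id, column):
    """Build a SODA URL asking for the basic aggregates of one column."""
    # e.g. "min(fine) AS min_value, max(fine) AS max_value, ..."
    select = ", ".join(f"{fn}({column}) AS {fn}_value" for fn in AGGREGATES)
    query = urlencode({"$select": select})
    return f"https://{domain}/resource/{dataset_id}.json?{query}"

# Placeholder dataset ID and column name, for illustration only.
print(column_stats_url("data.cityofchicago.org", "xxxx-xxxx", "fine"))
```

Because the aggregation happens on the server, the response should be a single row regardless of how large the dataset is.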

marks commented 9 years ago

I love this idea and it's something I personally would like to see built into Socrata in the future (I work for Socrata and have advocated for it). It would be great if something could be built to take a Socrata URL and return statistics/quality "scores" like the ones discussed above.

To my knowledge, SODA will, indeed, need to be used to slurp the data into a secondary data store for the analysis to be performed, and then the results displayed to the user on the web and/or via API.

I might try to start hacking on this idea this weekend/over the holidays. If anyone else does too, please update this thread.

marks commented 9 years ago

@hampelm with regard to your question about the 50,000 limit, the limit applies to rows returned and not rows analyzed. So you can do a GROUP BY on an 800,000-row dataset that results in 40,000 rows and you should be fine.
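A grouped count per value is the kind of query this applies to. The sketch below builds such a URL and adds a small check for whether a response might have been cut off by the limit. The helper names, the `xxxx-xxxx` dataset ID, and the `ward` column are hypothetical; the 50,000 figure is the limit quoted in this thread.

```python
from urllib.parse import urlencode

ROW_LIMIT = 50000  # per-query limit on returned rows, as quoted in this thread

def value_counts_url(domain, dataset_id, column, limit=ROW_LIMIT):
    """Build a SODA URL that counts rows per distinct value of a column."""
    params = {
        "$select": f"{column}, count(*) AS row_count",
        "$group": column,
        "$limit": limit,
    }
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"

def may_be_truncated(result_rows, limit=ROW_LIMIT):
    """True when the response is as long as the limit, so groups may be missing."""
    return len(result_rows) >= limit

# Placeholder dataset ID and column name, for illustration only.
print(value_counts_url("data.cityofchicago.org", "xxxx-xxxx", "ward"))
```

With 40,000 groups returned, `may_be_truncated` is false and the counts can be taken as complete; at exactly 50,000 rows the result would need paging or a narrower query.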

Mr0grog commented 9 years ago

> You can get a type from SODA but if it's a number stored as a string, that won't help. That one is on the dataset maintainer to check.

Yeah, that’s why I started using the “is/appears to be” language ;)

Also another good one with dates: do they seem reasonable? For example, https://github.com/datamade/open-data-quality-control/issues/7 (Obviously if a city were to load up some historical data, this is harder, but I think that’s often unlikely.)
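A date sanity check along those lines might look like the sketch below. It assumes dates arrive as ISO-style strings beginning with `YYYY-MM-DD`; the `implausible_dates` helper and the 1990 cutoff are arbitrary choices for the example, and a city that publishes historical records would want an earlier bound.

```python
from datetime import date

def implausible_dates(values, earliest=date(1990, 1, 1), latest=None):
    """Return (value, reason) pairs for dates that fall outside a plausible range."""
    latest = latest or date.today()
    flagged = []
    for value in values:
        try:
            # Keep only the YYYY-MM-DD prefix so timestamps also parse.
            parsed = date.fromisoformat(value[:10])
        except ValueError:
            flagged.append((value, "unparseable"))
            continue
        if parsed < earliest:
            flagged.append((value, "too early"))
        elif parsed > latest:
            flagged.append((value, "in the future"))
    return flagged

# Hypothetical values: one fine, two likely typos, one non-date.
sample_dates = ["2014-12-12T00:00:00", "0214-12-12", "2104-01-01", "n/a"]
print(implausible_dates(sample_dates, latest=date(2015, 1, 1)))
```

Flagged values are only candidates for review, since an unusual date can still be correct.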