Quality measures for open datasets in data portals

codeforamerica / project-ideas

A place to collect ideas for CfA health projects

Quality measures for open datasets in data portals #48

Open hampelm opened 9 years ago

hampelm commented 9 years ago

Open datasets often have serious data quality problems that only surface once you start to use them. It would be so nice to have simple data quality measures available at a per-column level that would tell you if there is missing or malformed data, huge outliers, non-unique IDs, or other potential issues. Here's a quick prototype for Chicago:

[Screenshot of the prototype, taken 2014-12-12]

Mr0grog commented 9 years ago

Is that just a mockup or is there running code powering it?

What are some easy measures to start with?

How many rows are populated
Are values unique
How many unique values/duplicate values
Are all values of the same type
Basic stats for numeric/date columns (maybe less easy)
- Range
- Median
- Mean
- Std. dev.
- Outliers
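A minimal sketch of the easy measures listed above, using only the Python standard library. It assumes a column arrives as a list of raw strings (as it would from a CSV export); `profile_column` and the keys of its result are hypothetical names, and the 1.5 × IQR outlier rule is just one common convention, not something this thread settled on.

```python
import math
import statistics
from collections import Counter


def profile_column(values):
    """Return simple quality measures for one column of raw string values."""
    # Treat None and whitespace-only strings as unpopulated.
    populated = [v for v in values if v is not None and v.strip() != ""]
    counts = Counter(populated)

    # Collect the values that parse as finite numbers.
    numbers = []
    for v in populated:
        try:
            number = float(v)
        except ValueError:
            continue
        if math.isfinite(number):
            numbers.append(number)

    profile = {
        "rows": len(values),
        "populated": len(populated),
        "unique_values": len(counts),
        "duplicated_values": sum(1 for c in counts.values() if c > 1),
        "all_unique": len(counts) == len(populated),
        "all_numeric": bool(populated) and len(numbers) == len(populated),
    }

    if profile["all_numeric"]:
        profile["min"] = min(numbers)
        profile["max"] = max(numbers)
        profile["mean"] = statistics.mean(numbers)
        profile["median"] = statistics.median(numbers)
        profile["stdev"] = statistics.stdev(numbers) if len(numbers) > 1 else 0.0
        if len(numbers) >= 2:
            # Assumed outlier rule: more than 1.5 * IQR outside the quartiles.
            q1, _, q3 = statistics.quantiles(numbers, n=4)
            spread = 1.5 * (q3 - q1)
            profile["outliers"] = sorted(
                n for n in numbers if n < q1 - spread or n > q3 + spread
            )
        else:
            profile["outliers"] = []
    return profile
```

Calling `profile_column` once per column of a downloaded dataset would give the per-column summary the prototype is after; date columns would need their own parsing step first.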

Are there others?

Couple other crazy ones:

For lat/longs
- Are they within the geographic boundaries of the city/admin area the data portal covers
- Or at least are they not (0,0), which is almost certainly an error
Can addresses be geocoded, and do the results fit the lat/long criteria above
If a column should/could match up to one in another table, does it (i.e., is it foreign-key-able)
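The lat/long checks could look roughly like this. It is only a sketch: `check_point` and its status strings are made-up names, and it tests against a rectangular bounding box, which is coarser than a real city boundary polygon. The Chicago box below is a rough illustrative guess, not an official extent.

```python
def check_point(lat, lon, bbox):
    """Classify one coordinate pair.

    bbox is (min_lat, min_lon, max_lat, max_lon) for the area the portal covers.
    """
    if lat is None or lon is None:
        return "missing"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "invalid"
    if lat == 0 and lon == 0:
        # (0, 0) almost always means a failed geocode or an empty field.
        return "null_island"
    min_lat, min_lon, max_lat, max_lon = bbox
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        return "outside_bbox"
    return "ok"


# Assumption: a rough box around Chicago, for illustration only.
CHICAGO_BBOX = (41.6, -87.95, 42.05, -87.5)
```

Geocoded addresses could be run through the same function once they have been turned into coordinates.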

Mr0grog commented 9 years ago

Couple others:

White space at start/end of value
Values that differ only by case/spaces/punctuation
For particular fields, does data follow a given format (detect a format or maybe match a predefined format?)
How recently was the data set updated?
If we can get the data, is the frequency of updates regular?
If a field is/appears to be a date, are all rows formatted the same? (e.g. Unix timestamp, ISO8601, something else, TZ included if not a Unix timestamp, is TZ same for all rows)
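For the first two items in this list, a rough sketch: flag values with leading or trailing whitespace, and group values that become identical once case, spaces, and punctuation are stripped. The function names are hypothetical, and the normalization rule (casefold, then drop every non-alphanumeric character) is an assumption that will be too aggressive for some columns.

```python
import re
from collections import defaultdict


def has_edge_whitespace(value):
    """True if the value starts or ends with whitespace."""
    return value != value.strip()


def normalize(value):
    """Lowercase the value and drop spaces, punctuation, and underscores."""
    return re.sub(r"[\W_]+", "", value.casefold())


def near_duplicate_groups(values):
    """Return groups of distinct values that differ only by case/spaces/punctuation."""
    groups = defaultdict(set)
    for value in values:
        key = normalize(value)
        if key:  # skip values that are nothing but punctuation or whitespace
            groups[key].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For example, `"Main St."`, `"main st"`, and `" MAIN  ST"` would land in one group, while `"Oak Ave"` would not be reported.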

Not columnar, but is there a unique column that identifies each row?
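One way to answer that question, sketched with the standard library: a column is a candidate row identifier if every row has a value and no value repeats. `candidate_key_columns` is a hypothetical helper, and it assumes rows are dicts such as those produced by `csv.DictReader`.

```python
import csv
import io


def candidate_key_columns(rows):
    """Return names of columns whose values are all populated and distinct."""
    rows = list(rows)
    if not rows:
        return []
    keys = []
    for column in rows[0].keys():
        values = [(row.get(column) or "").strip() for row in rows]
        if all(values) and len(set(values)) == len(values):
            keys.append(column)
    return keys


# Made-up sample data: only "id" is both complete and unique.
SAMPLE_CSV = "id,ward,name\n1,5,a\n2,5,b\n3,7,\n"
SAMPLE_KEYS = candidate_key_columns(csv.DictReader(io.StringIO(SAMPLE_CSV)))
```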

hampelm commented 9 years ago

No running code behind this; just a kinda broken HTML playground.

In general, it looks like datasets on Socrata need to be pumped into something like csvstat to get those overview stats you listed:

The SODA API can't do a distinct query yet (July 2014) so that info needs to come externally.

You can get a type from SODA but if it's a number stored as a string, that won't help. That one is on the dataset maintainer to check.

SODA can do avg, min, max, sum, and count on groups, but I don't know if the 50,000 record query limit applies to those.

marks commented 9 years ago

I love this idea and it's something I personally would like to see built into Socrata in the future (I work for Socrata and have advocated for it). It would be great if something could be built to take a Socrata URL and return statistics/quality "scores" like the ones discussed above.

To my knowledge, SODA will, indeed, need to be used to slurp the data into a secondary data store for the analysis to be performed, and then the results displayed to the user on the web and/or via API.

I might try to start hacking on this idea this weekend/over the holidays. If anyone else does too, please update this thread.

marks commented 9 years ago

@hampelm with regard to your question about the 50,000 limit, the limit applies to rows returned and not rows analyzed. So you can do a GROUP BY on an 800,000-row data set that results in 40,000 rows and you should be fine.
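As a sketch of what such a grouped query could look like, the helper below only builds the request URL and does not call the API. The domain, dataset ID, and column names are placeholders, `soda_aggregate_url` is a hypothetical name, and the default `$limit` simply mirrors the 50,000 figure quoted in this thread, which may not match current limits.

```python
from urllib.parse import urlencode


def soda_aggregate_url(domain, dataset_id, group_column, value_column, limit=50000):
    """Build a SODA URL that aggregates value_column per group_column."""
    params = {
        "$select": (
            f"{group_column}, count(*) AS n, "
            f"min({value_column}) AS lo, max({value_column}) AS hi, "
            f"avg({value_column}) AS mean"
        ),
        "$group": group_column,
        # Assumption: the row cap discussed above; it bounds rows returned.
        "$limit": limit,
    }
    # Keep "$", parentheses, commas, and "*" unescaped so the URL stays readable.
    query = urlencode(params, safe="$(),*")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


# Placeholder domain, dataset ID, and column names, for illustration only.
EXAMPLE_URL = soda_aggregate_url("data.example.org", "abcd-1234", "ward", "amount")
```

Because the server does the grouping, the response has one row per distinct `ward` value regardless of how many rows the dataset holds.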

Mr0grog commented 9 years ago

> You can get a type from SODA but if it's a number stored as a string, that won't help. That one is on the dataset maintainer to check.

Yeah, that’s why I started using the “is/appears to be” language ;)

Also another good one with dates: do they seem reasonable? For example, https://github.com/datamade/open-data-quality-control/issues/7 (Obviously if a city were to load up some historical data, this is harder, but I think that’s often unlikely.)
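A quick sketch of a "do the dates seem reasonable" check. It assumes values are ISO 8601 strings and looks only at the leading `YYYY-MM-DD` part; the 1900 floor and the "not in the future" ceiling are arbitrary assumptions that a portal with historical or scheduled data would need to loosen. `implausible_dates` is a hypothetical name.

```python
from datetime import date


def implausible_dates(values, earliest=date(1900, 1, 1), latest=None):
    """Return (raw_value, reason) pairs for dates that look wrong."""
    latest = latest or date.today()
    flagged = []
    for raw in values:
        try:
            # Only the date part is inspected; any time or TZ suffix is ignored.
            parsed = date.fromisoformat(raw.strip()[:10])
        except ValueError:
            flagged.append((raw, "unparseable"))
            continue
        if parsed < earliest:
            flagged.append((raw, "too_early"))
        elif parsed > latest:
            flagged.append((raw, "in_future"))
    return flagged
```

A typo such as `0214-12-12` or `2914-12-12` would be flagged, while `2014-12-12T18:52:59` would pass.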