Chicago / open-data-etl-utility-kit

Use Pentaho's open source data integration tool (Kettle) to create Extract-Transform-Load (ETL) processes to update a Socrata open data portal. Documentation is available at http://open-data-etl-utility-kit.readthedocs.io/en/stable
Other
95 stars 30 forks source link

Automated way to set good column widths for new datasets #27

Open levyj opened 9 years ago

levyj commented 9 years ago

New datasets often have column widths that are too narrow or too wide. It is possible to get and change the widths, using the views API (e.g., https://data.cityofchicago.org/api/views/ydr8-5enu). This raises the possibility of creating a tool to set widths for a new dataset.

Setting the widths probably is not that hard. The more challenging part of this idea would be figuring out what the widths should be. This is a function both of the column name and the longest value, as well as some subjective preferences about whether either needs to be fully displayed or if more columns, but truncated, are a better use of screen width. However, any solution need not be perfect. If it does a pretty good job so that it cuts the manual adjustments necessary, that is still of value.

Our best (really crude) formula so far is:

Width = (4.7 * Number of Characters in Column Name) + 127

We are happy to share more information with anyone who cares to take on this project.

shua123 commented 9 years ago

This is a good idea. Thinking a bit out loud about how to do this... Finding existing widths is easy through the views api. I am not sure what publishing API/method would allow you to set the width.

Without digging into the code or functionality, I wonder if a modified version of DataSync's Port Job might be a path to go. I am not sure if it copies width when it copies schema and metadata.

levyj commented 9 years ago

You can modify the JSON and PUT it back to the same endpoint. It's not 100% encouraged but Socrata knows we do it. That is actually how https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q and https://data.cityofchicago.org/Service-Requests/Potholes-Patched-Last-Seven-Days/xpdx-8ivx get updated every day.

Note that for changes to datasets (as opposed to the two links above which are views of datasets), you have to PUT to a working copy.

Automation aside, editing the JSON manually can be a much faster way of mass-editing things like column descriptions than using the Web interface.

As far as DataSync or some other Socrata tool, that would be great if Socrata wanted to do it. I think Port Jobs do now copy the width but I am not sure.

shua123 commented 9 years ago

Interesting. Do you do that through a Kettle job?

levyj commented 9 years ago

Yes.

tomschenkjr commented 9 years ago

Maybe @chrismetcalf has thoughts on this idea