datamade / wopr-data

Deprecated: Scripts for creating and updating datasets for plenario
MIT License

Start collecting data about the data #2

Closed (evz closed 10 years ago)

evz commented 10 years ago

I'm thinking it would be pretty valuable to have a brief description and a bit of metadata for each dataset that we are offering. At least for the Socrata datasets, we can probably do pretty well by just grabbing the description here:

screenshot 2014-01-03 08 57 55: https://f.cloud.github.com/assets/551491/1839901/761ee6bc-7487-11e3-91d9-69027902bb73.png

Not sure if there's a simpler way of getting that other than just scraping it off that page. Other than that, @derekeder mentions a few other datapoints that we are hoping to either infer from what we have collected or collect separately over here: https://github.com/datamade/wopr-api/issues/1

brettjgoldstein commented 10 years ago

Is it not available via the API? If it isn't, I could have Socrata add it to the track.


evz commented 10 years ago

@brettjgoldstein I seem to remember it being there if you download the whole JSON blob, but that's a bit expensive since you'd have to load the whole thing (1GB+ for the crime data) into memory just to get that one bit of info out. Would be cool to be able to fetch at least the stuff on, for example, this page: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/about without having to scrape it. Scraping is an option, I suppose, but it just gives us more code to maintain.

evz commented 10 years ago

Turns out that this type of info is indeed available, not only for the entire dataset but also for individual columns, if you download the entire thing as JSON. Example:

"meta" : {
    "view" : {
      "id" : "ijzp-q8t2",
      "name" : "Crimes - 2001 to present",
      "attribution" : "Chicago Police Department",
      "attributionLink" : "https://portal.chicagopolice.org/portal/page/portal/ClearPath",
      "averageRating" : 0,
      "category" : "Public Safety",
      "createdAt" : 1317405571,
      "description" : "This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present,  ....<snip>...",
      "displayType" : "table",
      "downloadCount" : 24738,
      "indexUpdatedAt" : 1387606236,
      "newBackend" : false,
      "numberOfComments" : 0,
      "oid" : 6707502,
      "publicationAppendEnabled" : false,
      "publicationDate" : 1386275591,
      "publicationGroup" : 239997,
      "publicationStage" : "published",
      "resourceName" : "crimes",
      "rowClass" : "",
      "rowIdentifierColumnId" : 120515105,
      "rowsUpdatedAt" : 1389093062,
      "rowsUpdatedBy" : "scy9-9wg4",
      "signed" : false,
      "state" : "normal",
      "tableId" : 1388089,
      "totalTimesRated" : 0,
      "viewCount" : 62633,
      "viewLastModified" : 1386376229,
      "viewType" : "tabular",
      "columns" : [ {
        "id" : -1,
        "name" : "sid",
        "dataTypeName" : "meta_data",
        "fieldName" : ":sid",
        "position" : 0,
        "renderTypeName" : "meta_data",

    ... etc, etc ...

Interestingly, there is also info about the individual columns:

{
        "id" : 120515105,
        "name" : "ID",
        "dataTypeName" : "number",
        "fieldName" : "id",
        "position" : 1,
        "renderTypeName" : "number",
        "tableColumnId" : 2154841,
        "width" : 100,
        "cachedContents" : {
          "non_null" : 5428422,
          "smallest" : "634",
          "sum" : "28153429827217",
          "null" : 0,
          "average" : "5186300.885822252",
          "largest" : "9449944",
          "top" : [ { ...listing of the top 20 values... } ]
        },
        "format" : {
          "precisionStyle" : "standard",
          "align" : "right",
          "noCommas" : "true"
        }

    ... etc, etc ...

If we could get just this info without the whole dataset through an endpoint someplace, that would make it pretty simple to dynamically create the tables in the database.
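A minimal sketch of what "dynamically create the tables" could look like, assuming the column metadata has the shape shown in the excerpts above. The function name, the type mapping, the PostgreSQL-style output, and the choice to skip system columns (field names starting with ":") are all assumptions for illustration, not anything taken from the wopr-data code. Only "number" and "meta_data" appear as dataTypeName values in the excerpts; the other entries in the mapping are guesses.

```python
import re

# Assumed mapping from Socrata dataTypeName to SQL column types.
# Only "number" is confirmed by the excerpt; the rest are guesses.
TYPE_MAP = {
    "number": "NUMERIC",
    "text": "TEXT",
    "calendar_date": "TIMESTAMP",
    "checkbox": "BOOLEAN",
}

# Conservative identifier check so metadata is never pasted into SQL blindly.
IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def create_table_sql(table_name, columns):
    """Build a CREATE TABLE statement from a list of column metadata dicts."""
    if not IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    defs = []
    for col in sorted(columns, key=lambda c: c.get("position", 0)):
        field = col["fieldName"]
        # System columns such as ":sid" look like bookkeeping, so skip them.
        if field.startswith(":"):
            continue
        if not IDENTIFIER.fullmatch(field):
            raise ValueError(f"unexpected field name: {field!r}")
        # Unknown types fall back to TEXT rather than failing.
        sql_type = TYPE_MAP.get(col.get("dataTypeName"), "TEXT")
        defs.append(f'    "{field}" {sql_type}')
    return f'CREATE TABLE "{table_name}" (\n' + ",\n".join(defs) + "\n);"


# Two columns copied from the excerpts above (trimmed to the fields used here).
columns = [
    {"name": "sid", "dataTypeName": "meta_data", "fieldName": ":sid", "position": 0},
    {"name": "ID", "dataTypeName": "number", "fieldName": "id", "position": 1},
]
print(create_table_sql("crimes", columns))
```

The table name here comes from the "resourceName" value in the excerpt. Falling back to TEXT for unknown types is a deliberately lossy choice that keeps table creation from failing on a type the mapping has not seen yet.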

derekeder commented 10 years ago

Talked to Chris Metcalf at Socrata. The data endpoint we seek is the views page.

Example: https://data.cityofchicago.org/api/views/ijzp-q8t2.json (Crimes - 2001 to present)
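A rough sketch of pulling just the metadata from that endpoint, using only the Python standard library. The helper names are hypothetical, and it assumes the views response is the same object that sits under "meta" → "view" in the full JSON download, and that the timestamp fields are Unix epoch seconds. Neither assumption is confirmed in this thread.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def views_url(domain, dataset_id):
    """Build the metadata URL, following the pattern of the example above."""
    return f"https://{domain}/api/views/{dataset_id}.json"


def fetch_view(domain, dataset_id, timeout=30):
    """Download and parse the metadata document (needs network access)."""
    with urlopen(views_url(domain, dataset_id), timeout=timeout) as resp:
        return json.load(resp)


def summarize_view(view):
    """Pick out a few dataset-level fields seen in the JSON excerpt earlier."""
    updated = view.get("rowsUpdatedAt")
    return {
        "name": view.get("name"),
        "description": view.get("description"),
        "attribution": view.get("attribution"),
        # Assumes the value is Unix epoch seconds, as the excerpt suggests.
        "rows_updated_at": (
            datetime.fromtimestamp(updated, tz=timezone.utc)
            if updated is not None
            else None
        ),
        "columns": [c["fieldName"] for c in view.get("columns", [])],
    }
```

Calling `summarize_view(fetch_view("data.cityofchicago.org", "ijzp-q8t2"))` would fetch the metadata for the crimes dataset without loading any rows, which avoids the 1GB+ download mentioned earlier.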

brettjgoldstein commented 10 years ago

great


derekeder commented 10 years ago

Closing as this is now under https://github.com/datamade/wopr-ops/issues/15