ckan / ideas

[DEPRECATED] Use the main CKAN repo Discussions instead:
https://github.com/ckan/ckan/discussions

Add descriptive statistics to data dictionary #196

Open jqnatividad opened 7 years ago

jqnatividad commented 7 years ago

Woohoo! The semi-auto, resource-level data dictionary is here! https://github.com/ckan/ckan/pull/3414

We should consider expanding the data dictionary with more descriptive statistics by using a tool like https://csvkit.readthedocs.io.

Perhaps, as a datapusher "post-processor", we can call csvstat asynchronously to compile descriptive statistics.
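
To make the idea concrete, here is a rough sketch of what such a post-processor could look like. It is only an illustration under assumptions: datapusher does not necessarily expose a hook like this, and the names `run_stats`, `submit_stats_job`, `on_done`, and the `command` parameter are hypothetical. The only external piece it relies on is the `csvstat` command-line tool from csvkit being on the PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# One small shared pool, so a slow scan never blocks the upload itself.
_executor = ThreadPoolExecutor(max_workers=2)


def run_stats(csv_path, command=("csvstat",), timeout=600):
    """Run the stats command on csv_path and return its text report."""
    result = subprocess.run(
        [*command, csv_path],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


def submit_stats_job(csv_path, on_done, command=("csvstat",)):
    """Hypothetical hook: schedule the scan and return immediately."""
    future = _executor.submit(run_stats, csv_path, command)
    # on_done receives the finished Future; call .result() on it to get
    # the report text (or to re-raise whatever went wrong in the worker).
    future.add_done_callback(on_done)
    return future
```

The caller would pass a callback that stores the report alongside the resource's data dictionary. Keeping `command` as a parameter is a design choice for this sketch, so the stats tool can be swapped or stubbed out in tests.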

For example, running csvstat on this sample file from data.boston.gov (rptcityscoresummary.csv.zip, taken Mar 28) produced the report below (rptcityscoresummary-descstats.txt). It took a minute on a 2016 MacBook to scan the file and create the report.

  1. "CTY_SCR_NAME"

    Type of data:          Text
    Contains null values:  False
    Unique values:         22
    Longest value:         34 characters
    Most common values:    CITY SERVICES SATISFACTION SURVEYS (313x)
                           GRAFFITI ON-TIME % (313x)
                           MISSED TRASH ON-TIME % (313x)
                           PARKS MAINTENANCE ON-TIME % (313x)
                           POTHOLE ON-TIME % (313x)

  2. "CTY_SCR_NBR_DY_01"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         1673
    Smallest value:        0
    Largest value:         11,854
    Sum:                   2,662,821.802
    Mean:                  445.586
    Median:                1
    StDev:                 1,913.897
    Most common values:    1 (1100x)
                           None (620x)
                           0 (523x)
                           2 (113x)
                           0.667 (79x)

  3. "CTY_SCR_NBR_DY_02"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         492
    Smallest value:        1
    Largest value:         3,295
    Sum:                   463,537
    Mean:                  122.499
    Median:                16
    StDev:                 301.434
    Most common values:    None (2812x)
                           4 (157x)
                           6 (152x)
                           5 (145x)
                           8 (143x)

  4. "CTY_SCR_NBR_WK_01"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         3141
    Smallest value:        0
    Largest value:         10,138.5
    Sum:                   2,959,883.164
    Mean:                  465.025
    Median:                0.94
    StDev:                 1,912.761
    Most common values:    1 (255x)
                           None (231x)
                           0 (193x)
                           0.143 (91x)
                           0.857 (84x)

  5. "CTY_SCR_NBR_WK_02"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         944
    Smallest value:        1
    Largest value:         14,056
    Sum:                   2,540,635
    Mean:                  622.4
    Median:                73
    StDev:                 1,466.989
    Most common values:    None (2514x)
                           39 (46x)
                           43 (45x)
                           51 (45x)
                           29 (42x)

  6. "CTY_SCR_NBR_MO_01"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         4781
    Smallest value:        0
    Largest value:         9,624.613
    Sum:                   2,976,713.567
    Mean:                  462.151
    Median:                0.934
    StDev:                 1,906.679
    Most common values:    None (155x)
                           0.097 (63x)
                           1 (22x)
                           0.129 (21x)
                           0.065 (19x)

  7. "CTY_SCR_NBR_MO_02"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         1566
    Smallest value:        11
    Largest value:         33,897
    Sum:                   10,682,411
    Mean:                  2,610.56
    Median:                290
    StDev:                 5,961.668
    Most common values:    None (2504x)
                           141 (21x)
                           169 (21x)
                           180 (20x)
                           253 (20x)

  8. "CTY_SCR_NBR_QT_01"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         5218
    Smallest value:        0.044
    Largest value:         9,408.122
    Sum:                   2,982,826.846
    Mean:                  459.462
    Median:                0.926
    StDev:                 1,898.813
    Most common values:    None (104x)
                           0.12 (44x)
                           6.533 (32x)
                           6.267 (29x)
                           0.109 (26x)

  9. "CTY_SCR_NBR_QT_02"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         2084
    Smallest value:        52
    Largest value:         78,741
    Sum:                   31,401,398
    Mean:                  7,673.851
    Median:                851
    StDev:                 17,285.547
    Most common values:    None (2504x)
                           469 (13x)
                           725 (13x)
                           663 (13x)
                           858 (12x)

 10. "CTY_SCR_TGT_01"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         8
    Smallest value:        0.75
    Largest value:         95
    Sum:                   29,202.15
    Mean:                  6.631
    Median:                0.8
    StDev:                 21.411
    Most common values:    0.8 (2508x)
                           None (2192x)
                           0.95 (382x)
                           4 (340x)
                           6 (313x)

 11. "CTY_SCR_AVG_01"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         1507
    Smallest value:        0.05
    Largest value:         8,826.548
    Sum:                   2,575,791.761
    Mean:                  1,175.624
    Median:                56.053
    StDev:                 2,656.83
    Most common values:    None (4405x)
                           0.113 (29x)
                           0.097 (26x)
                           0.15 (25x)
                           0.129 (24x)

 12. "CTY_SCR_AVG_02"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         1785
    Smallest value:        0.087
    Largest value:         8,633.411
    Sum:                   2,542,310.677
    Mean:                  1,160.343
    Median:                55.594
    StDev:                 2,618.471
    Most common values:    None (4405x)
                           0.141 (17x)
                           0.136 (17x)
                           0.143 (14x)
                           0.132 (13x)

 13. "CTY_SCR_DEV_01"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         1860
    Smallest value:        0.22
    Largest value:         1,967.8
    Sum:                   485,499.32
    Mean:                  221.588
    Median:                9.834
    StDev:                 523.541
    Most common values:    None (4405x)
                           0.319 (22x)
                           0.303 (14x)
                           0.349 (12x)
                           0.298 (12x)

 14. "CTY_SCR_DEV_02"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         1983
    Smallest value:        0.301
    Largest value:         1,788.384
    Sum:                   493,031.198
    Mean:                  225.026
    Median:                10.173
    StDev:                 527.044
    Most common values:    None (4405x)
                           0.309 (11x)
                           0.342 (8x)
                           0.341 (8x)
                           0.347 (7x)

 15. "ETL_LOAD_DATE"

    Type of data:          DateTime
    Contains null values:  False
    Unique values:         313
    Smallest value:        2016-01-15 00:00:00
    Largest value:         2017-03-28 00:00:00
    Most common values:    2017-03-23 00:00:00 (22x)
                           2017-03-24 00:00:00 (22x)
                           2017-03-10 00:00:00 (22x)
                           2017-03-13 00:00:00 (22x)
                           2017-03-15 00:00:00 (22x)

 16. "ETL_LOAD_IS_ACTIVE_FLAG"

    Type of data:          Boolean
    Contains null values:  False
    Unique values:         2
    Most common values:    False (6574x)
                           True (22x)

 17. "CTY_SCR_OPEN_DATA_SOURCE"

    Type of data:          Text
    Contains null values:  True (excluded from calculations)
    Unique values:         5
    Longest value:         100 characters
    Most common values:    https://data.cityofboston.gov/City-Services/311-Service-Requests/awu8-dc52 (2136x)
                           https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports/7cdf-6fgx (2136x)
                           None (1790x)
                           https://data.cityofboston.gov/Permitting/Approved-Building-Permits/msk6-43c6 (267x)
                           https://data.cityofboston.gov/City-Services/Boston-Public-Library-Daily-Active-User-Counts/mzws-sfys (267x)

 18. "CTY_SCR_METRIC_TYPE"

    Type of data:          Text
    Contains null values:  True (excluded from calculations)
    Unique values:         3
    Longest value:         10 characters
    Most common values:    Target (3761x)
                           Historical (1869x)
                           None (966x)

 19. "CTY_SCR_DAY"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         2818
    Smallest value:        0
    Largest value:         15.061
    Sum:                   6,335.908
    Mean:                  1.158
    Median:                1.083
    StDev:                 0.489
    Most common values:    None (1125x)
                           1.25 (870x)
                           0.938 (77x)
                           1 (77x)
                           0.833 (63x)

 20. "CTY_SCR_WEEK"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         4195
    Smallest value:        0.25
    Largest value:         61.434
    Sum:                   7,785.7
    Mean:                  1.261
    Median:                1.082
    StDev:                 1.048
    Most common values:    None (424x)
                           1.25 (181x)
                           0.938 (32x)
                           0.833 (25x)
                           1 (25x)

 21. "CTY_SCR_MONTH"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         5516
    Smallest value:        0.362
    Largest value:         20.055
    Sum:                   9,059.189
    Mean:                  1.408
    Median:                1.081
    StDev:                 1.411
    Most common values:    None (161x)
                           0.968 (19x)
                           0.97 (18x)
                           0.93 (16x)
                           0.952 (15x)

 22. "CTY_SCR_QUARTER"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         5846
    Smallest value:        0.705
    Largest value:         10.348
    Sum:                   8,639.596
    Mean:                  1.331
    Median:                1.079
    StDev:                 0.85
    Most common values:    None (104x)
                           0.918 (32x)
                           0.957 (29x)
                           0.96 (21x)
                           0.976 (21x)

 23. "CTY_SCR_DAY_NAME"

    Type of data:          Date
    Contains null values:  False
    Unique values:         5
    Smallest value:        0001-01-02
    Largest value:         0001-01-08
    Most common values:    0001-01-05 (1328x)
                           0001-01-04 (1328x)
                           0001-01-08 (1328x)
                           0001-01-03 (1306x)
                           0001-01-02 (1306x)

Row count: 6596
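
For anyone who wants to see what goes into a per-column summary like the ones above, here is a minimal standard-library sketch. It is not how csvstat works internally and does not try to reproduce its type inference or its null handling (for instance, this sketch leaves blanks out of the unique count and the most-common list). The function name `describe_column`, the dictionary keys, and the tiny sample data are made up for illustration.

```python
import csv
import io
import statistics
from collections import Counter


def describe_column(rows, name, top=5):
    """Summarise one column of an iterable of dict rows."""
    raw = [row[name] for row in rows]
    values = [v for v in raw if v not in ("", None)]  # treat blanks as nulls
    summary = {
        "contains_nulls": len(values) < len(raw),
        "unique_values": len(set(values)),
        "most_common": Counter(values).most_common(top),
    }
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # At least one value is not numeric, so report the column as text.
        summary["type"] = "Text"
        summary["longest_value"] = max((len(v) for v in values), default=0)
        return summary
    summary["type"] = "Number"
    if numbers:
        summary.update(
            smallest=min(numbers),
            largest=max(numbers),
            sum=sum(numbers),
            mean=statistics.mean(numbers),
            median=statistics.median(numbers),
            # the sample standard deviation needs at least two values
            stdev=statistics.stdev(numbers) if len(numbers) > 1 else None,
        )
    return summary


# Tiny made-up sample, just to exercise the function.
sample = io.StringIO("name,score\na,1\nb,\na,3\n")
rows = list(csv.DictReader(sample))
print(describe_column(rows, "score"))
```

A dictionary like this would be easy to serialize as JSON and attach to each field in the data dictionary.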

jqnatividad commented 7 years ago

Flagged for implementation:

Workflow:

Post initial implementation:

cc @wardi @davidread