Open jqnatividad opened 7 years ago
Woohoo! The semi-auto, resource-level data dictionary is here! https://github.com/ckan/ckan/pull/3414
We should consider expanding the data dictionary with more descriptive statistics by using a tool like https://csvkit.readthedocs.io.
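As a rough illustration of the kind of per-column statistics meant here (null check, unique count, smallest/largest, sum, mean, median, standard deviation, most common values), here is a minimal standard-library sketch. It is not csvstat's implementation: the function name `column_stats`, the result keys, and the rule that an empty string counts as null are all assumptions made for this example, and the type detection is far cruder than what csvkit does.

```python
import csv
import io
import statistics
from collections import Counter


def column_stats(rows, column, top_n=5):
    """Summarise one column of a list of dict rows (hypothetical helper).

    Assumption: an empty string is treated as a null value.
    """
    raw = [row[column] for row in rows]
    non_null = [v for v in raw if v != ""]
    # Nulls are counted under the key None so they show up among common values.
    counts = Counter(v if v != "" else None for v in raw)
    summary = {
        "contains_nulls": len(non_null) < len(raw),
        "unique_values": len(counts),
        "most_common": counts.most_common(top_n),
    }
    try:
        numbers = [float(v) for v in non_null]
    except ValueError:
        # At least one value is not numeric: report the column as text.
        summary["type"] = "Text"
        summary["longest_value"] = max((len(v) for v in non_null), default=0)
        return summary
    summary["type"] = "Number"
    if numbers:
        summary.update(
            smallest=min(numbers),
            largest=max(numbers),
            sum=sum(numbers),
            mean=statistics.mean(numbers),
            median=statistics.median(numbers),
        )
    if len(numbers) > 1:
        # Sample standard deviation; needs at least two data points.
        summary["stdev"] = statistics.stdev(numbers)
    return summary


if __name__ == "__main__":
    sample = "name,score\na,1\nb,\na,3\n"  # made-up data for illustration
    rows = list(csv.DictReader(io.StringIO(sample)))
    for name in ("name", "score"):
        print(name, column_stats(rows, name))
```

A result shaped like this could be stored alongside each field of the data dictionary, but how it would actually be persisted is left open here.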
Perhaps, as a datapusher "post-processor", we can call csvstat asynchronously to compile descriptive statistics.
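One way the asynchronous call could look is sketched below. This is only a sketch under stated assumptions: it assumes csvkit is installed so that `csvstat` is on the PATH, and the function name `run_stats_command`, its `command` parameter, and the timeout default are made up for illustration. It does not use any datapusher hook, since the issue does not say how a post-processor would be registered.

```python
import asyncio
import sys
from typing import Optional, Sequence


async def run_stats_command(
    csv_path: str,
    command: Sequence[str] = ("csvstat",),  # assumes csvkit's CLI is on PATH
    timeout: Optional[float] = 600.0,       # arbitrary illustrative limit
) -> str:
    """Run a stats CLI on csv_path without blocking the event loop.

    Returns the command's stdout as text (hypothetical helper).
    """
    proc = await asyncio.create_subprocess_exec(
        *command,
        csv_path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a long-running scan behind if we give up on it.
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        message = stderr.decode("utf-8", errors="replace")
        raise RuntimeError(f"{command[0]} exited with {proc.returncode}: {message}")
    return stdout.decode("utf-8", errors="replace")
```

With csvkit installed, `asyncio.run(run_stats_command("rptcityscoresummary.csv"))` would return the text report. Keeping the command configurable makes it easy to swap in another tool or to test the wrapper without csvkit.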
As a test, we ran csvstat on this sample file from data.boston.gov (rptcityscoresummary.csv.zip, taken Mar 28) and got this report (rptcityscoresummary-descstats.txt). It took about a minute on a 2016 MacBook to scan the file and create the report.
  1. "CTY_SCR_NAME"
        Type of data:          Text
        Contains null values:  False
        Unique values:         22
        Longest value:         34 characters
        Most common values:    CITY SERVICES SATISFACTION SURVEYS (313x)
                               GRAFFITI ON-TIME % (313x)
                               MISSED TRASH ON-TIME % (313x)
                               PARKS MAINTENANCE ON-TIME % (313x)
                               POTHOLE ON-TIME % (313x)

  2. "CTY_SCR_NBR_DY_01"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         1673
        Smallest value:        0
        Largest value:         11,854
        Sum:                   2,662,821.802
        Mean:                  445.586
        Median:                1
        StDev:                 1,913.897
        Most common values:    1 (1100x)
                               None (620x)
                               0 (523x)
                               2 (113x)
                               0.667 (79x)

  3. "CTY_SCR_NBR_DY_02"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         492
        Smallest value:        1
        Largest value:         3,295
        Sum:                   463,537
        Mean:                  122.499
        Median:                16
        StDev:                 301.434
        Most common values:    None (2812x)
                               4 (157x)
                               6 (152x)
                               5 (145x)
                               8 (143x)

  4. "CTY_SCR_NBR_WK_01"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         3141
        Smallest value:        0
        Largest value:         10,138.5
        Sum:                   2,959,883.164
        Mean:                  465.025
        Median:                0.94
        StDev:                 1,912.761
        Most common values:    1 (255x)
                               None (231x)
                               0 (193x)
                               0.143 (91x)
                               0.857 (84x)

  5. "CTY_SCR_NBR_WK_02"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         944
        Smallest value:        1
        Largest value:         14,056
        Sum:                   2,540,635
        Mean:                  622.4
        Median:                73
        StDev:                 1,466.989
        Most common values:    None (2514x)
                               39 (46x)
                               43 (45x)
                               51 (45x)
                               29 (42x)

  6. "CTY_SCR_NBR_MO_01"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         4781
        Smallest value:        0
        Largest value:         9,624.613
        Sum:                   2,976,713.567
        Mean:                  462.151
        Median:                0.934
        StDev:                 1,906.679
        Most common values:    None (155x)
                               0.097 (63x)
                               1 (22x)
                               0.129 (21x)
                               0.065 (19x)

  7. "CTY_SCR_NBR_MO_02"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         1566
        Smallest value:        11
        Largest value:         33,897
        Sum:                   10,682,411
        Mean:                  2,610.56
        Median:                290
        StDev:                 5,961.668
        Most common values:    None (2504x)
                               141 (21x)
                               169 (21x)
                               180 (20x)
                               253 (20x)

  8. "CTY_SCR_NBR_QT_01"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         5218
        Smallest value:        0.044
        Largest value:         9,408.122
        Sum:                   2,982,826.846
        Mean:                  459.462
        Median:                0.926
        StDev:                 1,898.813
        Most common values:    None (104x)
                               0.12 (44x)
                               6.533 (32x)
                               6.267 (29x)
                               0.109 (26x)

  9. "CTY_SCR_NBR_QT_02"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         2084
        Smallest value:        52
        Largest value:         78,741
        Sum:                   31,401,398
        Mean:                  7,673.851
        Median:                851
        StDev:                 17,285.547
        Most common values:    None (2504x)
                               469 (13x)
                               725 (13x)
                               663 (13x)
                               858 (12x)

 10. "CTY_SCR_TGT_01"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         8
        Smallest value:        0.75
        Largest value:         95
        Sum:                   29,202.15
        Mean:                  6.631
        Median:                0.8
        StDev:                 21.411
        Most common values:    0.8 (2508x)
                               None (2192x)
                               0.95 (382x)
                               4 (340x)
                               6 (313x)

 11. "CTY_SCR_AVG_01"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         1507
        Smallest value:        0.05
        Largest value:         8,826.548
        Sum:                   2,575,791.761
        Mean:                  1,175.624
        Median:                56.053
        StDev:                 2,656.83
        Most common values:    None (4405x)
                               0.113 (29x)
                               0.097 (26x)
                               0.15 (25x)
                               0.129 (24x)

 12. "CTY_SCR_AVG_02"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         1785
        Smallest value:        0.087
        Largest value:         8,633.411
        Sum:                   2,542,310.677
        Mean:                  1,160.343
        Median:                55.594
        StDev:                 2,618.471
        Most common values:    None (4405x)
                               0.141 (17x)
                               0.136 (17x)
                               0.143 (14x)
                               0.132 (13x)

 13. "CTY_SCR_DEV_01"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         1860
        Smallest value:        0.22
        Largest value:         1,967.8
        Sum:                   485,499.32
        Mean:                  221.588
        Median:                9.834
        StDev:                 523.541
        Most common values:    None (4405x)
                               0.319 (22x)
                               0.303 (14x)
                               0.349 (12x)
                               0.298 (12x)

 14. "CTY_SCR_DEV_02"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         1983
        Smallest value:        0.301
        Largest value:         1,788.384
        Sum:                   493,031.198
        Mean:                  225.026
        Median:                10.173
        StDev:                 527.044
        Most common values:    None (4405x)
                               0.309 (11x)
                               0.342 (8x)
                               0.341 (8x)
                               0.347 (7x)

 15. "ETL_LOAD_DATE"
        Type of data:          DateTime
        Contains null values:  False
        Unique values:         313
        Smallest value:        2016-01-15 00:00:00
        Largest value:         2017-03-28 00:00:00
        Most common values:    2017-03-23 00:00:00 (22x)
                               2017-03-24 00:00:00 (22x)
                               2017-03-10 00:00:00 (22x)
                               2017-03-13 00:00:00 (22x)
                               2017-03-15 00:00:00 (22x)

 16. "ETL_LOAD_IS_ACTIVE_FLAG"
        Type of data:          Boolean
        Contains null values:  False
        Unique values:         2
        Most common values:    False (6574x)
                               True (22x)

 17. "CTY_SCR_OPEN_DATA_SOURCE"
        Type of data:          Text
        Contains null values:  True (excluded from calculations)
        Unique values:         5
        Longest value:         100 characters
        Most common values:    https://data.cityofboston.gov/City-Services/311-Service-Requests/awu8-dc52 (2136x)
                               https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports/7cdf-6fgx (2136x)
                               None (1790x)
                               https://data.cityofboston.gov/Permitting/Approved-Building-Permits/msk6-43c6 (267x)
                               https://data.cityofboston.gov/City-Services/Boston-Public-Library-Daily-Active-User-Counts/mzws-sfys (267x)

 18. "CTY_SCR_METRIC_TYPE"
        Type of data:          Text
        Contains null values:  True (excluded from calculations)
        Unique values:         3
        Longest value:         10 characters
        Most common values:    Target (3761x)
                               Historical (1869x)
                               None (966x)

 19. "CTY_SCR_DAY"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         2818
        Smallest value:        0
        Largest value:         15.061
        Sum:                   6,335.908
        Mean:                  1.158
        Median:                1.083
        StDev:                 0.489
        Most common values:    None (1125x)
                               1.25 (870x)
                               0.938 (77x)
                               1 (77x)
                               0.833 (63x)

 20. "CTY_SCR_WEEK"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         4195
        Smallest value:        0.25
        Largest value:         61.434
        Sum:                   7,785.7
        Mean:                  1.261
        Median:                1.082
        StDev:                 1.048
        Most common values:    None (424x)
                               1.25 (181x)
                               0.938 (32x)
                               0.833 (25x)
                               1 (25x)

 21. "CTY_SCR_MONTH"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         5516
        Smallest value:        0.362
        Largest value:         20.055
        Sum:                   9,059.189
        Mean:                  1.408
        Median:                1.081
        StDev:                 1.411
        Most common values:    None (161x)
                               0.968 (19x)
                               0.97 (18x)
                               0.93 (16x)
                               0.952 (15x)

 22. "CTY_SCR_QUARTER"
        Type of data:          Number
        Contains null values:  True (excluded from calculations)
        Unique values:         5846
        Smallest value:        0.705
        Largest value:         10.348
        Sum:                   8,639.596
        Mean:                  1.331
        Median:                1.079
        StDev:                 0.85
        Most common values:    None (104x)
                               0.918 (32x)
                               0.957 (29x)
                               0.96 (21x)
                               0.976 (21x)

 23. "CTY_SCR_DAY_NAME"
        Type of data:          Date
        Contains null values:  False
        Unique values:         5
        Smallest value:        0001-01-02
        Largest value:         0001-01-08
        Most common values:    0001-01-05 (1328x)
                               0001-01-04 (1328x)
                               0001-01-08 (1328x)
                               0001-01-03 (1306x)
                               0001-01-02 (1306x)

Row count: 6596
Flagged for implementation:
Workflow:
Post initial implementation:
cc @wardi @davidread