CityOfPhiladelphia / open-budget-data-transformer

Master repo of data cleanup scripts for the various budget cycles
1 stars 1 forks source link

Normalise subclass names related to overtime #8

Closed timwis closed 6 years ago

timwis commented 7 years ago

FY16-adopted used:

Overtime Civilian
Overtime Meals
Overtime Shift Dual Relief
Overtime Shift Uniform

FY17-adopted uses:

Overtime Meals
Overtime-2.5x Hourly Rate
Overtime-Civilian
Overtime/Shift-Dual/Relief
Overtime/Shift-Fire Out of Class
Overtime/Shift-Uniform

The result is duplicate entries in the application since GROUP BY fails. Need to pick a format and then add the cleanup to clean_subclass in standardize_fy16_adopted.py or clean_fy17_adopted.py (depending on which year needs to change).

timwis commented 7 years ago

The underlying issue is the difference between to-title-case and the python alternative. Perhaps the best thing to do is run the raw FY16-adopted file through the new python tool, compare the difference between that and the currently published FY16-adopted (cleaned) file, ensure they're only in how title case is formatted, then use the new file to generate the flare.

timwis commented 7 years ago

I used this script to fetch the FY16-adopted.csv file from the open data portal, and make some basic adjustments to it in order to match known format differences.

# Fetch FY16-adopted.csv from open data portal
curl 'https://data.phila.gov/resource/6quj-54t7.csv?$limit=5000' | \

# Rearrange columns
csvcut -c fiscal_year,fund,department,class_id,class,minor_class_id,minor_class,total | \

# Rename minor -> sub in 2 columns
header -r fiscal_year,fund,department,class_id,class,subclass_id,subclass,total | \

# Rename funds
sed 's/,General Operating Fund,/,General Fund,/g' | \
sed 's/,Water Operating Fund,/,Water Fund,/g' | \

# Remove currency formatting
sed -E 's/,\$([0-9]+)\.00/,\1/g' | \

# Sort the file, excluding the header
csvsort

I then compared the output to the output of the new FY17-adopted cleanup script using daff and published the diffs here. It looks like most of them are related to title case and department name cleaning, which is benign. But it looks like in some cases, daff considers the same row in both files to be a diff, and so it's a bit hard to keep track in that diff html file (which is generated by daff). I also don't know what the : symbol means in diff format.

We can also use git to see the differences, which may be a bit more reliable (reference):

git diff --word-diff --word-diff-regex "[^,\n]+[,\n]|[,]"

Or perhaps even better:

git diff --word-diff --word-diff-regex "[^,]+"
timwis commented 7 years ago

There are going to be some differences such as "Mayor's Office of Sustainability" -> "Office of Sustainability" and perhaps MOTU -> OTIS that need to remain different for the fy17 comparison.

timwis commented 7 years ago

Comparing the outputted diffs seems a bit too much. Might be the best we can do is aggregate totals, grouping by dept/class_id/subclass_id, and also count the number of rows.

timwis commented 7 years ago

Okay, I use the above script to get a normalised version of the published csv file. I then pipe that into the below aggregation script to aggregate it and sort the aggregated file:

csvsql --query "select fund, department, class_id, subclass_id, sum(total) from stdin group by fund, department, class_id, subclass_id" | \

# Sort by less-likely-to-change columns
csvsort -c fund,class_id,subclass_id,department

Then I take the new file (which was created by running the fy16-adopted-unique file budget gave me through the new cleanup script) and aggregate it as well using the above script. Then I use git diff as described above and pass the result into aha to render it as HTML.

git diff --word-diff --word-diff-regex "[^,]+" --color-words | aha > index.html

Here's the resulting diff.

timwis commented 7 years ago

:( This is hard. It would be much easier if the published fy16-adopted file included department IDs. Then we could include that in the aggregations...

I wrote a little script to extract department names and matching IDs from the fy2016-adopted-unique.csv file we got from budget last year. The result is:

key,department
1,City Council
3,"Mayor's Office-labor Relations"
4,Office of Innovation & Technology
5,"Mayor's Office"
6,Office of Housing
8,"Mayor's Office-community Services"
10,Managing Director
11,Police
12,Streets
13,Fire
14,Health
15,Office of Bh/mr Services
16,Recreation
18,Atwater Kent Museum
20,Public Property
22,Human Services
23,Prisons
24,Office of Supportive Housing
25,Fleet Management
26,Licenses & Inspections
27,Board of Licenses & Inspections
28,Water
29,Board of Building Standards
30,Zoning Board
31,Records
32,Historical Commission
34,Art Museum
35,Director of Finance
36,Revenue
37,Sinking Fund Commission
38,Procurement
40,City Treasurer
41,City Representative
42,Commerce
44,Law
45,Board of Ethics
46,"Mayor's Office of Transportation"
47,Youth Commission
48,Mayor-office of the Inspector General
49,Mayor-office of Sustainability
50,Mayors Office-mural Arts Program
51,City Planning Commission
52,Free Library
53,Board of Pensions
54,Commission on Human Relations
55,Civil Service Commission
56,Personnel
58,Office of Arts & Culture
59,Office of Property Assessments
61,City Controller
63,Board of Revision of Taxes
68,Register of Wills
69,District Attorney
70,Sheriff
73,City Commissioners
84,1st Judicial District

Now we can probably use csvjoin to add the dept id to the published file.

timwis commented 6 years ago

Closing as wontfix