This is probably a BOM (byte order mark) at the start of the file: https://en.wikipedia.org/wiki/Byte_order_mark
Hey there; I've been experimenting with this issue, and I've worked out a potential solution by amending the DataReader class as you requested. I tested it against a few of the opendata.dc.gov files you mentioned, and the fix seems to work correctly. I'd like to test it more thoroughly and review everything before I submit a pull request.
(housing-insights) ubuntu@ip-[redacted]:~/projs/hi/housing-insights/python/scripts$ python load_data_bomfix.py docker rebuild
meta.json imported. JSON format is valid: True
Manifest table created in the SQL database
sql_manifest_exists: True
building_permits_2017: preparing to load row 2 from the manifest
File not found. Attempting to download file to disk:
https://opendata.arcgis.com/datasets/81a359c031464c53af6230338dbc848e_37.csv
Unknown content length; proceeding anyways...
Download complete. 430679 bytes downloaded.
Leading BOM (byte order mark) found in file '81a359c031464c53af6230338dbc848e_37.csv'; removing...
do_fields_match: True. meta.json and csv field lists match completely for 'building_permits'
Cleaning...
Loading...
replacing existing table
building_permits table can't be dropped because it doesn't exist
Table created: building_permits
opening /home/ubuntu/projs/hi/housing-insights/python/logs/temp_building_permits_2017.psv
data file loaded into database
.
.
.
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
0000000 357 273 277   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,
0000010   D   C   R   A
0000014
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
00000000 ef bb bf 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c |...X,Y,OBJECTID,|
00000010 44 43 52 41 |DCRA|
00000014
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
0000000   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,   D   C   R
0000010   A   I   N   T
0000014
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
00000000 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c 44 43 52 |X,Y,OBJECTID,DCR|
00000010 41 49 4e 54 |AINT|
00000014
As for your copy-paste-in-an-editor workaround: I assume that means you performed it on most or all of the affected files hosted in the project's S3 bucket -- is that the case? I'd like to identify all the affected files so I can test more thoroughly. Right now I only have the project's database credentials (from the onboarding/orientation last Tuesday, 2017-04-24); is there any way I can get the project's AWS S3 credentials so I can sync my data via the CLI client (aws s3 sync)?
Thanks, -meg
Hey, Neal--
I'm going to submit a pull request, but I wanted to let you know that I took care of this issue. I put together a test manifest.csv that (1) refers directly to DC's opendata vendor server rather than the S3 bucket with your manually-BOM-stripped .csv files, and (2) includes only the affected files you enumerated (for the building_permits and dc_tax tables):
$ cat manifest_test_bom_fix.csv
include_flag,destination_table,unique_data_id,data_date,encoding,local_folder,s3_folder,filepath,notes
____City Development Data__,,,,,,,,
use,building_permits,building_permits_2016,2/2/2017,latin-1,../../../data,https://opendata.arcgis.com/,datasets/81a359c031464c53af6230338dbc848e_37.csv,FOR TESTING PURPOSES ONLY -- TESTING BYTE-ORDER-MARK (BOM) REMOVAL (Ref: Issue #133)
_____Crosswalks_____,,,,,,,,
_____Neighborhood Info_____,,,,,,,,
_____Preservation Catalog_____,,,,,,,,
_____Planned Unit Developments_____,,,,,,,,
_____Property Tax_____,,,,,,,,
use,dc_tax,dc_tax_20170315,3/15/2017,latin-1,../../../data,https://opendata.arcgis.com/,datasets/014f4b4f94ea461498bfeba877d92319_56.csv,FOR TESTING PURPOSES ONLY -- TESTING BYTE-ORDER-MARK (BOM) REMOVAL (Ref: Issue #133)
_____Other DC Affordable Housing Data_____,,,,,,,,
_____Market Rate Rent_____,,,,,,,,
_____ACS Rent_____,,,,,,,,
____Crime Data____,,,,,,,,
____WMATA Data____,,,,,,,,
I also added a dc_tax entry to a test copy of meta.json:
$ grep -A 3 "dc_tax" meta_test_bom_fix.json
"dc_tax": {
"cleaner": "DCTaxCleaner",
"fields": [
{
... and then, after examining the tax data file for issues, drafted the cleaner:
$ grep -A 7 "DCTaxCleaner" ../housinginsights/ingestion/Cleaners.py
class DCTaxCleaner(CleanerBase):
    def clean(self, row, row_num=None):
        row = self.replace_nulls(row, null_values=['', '\\', None])
        row['OWNER_ADDRESS_CITYSTZIP'] = self.null_value if row['OWNER_ADDRESS_CITYSTZIP'] == ',' else row['OWNER_ADDRESS_CITYSTZIP']
        row['VACANT_USE'] = self.convert_boolean(row['VACANT_USE'].capitalize())
        row = self.parse_dates(row)
        return row
Note: I didn't push this version of Cleaners.py; I just used it for testing. But it does seem to adequately address the SQL-readiness issues and null-value anomalies.
Finally, I modified DataReader.py to strip the BOM sequence from the raw file and correct the headers key dict. Ref: DataReader.py
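For reference, here's a minimal sketch of that stripping step. The _remove_bom helper name is mine, not the actual method in DataReader.py, but the <file>.orig backup convention matches the hexdumps below:

import codecs
import os

def _remove_bom(filepath):
    # Hypothetical helper: if the file starts with the UTF-8 BOM
    # (0xEF 0xBB 0xBF), keep the original as <filepath>.orig and
    # rewrite the file without the leading three bytes.
    with open(filepath, 'rb') as f:
        if f.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
            return False          # no BOM; leave the file untouched
        rest = f.read()           # remainder of the file, BOM excluded
    os.rename(filepath, filepath + '.orig')
    with open(filepath, 'wb') as f:
        f.write(rest)
    return True

This reads the remainder of the file into memory, which is fine at these sizes (the larger file is ~78 MB); a chunked copy would be safer for arbitrarily large downloads. With those pieces in place, here's the test run: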
$ python load_data_test_bom_fix.py docker rebuild
meta_test_bom_fix.json imported. JSON format is valid: True
Manifest table created in the SQL database
sql_manifest_exists: True
building_permits_2016: preparing to load row 2 from the manifest
File not found. Attempting to download file to disk: https://opendata.arcgis.com/datasets/81a359c031464c53af6230338dbc848e_37.csv
Download complete. 430679 bytes downloaded.
Leading BOM (byte order mark) found in file '81a359c031464c53af6230338dbc848e_37.csv'; removing...
do_fields_match: True. meta.json and csv field lists match completely for 'building_permits'
Cleaning...
Loading...
replacing existing table
building_permits table can't be dropped because it doesn't exist
Table created: building_permits
opening /home/ubuntu/projs/hi/housing-insights/python/logs/temp_building_permits_2016.psv
data file loaded into database
dc_tax_20170315: preparing to load row 8 from the manifest
File not found. Attempting to download file to disk: https://opendata.arcgis.com/datasets/014f4b4f94ea461498bfeba877d92319_56.csv
Download complete. 78192217 bytes downloaded.
Leading BOM (byte order mark) found in file '014f4b4f94ea461498bfeba877d92319_56.csv'; removing...
do_fields_match: True. meta.json and csv field lists match completely for 'dc_tax'
Cleaning...
Loading...
replacing existing table
dc_tax table can't be dropped because it doesn't exist
Table created: dc_tax
opening /home/ubuntu/projs/hi/housing-insights/python/logs/temp_dc_tax_20170315.psv
data file loaded into database
... and confirmation that the BOM was stripped from the files:
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv.orig
0000000 357 273 277   O   B   J   E   C   T   I   D   ,   S   S   L   ,
0000010   A   S   S   E
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv.orig
00000000 ef bb bf 4f 42 4a 45 43 54 49 44 2c 53 53 4c 2c |...OBJECTID,SSL,|
00000010 41 53 53 45 |ASSE|
00000014
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv
0000000   O   B   J   E   C   T   I   D   ,   S   S   L   ,   A   S   S
0000010   E   S   S   O
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv
00000000 4f 42 4a 45 43 54 49 44 2c 53 53 4c 2c 41 53 53 |OBJECTID,SSL,ASS|
00000010 45 53 53 4f |ESSO|
00000014
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
0000000 357 273 277   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,
0000010   D   C   R   A
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
00000000 ef bb bf 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c |...X,Y,OBJECTID,|
00000010 44 43 52 41 |DCRA|
00000014
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
0000000   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,   D   C   R
0000010   A   I   N   T
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
00000000 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c 44 43 52 |X,Y,OBJECTID,DCR|
00000010 41 49 4e 54 |AINT|
00000014
... and confirmation that there is exactly a 3-byte difference in file size between the downloaded versions with the leading 0xEF 0xBB 0xBF sequence and the BOM-stripped files generated by the updated DataReader.py:
$ cd ~/projs/hi/housing-insights/data/datasets/
$ stat -c "%s %n" 81a359c031464c53af6230338dbc848e_37.csv*
430676 81a359c031464c53af6230338dbc848e_37.csv
430679 81a359c031464c53af6230338dbc848e_37.csv.orig
$ stat -c "%s %n" 014f4b4f94ea461498bfeba877d92319_56.csv*
78192214 014f4b4f94ea461498bfeba877d92319_56.csv
78192217 014f4b4f94ea461498bfeba877d92319_56.csv.orig
@dogtoothdigital This is awesome! Sorry I didn't get you the credentials earlier. Did you fill out the onboarding survey? I don't see your GitHub username in the responses. If you fill that out I can send you the secrets.json on either Slack or email (for future use).
Looking forward to the PR; it looks like it should be good to go.
@NealHumphrey Re the onboarding survey -- I thought I had, because I've been getting project email, but maybe not; I did it again just in case. Sorry! -meg
Reading from the beginning of the file (i.e., the first header row) of files downloaded from opendata.dc.gov produces this type of thing:
"\u00ef\u00bb\u00bfobjectid"
This is true for both the dc_tax data and the building_permits data. Telling Python to parse the file as utf-8 (vs. latin-1) produces similar results, though with slightly fewer extra byte characters.
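As an illustration of why both encodings show junk (the byte values are the same ones in the hexdumps above): the three BOM bytes decode to three characters under latin-1 and to one character under utf-8, while Python's utf-8-sig codec consumes them entirely:

raw = b'\xef\xbb\xbfobjectid'        # first bytes of a downloaded CSV

print(raw.decode('latin-1'))         # 'ï»¿objectid' -> three junk characters
print(repr(raw.decode('utf-8')[0]))  # '\ufeff'      -> one junk character
print(raw.decode('utf-8-sig'))       # 'objectid'    -> BOM consumed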
For now I am copy-pasting the contents of the file from a text editor into a brand-new file to resolve this. We'll need an automated method in the future, probably by figuring out what the encoding type is or by parsing those extraneous bytes out.
We need the DataReader class to be modified to handle this.
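One lightweight option for the automated route -- a sketch only, not necessarily how DataReader should be structured -- is to open downloaded files with Python's utf-8-sig codec, which consumes a leading BOM if present and behaves like plain utf-8 otherwise:

import csv

# Path shown for illustration; utf-8-sig strips a leading BOM, so the
# first header field comes back as 'X' rather than '\ufeffX'.
with open('data/datasets/81a359c031464c53af6230338dbc848e_37.csv',
          encoding='utf-8-sig', newline='') as f:
    headers = next(csv.reader(f))

The caveat is that the manifest declares these files as latin-1, so stripping the raw BOM bytes before any decoding (the approach ultimately taken above) sidesteps the encoding question entirely.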