This is probably a BOM (byte order mark) at the start of the file: https://en.wikipedia.org/wiki/Byte_order_mark
Hey there; I've been experimenting with this issue, and I've worked out a potential solution by amending the DataReader class as you requested. I tested it against a few of the opendata.dc.gov files you mentioned, and the fix seems to work correctly. I'd like to test it more thoroughly and review everything before I submit a pull request.
(housing-insights) ubuntu@ip-[redacted]:~/projs/hi/housing-insights/python/scripts$ python load_data_bomfix.py docker rebuild
meta.json imported. JSON format is valid: True
Manifest table created in the SQL database
sql_manifest_exists: True
building_permits_2017: preparing to load row 2 from the manifest
File not found. Attempting to download file to disk:
https://opendata.arcgis.com/datasets/81a359c031464c53af6230338dbc848e_37.csv
Unknown content length; proceeding anyways...
Download complete. 430679 bytes downloaded.
Leading BOM (byte order mark) found in file '81a359c031464c53af6230338dbc848e_37.csv'; removing...
do_fields_match: True. meta.json and csv field lists match completely for 'building_permits'
Cleaning...
Loading...
replacing existing table
building_permits table can't be dropped because it doesn't exist
Table created: building_permits
opening /home/ubuntu/projs/hi/housing-insights/python/logs/temp_building_permits_2017.psv
data file loaded into database
.
.
.
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
0000000 357 273 277   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,
0000010   D   C   R   A
0000014
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
00000000 ef bb bf 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c |...X,Y,OBJECTID,|
00000010 44 43 52 41 |DCRA|
00000014
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
0000000   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,   D   C   R
0000010   A   I   N   T
0000014
(housing-insights) ubuntu@ip-[redacted]:~/$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
00000000 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c 44 43 52 |X,Y,OBJECTID,DCR|
00000010 41 49 4e 54 |AINT|
00000014
As for your copy-paste-in-an-editor workaround: I assume that means you performed it on most or all of the affected files hosted in the project's S3 bucket -- is that the case? I'd like to identify all the affected files so I can test more thoroughly. Right now I only have the project's database credentials (from the onboarding/orientation last Tuesday, 2017-04-24); is there any way I can get the project's AWS S3 credentials so I can sync my data via the CLI client (aws s3 sync)?
Thanks, -meg
Hey, Neal--
I'm going to submit a pull request, but I wanted to let you know that I took care of this issue. I put together a test manifest.csv that (1) refers directly to DC's opendata vendor server rather than the S3 bucket with your manually-BOM-stripped .csv files, and (2) includes only the affected files you enumerated (for the building_permits and dc_tax tables):
$ cat manifest_test_bom_fix.csv
include_flag,destination_table,unique_data_id,data_date,encoding,local_folder,s3_folder,filepath,notes
____City Development Data__,,,,,,,,
use,building_permits,building_permits_2016,2/2/2017,latin-1,../../../data,https://opendata.arcgis.com/,datasets/81a359c031464c53af6230338dbc848e_37.csv,FOR TESTING PURPOSES ONLY -- TESTING BYTE-ORDER-MARK (BOM) REMOVAL (Ref: Issue #133)
_____Crosswalks_____,,,,,,,,
_____Neighborhood Info_____,,,,,,,,
_____Preservation Catalog_____,,,,,,,,
_____Planned Unit Developments_____,,,,,,,,
_____Property Tax_____,,,,,,,,
use,dc_tax,dc_tax_20170315,3/15/2017,latin-1,../../../data,https://opendata.arcgis.com/,datasets/014f4b4f94ea461498bfeba877d92319_56.csv,FOR TESTING PURPOSES ONLY -- TESTING BYTE-ORDER-MARK (BOM) REMOVAL (Ref: Issue #133)
_____Other DC Affordable Housing Data_____,,,,,,,,
_____Market Rate Rent_____,,,,,,,,
_____ACS Rent_____,,,,,,,,
____Crime Data____,,,,,,,,
____WMATA Data____,,,,,,,,
I also added a dc_tax entry to a test copy of meta.json:
$ grep -A 3 "dc_tax" meta_test_bom_fix.json
"dc_tax": {
"cleaner": "DCTaxCleaner",
"fields": [
{
... and then, after examining the tax data file for issues, drafted the cleaner:
$ grep -A 7 "DCTaxCleaner" ../housinginsights/ingestion/Cleaners.py
class DCTaxCleaner(CleanerBase):
    def clean(self, row, row_num=None):
        row = self.replace_nulls(row, null_values=['', '\\', None])
        row['OWNER_ADDRESS_CITYSTZIP'] = self.null_value if row['OWNER_ADDRESS_CITYSTZIP'] == ',' else row['OWNER_ADDRESS_CITYSTZIP']
        row['VACANT_USE'] = self.convert_boolean(row['VACANT_USE'].capitalize())
        row = self.parse_dates(row)
        return row
Note: I didn't push this version of Cleaners.py; I just used it for testing. But it does seem to adequately address the SQL-readiness issues and null-value anomalies.
Finally, I modified DataReader.py to strip the BOM sequence from the raw file and correct the headers key dict. Ref: DataReader.py
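For reference, here's a minimal sketch of that stripping step. The _remove_bom helper name is mine, not the actual method in DataReader.py, but the <file>.orig backup convention matches the hexdumps below:

import codecs
import os

def _remove_bom(filepath):
    # Hypothetical helper: if the file starts with the UTF-8 BOM
    # (0xEF 0xBB 0xBF), keep the original as <filepath>.orig and
    # rewrite the file without the leading three bytes.
    with open(filepath, 'rb') as f:
        if f.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
            return False          # no BOM; leave the file untouched
        rest = f.read()           # remainder of the file, BOM excluded
    os.rename(filepath, filepath + '.orig')
    with open(filepath, 'wb') as f:
        f.write(rest)
    return True

This reads the remainder of the file into memory, which is fine at these sizes (the larger file is ~78 MB); a chunked copy would be safer for arbitrarily large downloads. With those pieces in place, here's the test run: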
$ python load_data_test_bom_fix.py docker rebuild
meta_test_bom_fix.json imported. JSON format is valid: True
Manifest table created in the SQL database
sql_manifest_exists: True
building_permits_2016: preparing to load row 2 from the manifest
File not found. Attempting to download file to disk: https://opendata.arcgis.com/datasets/81a359c031464c53af6230338dbc848e_37.csv
Download complete. 430679 bytes downloaded.
Leading BOM (byte order mark) found in file '81a359c031464c53af6230338dbc848e_37.csv'; removing...
do_fields_match: True. meta.json and csv field lists match completely for 'building_permits'
Cleaning...
Loading...
replacing existing table
building_permits table can't be dropped because it doesn't exist
Table created: building_permits
opening /home/ubuntu/projs/hi/housing-insights/python/logs/temp_building_permits_2016.psv
data file loaded into database
dc_tax_20170315: preparing to load row 8 from the manifest
File not found. Attempting to download file to disk: https://opendata.arcgis.com/datasets/014f4b4f94ea461498bfeba877d92319_56.csv
Download complete. 78192217 bytes downloaded.
Leading BOM (byte order mark) found in file '014f4b4f94ea461498bfeba877d92319_56.csv'; removing...
do_fields_match: True. meta.json and csv field lists match completely for 'dc_tax'
Cleaning...
Loading...
replacing existing table
dc_tax table can't be dropped because it doesn't exist
Table created: dc_tax
opening /home/ubuntu/projs/hi/housing-insights/python/logs/temp_dc_tax_20170315.psv
data file loaded into database
... and confirmation that the BOM was stripped from the files:
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv.orig
0000000 357 273 277   O   B   J   E   C   T   I   D   ,   S   S   L   ,
0000010   A   S   S   E
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv.orig
00000000 ef bb bf 4f 42 4a 45 43 54 49 44 2c 53 53 4c 2c |...OBJECTID,SSL,|
00000010 41 53 53 45 |ASSE|
00000014
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv
0000000   O   B   J   E   C   T   I   D   ,   S   S   L   ,   A   S   S
0000010   E   S   S   O
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/014f4b4f94ea461498bfeba877d92319_56.csv
00000000 4f 42 4a 45 43 54 49 44 2c 53 53 4c 2c 41 53 53 |OBJECTID,SSL,ASS|
00000010 45 53 53 4f |ESSO|
00000014
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
0000000 357 273 277   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,
0000010   D   C   R   A
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv.orig
00000000 ef bb bf 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c |...X,Y,OBJECTID,|
00000010 44 43 52 41 |DCRA|
00000014
$ hexdump -c -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
0000000   X   ,   Y   ,   O   B   J   E   C   T   I   D   ,   D   C   R
0000010   A   I   N   T
0000014
$ hexdump -C -n 20 ~/projs/hi/housing-insights/data/datasets/81a359c031464c53af6230338dbc848e_37.csv
00000000 58 2c 59 2c 4f 42 4a 45 43 54 49 44 2c 44 43 52 |X,Y,OBJECTID,DCR|
00000010 41 49 4e 54 |AINT|
00000014
... and confirmation that there is exactly a 3-byte difference in file size between the downloaded versions with the leading 0xEF 0xBB 0xBF sequence and the BOM-stripped files generated by the updated DataReader.py:
$ cd ~/projs/hi/housing-insights/data/datasets/
$ stat -c "%s %n" 81a359c031464c53af6230338dbc848e_37.csv*
430676 81a359c031464c53af6230338dbc848e_37.csv
430679 81a359c031464c53af6230338dbc848e_37.csv.orig
$ stat -c "%s %n" 014f4b4f94ea461498bfeba877d92319_56.csv*
78192214 014f4b4f94ea461498bfeba877d92319_56.csv
78192217 014f4b4f94ea461498bfeba877d92319_56.csv.orig
@dogtoothdigital This is awesome! Sorry I didn't get you the credentials earlier. Did you fill out the onboarding survey? I don't see your GitHub username in the responses. If you fill that out I can send you the secrets.json on either Slack or email (for future use).
Looking forward to the PR; it looks like it should be good to go.
@NealHumphrey Re the onboarding survey -- I thought I had, because I've been getting project email, but maybe not; I did it again just in case. Sorry! -meg
Reading from the beginning of the file (i.e., the first header row) of files downloaded from opendata.dc.gov produces this type of thing:
"\u00ef\u00bb\u00bfobjectid"
This is true for both the dc_tax data and the building_permits data. Telling Python to parse the file as utf-8 (vs. latin-1) produces similar results, though with slightly fewer extra byte characters.
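As an illustration of why both encodings show junk (the byte values are the same ones in the hexdumps above): the three BOM bytes decode to three characters under latin-1 and to one character under utf-8, while Python's utf-8-sig codec consumes them entirely:

raw = b'\xef\xbb\xbfobjectid'        # first bytes of a downloaded CSV

print(raw.decode('latin-1'))         # 'ï»¿objectid' -> three junk characters
print(repr(raw.decode('utf-8')[0]))  # '\ufeff'      -> one junk character
print(raw.decode('utf-8-sig'))       # 'objectid'    -> BOM consumed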
For now I am copy-pasting the contents of the file from a text editor into a brand-new file to resolve this. We'll need an automated method in the future, probably by figuring out what the encoding type is or by parsing those extraneous bytes out.
We need the DataReader class to be modified to handle this.
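One lightweight option for the automated route -- a sketch only, not necessarily how DataReader should be structured -- is to open downloaded files with Python's utf-8-sig codec, which consumes a leading BOM if present and behaves like plain utf-8 otherwise:

import csv

# Path shown for illustration; utf-8-sig strips a leading BOM, so the
# first header field comes back as 'X' rather than '\ufeffX'.
with open('data/datasets/81a359c031464c53af6230338dbc848e_37.csv',
          encoding='utf-8-sig', newline='') as f:
    headers = next(csv.reader(f))

The caveat is that the manifest declares these files as latin-1, so stripping the raw BOM bytes before any decoding (the approach ultimately taken above) sidesteps the encoding question entirely.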