agude / SWITRS-to-SQLite

Python script for converting California's Statewide Integrated Traffic Records System (SWITRS) reports to SQLite.
https://alexgude.com/blog/switrs-to-sqlite/
Other
9 stars 3 forks source link

allow user to specify handler for text decoder errors #14

Closed gonewest818 closed 10 months ago

gonewest818 commented 10 months ago

Somewhere in the past few years at least one row in CollisionRecords.txt contains a character that cannot be decoded with the default utf-8 codec. As a result, the process fails:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd9 in position 3988: invalid continuation byte

This PR adds a command line flag that allows the user to specify the errors argument in open() and gzip.open().

The command line flag is -p:

usage: switrs_to_sqlite [-h] [--version] [-p {strict,ignore,replace}] [-o OUTPUT_FILE]
                        collision_record party_record victim_record

Convert SWITRS text files to a SQLite3 database

positional arguments:
  collision_record      the CollisionRecords.txt file or the same file gzipped
  party_record          the PartyRecords.txt file or the same file gzipped
  victim_record         the VictimRecords.txt file or the same file gzipped

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -p {strict,ignore,replace}, --parse-error {strict,ignore,replace}
                        how to handle parsing errors
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        file to save the database to

and the default is None which results in "strict" error handling just as before. While technically there are other handlers available, these seemed sufficient for now. I successfully imported a SWITRS dataset obtained yesterday with this patch.

Note: this PR depends on #13.

agude commented 10 months ago

Thanks for the fix @gonewest818! I'll take a deeper look at this one soon.

In the meantime do you happen to have the line that caused the error? Would love to throw it into the tests.

gonewest818 commented 10 months ago

Unfortunately, no, I tried to isolate the line but ran out of time.

agude commented 10 months ago

@gonewest818

Ok, I'll look for the bad line. Just requested updated data.

Can you merge master into this branch? I fixed the tests with #15.

agude commented 10 months ago

The problem character is in CollisionRecords.txt. Here is the error and proceeding parse results:

156023 ['82122448', '2023', '20230721', '3300', '20230309', '0825', '3831', '011F', '4', '5']
156024 ['82122449', '2023', '20230721', '3300', '20230417', '0721', '2812', '012F', '1', '5']
156025 ['82122450', '2023', '20230721', '3300', '20230530', '0832', 'N6635', '011K', '2', '5']
Traceback (most recent call last):
  File "/home/agude/.pyenv/versions/default/bin/switrs_to_sqlite", line 11, in <module>
    load_entry_point('switrs-to-sqlite==4.0.1', 'console_scripts', 'switrs_to_sqlite')()
  File "/home/agude/.pyenv/versions/3.7.9/envs/default/lib/python3.7/site-packages/switrs_to_sqlite/main.py", line 72, in main
    for i, row in enumerate(reader):
  File "/home/agude/.pyenv/versions/3.7.9/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd9 in position 1303: invalid continuation byte
exit: 1

Looking at the raw file:

"82122448",2023,20230721,3300,20230309,0825,"3831","011F","4","5","4","3344","0","0","0","","0","062","SUMNER AV","SCHLEISMAN RD","0.00","","Y","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","04",21703,"","N","C","C","A","A","H","-","A","A","0","","","","","Y","","A","01",0,0,0,0,0,0,0,0,0,"-","-",33.96096,117.57573
"82122449",2023,20230721,3300,20230417,0721,"2812","012F","1","5","4","3344","0","0","0","","0","064","ARCHIBALD AV","WIND RIVER RD","0.00","","Y","B","-","N","",,,"","",,"","","","Y","4",0,1,2,"A","-","01",23152,"A","N","A","C","A","A","H","-","A","A","0","","","","","Y","Y","A","01",0,0,1,0,0,0,0,0,0,"-","-",33.93720,117.59315
"82122450",2023,20230721,3300,20230530,0832,"N6635","011K","2","5","4","3344","0","0","0","","0","061","68TH ST","WELLS SPRINGS ST","0.00","","Y","A","-","N","",,,"","",,"","","","N","3",0,1,2,"A","-","10",21950,"C","N","G","B","B","A","H","-","A","A","0","Y","","","","Y","","A","07",0,1,0,0,1,0,0,0,0,"-","-",33.96443,117.55305
"82122451",2023,20230721,3300,20230706,1850,"3319","011K","4","5","4","3344","0","0","0","","0","061","68TH ST","WELLS SPRINGS ST","0.00","","Y","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","12",22450,"A","M","D","C","A","A","H","-","A","A","0","","","","","Y","","A","01",0,0,0,0,0,0,0,0,0,"-","-",33.96403,117.55255
"82122452",2023,20230721,3300,20230707,1106,"3831","012G","5","5","4","3344","0","0","0","","0","064","SCHLEISMAN RD","ARCHIBALD AV","0.00","","Y","A","-","N","",,,"","",,"","","","Y","0",0,0,1,"C","-","18",,"","N","F","A","A","A","H","-","A","A","0","","","","Y","Y","","-","- ",0,0,0,0,0,0,0,0,0,"-","-",33.96081,117.59322
"82122453",2023,20230721,3300,20230708,1443,"N5791","011C","6","5","4","3344","0","0","0","","0","061","LIMONITE AV","HAMNER AV","295.00","E","N","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","03",22350,"","M","C","C","A","A","H","-","A","A","0","","","","","Y","","A","07",0,0,0,0,0,0,0,0,0,"-","-",33.97534,117.55835
"82122454",2023,20230721,3300,20230710,0725,"2812","011F","1","5","4","3344","0","0","0","","0","062","68TH ST","ANGUS ST","0.00","","Y","A","-","N","",,,"","",,"","","","Y","3",0,1,2,"A","-","09",21801,"A","N","D","C","A","A","H","-","A","D","0","","","","","Y","","A","07",0,1,0,0,0,0,0,0,0,"-","-",33.96442,117.56298

So it's probably:

"82122451",2023,20230721,3300,20230706,1850,"3319","011K","4","5","4","3344","0","0","0","","0","061","68TH ST","WELLS SPRINGS ST","0.00","","Y","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","12",22450,"A","M","D","C","A","A","H","-","A","A","0","","","","","Y","","A","01",0,0,0,0,0,0,0,0,0,"-","-",33.96403,117.55255

I'll try isolating that one.

Breaks even if unziped.

agude commented 10 months ago

Found it!

"82122456",2023,20230721,3300,20230717,0550,"2812","011A","1","5","4","3344","0","0","0","","0","060","GOODMAN WAY","CANTÙ GALLEANO RANCH RD","60.00","S","N","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","21",22106,"","N","C","C","A","A","H","-","B","A","0","","","","Y","Y","","F","27",0,0,0,0,0,0,0,0,0,"-","-",,

Specifically CANTÙ.

Here is a CollisionRecords.txt the reproduces the error with only a single line: CollisionRecords_error_only.txt

gonewest818 commented 10 months ago

Nice find! When I was looking at it I was thrown off track, I think it must be that the Python interpreter reads ahead with larger chunks, so the parse error was occurring in a line we hadn't seen yet.

I rebased my branch to your master and force-pushed. CI checks are running.

agude commented 10 months ago

Thanks again @gonewest818!