Closed gonewest818 closed 10 months ago
Thanks for the fix @gonewest818! I'll take a deeper look at this one soon.
In the meantime do you happen to have the line that caused the error? Would love to throw it into the tests.
Unfortunately, no, I tried to isolate the line but ran out of time.
@gonewest818 Ok, I'll look for the bad line. Just requested updated data.
Can you merge master into this branch? I fixed the tests with #15.
The problem character is in CollisionRecords.txt. Here is the error and the preceding parse results:
156023 ['82122448', '2023', '20230721', '3300', '20230309', '0825', '3831', '011F', '4', '5']
156024 ['82122449', '2023', '20230721', '3300', '20230417', '0721', '2812', '012F', '1', '5']
156025 ['82122450', '2023', '20230721', '3300', '20230530', '0832', 'N6635', '011K', '2', '5']
Traceback (most recent call last):
File "/home/agude/.pyenv/versions/default/bin/switrs_to_sqlite", line 11, in <module>
load_entry_point('switrs-to-sqlite==4.0.1', 'console_scripts', 'switrs_to_sqlite')()
File "/home/agude/.pyenv/versions/3.7.9/envs/default/lib/python3.7/site-packages/switrs_to_sqlite/main.py", line 72, in main
for i, row in enumerate(reader):
File "/home/agude/.pyenv/versions/3.7.9/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd9 in position 1303: invalid continuation byte
exit: 1
Looking at the raw file:
"82122448",2023,20230721,3300,20230309,0825,"3831","011F","4","5","4","3344","0","0","0","","0","062","SUMNER AV","SCHLEISMAN RD","0.00","","Y","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","04",21703,"","N","C","C","A","A","H","-","A","A","0","","","","","Y","","A","01",0,0,0,0,0,0,0,0,0,"-","-",33.96096,117.57573
"82122449",2023,20230721,3300,20230417,0721,"2812","012F","1","5","4","3344","0","0","0","","0","064","ARCHIBALD AV","WIND RIVER RD","0.00","","Y","B","-","N","",,,"","",,"","","","Y","4",0,1,2,"A","-","01",23152,"A","N","A","C","A","A","H","-","A","A","0","","","","","Y","Y","A","01",0,0,1,0,0,0,0,0,0,"-","-",33.93720,117.59315
"82122450",2023,20230721,3300,20230530,0832,"N6635","011K","2","5","4","3344","0","0","0","","0","061","68TH ST","WELLS SPRINGS ST","0.00","","Y","A","-","N","",,,"","",,"","","","N","3",0,1,2,"A","-","10",21950,"C","N","G","B","B","A","H","-","A","A","0","Y","","","","Y","","A","07",0,1,0,0,1,0,0,0,0,"-","-",33.96443,117.55305
"82122451",2023,20230721,3300,20230706,1850,"3319","011K","4","5","4","3344","0","0","0","","0","061","68TH ST","WELLS SPRINGS ST","0.00","","Y","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","12",22450,"A","M","D","C","A","A","H","-","A","A","0","","","","","Y","","A","01",0,0,0,0,0,0,0,0,0,"-","-",33.96403,117.55255
"82122452",2023,20230721,3300,20230707,1106,"3831","012G","5","5","4","3344","0","0","0","","0","064","SCHLEISMAN RD","ARCHIBALD AV","0.00","","Y","A","-","N","",,,"","",,"","","","Y","0",0,0,1,"C","-","18",,"","N","F","A","A","A","H","-","A","A","0","","","","Y","Y","","-","- ",0,0,0,0,0,0,0,0,0,"-","-",33.96081,117.59322
"82122453",2023,20230721,3300,20230708,1443,"N5791","011C","6","5","4","3344","0","0","0","","0","061","LIMONITE AV","HAMNER AV","295.00","E","N","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","03",22350,"","M","C","C","A","A","H","-","A","A","0","","","","","Y","","A","07",0,0,0,0,0,0,0,0,0,"-","-",33.97534,117.55835
"82122454",2023,20230721,3300,20230710,0725,"2812","011F","1","5","4","3344","0","0","0","","0","062","68TH ST","ANGUS ST","0.00","","Y","A","-","N","",,,"","",,"","","","Y","3",0,1,2,"A","-","09",21801,"A","N","D","C","A","A","H","-","A","D","0","","","","","Y","","A","07",0,1,0,0,0,0,0,0,0,"-","-",33.96442,117.56298
So it's probably:
"82122451",2023,20230721,3300,20230706,1850,"3319","011K","4","5","4","3344","0","0","0","","0","061","68TH ST","WELLS SPRINGS ST","0.00","","Y","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","12",22450,"A","M","D","C","A","A","H","-","A","A","0","","","","","Y","","A","01",0,0,0,0,0,0,0,0,0,"-","-",33.96403,117.55255
I'll try isolating that one.
It breaks even when unzipped.
Found it!
"82122456",2023,20230721,3300,20230717,0550,"2812","011A","1","5","4","3344","0","0","0","","0","060","GOODMAN WAY","CANTÙ GALLEANO RANCH RD","60.00","S","N","A","-","N","",,,"","",,"","","","N","0",0,0,2,"A","-","21",22106,"","N","C","C","A","A","H","-","B","A","0","","","","Y","Y","","F","27",0,0,0,0,0,0,0,0,0,"-","-",,
Specifically, CANTÙ.
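For reference, Ù is byte 0xD9 in Latin-1 (and cp1252), and that byte can never stand alone in UTF-8, which is exactly the "invalid continuation byte" error in the traceback. A minimal repro, using the field text from the row above:

```python
raw = b"CANT\xd9 GALLEANO RANCH RD"

# 0xD9 is a two-byte lead byte in UTF-8, but the following space is
# not a valid continuation byte, so strict decoding fails.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # invalid continuation byte

# The same byte is "Ù" in Latin-1/cp1252, which suggests the file was
# produced with one of those encodings rather than UTF-8.
print(raw.decode("latin-1"))  # CANTÙ GALLEANO RANCH RD
```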
Here is a CollisionRecords.txt that reproduces the error with only a single line: CollisionRecords_error_only.txt
Nice find! When I was looking at it I was thrown off track. I think the Python interpreter must read ahead in larger chunks, so the parse error was occurring on a line we hadn't seen yet.
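That read-ahead behavior is easy to demonstrate: the text layer decodes a whole buffered chunk at a time, so with strict error handling even clean lines sitting in the same chunk as the bad byte are never yielded. A small sketch (the sample bytes are illustrative, not the real file):

```python
import io

# Three clean lines followed by a stray Latin-1 byte, all small
# enough to land in a single read-ahead chunk.
raw = b"good line 1\ngood line 2\ngood line 3\nCANT\xd9\n"

seen = []
try:
    for line in io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8"):
        seen.append(line)
except UnicodeDecodeError:
    pass

# The whole chunk fails to decode at once, so none of the clean
# lines before the bad byte were ever seen.
print(seen)  # []
```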
I rebased my branch to your master and force-pushed. CI checks are running.
Thanks again @gonewest818!
Somewhere in the past few years, at least one row in CollisionRecords.txt contains a character that cannot be decoded with the default utf-8 codec. As a result, the process fails.

This PR adds a command line flag, `-p`, that lets the user specify the `errors` argument passed to `open()` and `gzip.open()`. The default is None, which results in "strict" error handling just as before. While technically there are other handlers available, these seemed sufficient for now. I successfully imported a SWITRS dataset obtained yesterday with this patch.
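A minimal sketch of how such a flag can be plumbed through (the `open_records` helper and the `.gz` check are illustrative, not the PR's actual code); both `open()` and `gzip.open()` in text mode accept the same `errors` values:

```python
import gzip


def open_records(path, decode_errors=None):
    """Open a (possibly gzipped) record file as text.

    decode_errors mirrors the `errors` argument of open():
    None keeps the default "strict" behavior, while "replace"
    or "ignore" let rows containing bad bytes through.
    """
    if path.endswith(".gz"):
        return gzip.open(path, mode="rt", errors=decode_errors)
    return open(path, mode="rt", errors=decode_errors)
```

With `errors="replace"`, the stray 0xD9 byte comes through as the U+FFFD replacement character instead of aborting the whole import.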
Note: this PR depends on #13.