fgregg / chicago-historical-addresses

Digitizing crosswalks of historical Chicago addresses

Work out spreadsheet model for transcribing data / correcting OCR #8

Open fgregg opened 2 years ago

fgregg commented 2 years ago

Starting with some example pages of @tewhalen's OCR, let's create a spreadsheet model that can capture the complexity of representations (see #5).

@tewhalen, could you post your address sheets for a few pages that cover some interesting variations, along with the images of the relevant pages?

If I understand correctly, you previously did some processing to remove addresses that seemed wrong. For this exercise, it would be ideal if you could disable that step so that we have an approximately 1-1 correspondence between rows extracted and rows in the images.

tewhalen commented 2 years ago

Let's get into it!

The OCR performs its own segmentation, and it's a little unpredictable. Here's a chunk of a page, and here's how the OCR chose to break it up.

column-1 debug

(I've added some debugging info if you zoom way in, showing the text that the OCR identified and its confidence level. The color indicates how confident it is.)

The actual raw data looks like this:

""  ,"level" ,"page_num" ,"block_num" ,"par_num" ,"line_num" ,"word_num" ,"left" ,"top" ,"width" ,"height" ,"conf" ,"text"         ,"right" ,"bot"
301 ,5       ,1          ,1           ,2         ,59         ,1          ,97     ,3122  ,64      ,52       ,92     ,"E."           ,161     ,3174
302 ,5       ,1          ,1           ,2         ,59         ,2          ,215    ,3118  ,384     ,56       ,68     ,"Ravenswood"   ,599     ,3174
303 ,5       ,1          ,1           ,2         ,59         ,3          ,647    ,3119  ,104     ,54       ,62     ,"Pk."          ,751     ,3173
305 ,5       ,1          ,1           ,2         ,60         ,1          ,90     ,3201  ,108     ,39       ,89     ,"3201"         ,198     ,3240
306 ,5       ,1          ,1           ,2         ,60         ,2          ,282    ,3201  ,78      ,39       ,62     ,"382"          ,360     ,3240
308 ,5       ,1          ,1           ,2         ,61         ,1          ,90     ,3247  ,108     ,39       ,93     ,"3203"         ,198     ,3286
309 ,5       ,1          ,1           ,2         ,61         ,2          ,282    ,3247  ,78      ,39       ,36     ,"3861"         ,360     ,3286
311 ,5       ,1          ,1           ,2         ,62         ,1          ,90     ,3294  ,110     ,40       ,91     ,"3215"         ,200     ,3334
312 ,5       ,1          ,1           ,2         ,62         ,2          ,282    ,3294  ,78      ,39       ,91     ,"396"          ,360     ,3333
314 ,5       ,1          ,1           ,2         ,63         ,1          ,91     ,3341  ,108     ,39       ,86     ,"3217"         ,199     ,3380
315 ,5       ,1          ,1           ,2         ,63         ,2          ,283    ,3340  ,82      ,40       ,41     ,"398"          ,365     ,3380
317 ,5       ,1          ,1           ,2         ,64         ,1          ,90     ,3387  ,107     ,39       ,85     ,"3221"         ,197     ,3426
318 ,5       ,1          ,1           ,2         ,64         ,2          ,279    ,3387  ,86      ,39       ,93     ,"400"          ,365     ,3426
320 ,5       ,1          ,1           ,2         ,65         ,1          ,92     ,3434  ,106     ,38       ,84     ,"3223"         ,198     ,3472
321 ,5       ,1          ,1           ,2         ,65         ,2          ,282    ,3433  ,80      ,39       ,78     ,"404"          ,362     ,3472
323 ,5       ,1          ,1           ,2         ,66         ,1          ,88     ,3481  ,110     ,39       ,11     ,"3255"         ,198     ,3520
324 ,5       ,1          ,1           ,2         ,66         ,2          ,278    ,3479  ,82      ,40       ,92     ,"406"          ,360     ,3519
325 ,5       ,1          ,1           ,2         ,66         ,3          ,371    ,3460  ,70      ,69       ,92     ,"|"            ,441     ,3529
326 ,5       ,1          ,1           ,2         ,66         ,4          ,478    ,3477  ,254     ,44       ,22     ,"0ddCo0nt"     ,732     ,3521
328 ,5       ,1          ,1           ,2         ,67         ,1          ,88     ,3528  ,108     ,38       ,89     ,"3311"         ,196     ,3566
329 ,5       ,1          ,1           ,2         ,67         ,2          ,278    ,3526  ,84      ,40       ,44     ,"558"          ,362     ,3566
331 ,5       ,1          ,1           ,2         ,68         ,1          ,91     ,3575  ,112     ,39       ,92     ,"3345"         ,203     ,3614
332 ,5       ,1          ,1           ,2         ,68         ,2          ,277    ,3574  ,86      ,39       ,92     ,"492"          ,363     ,3613
334 ,5       ,1          ,1           ,2         ,69         ,1          ,83     ,3619  ,348     ,49       ,38     ,"3445606Nwpt|" ,431     ,3668
335 ,5       ,1          ,1           ,2         ,69         ,2          ,475    ,3601  ,90      ,110      ,0      ,"4797"         ,565     ,3711
336 ,5       ,1          ,1           ,2         ,69         ,3          ,637    ,3619  ,110     ,41       ,93     ,"1416"         ,747     ,3660
338 ,5       ,1          ,1           ,2         ,70         ,1          ,534    ,3677  ,26      ,26       ,49     ,"o"            ,560     ,3703
340 ,5       ,1          ,1           ,2         ,71         ,1          ,88     ,3713  ,112     ,39       ,92     ,"3513"         ,200     ,3752
341 ,5       ,1          ,1           ,2         ,71         ,2          ,276    ,3712  ,84      ,41       ,90     ,"588"          ,360     ,3753
342 ,5       ,1          ,1           ,2         ,71         ,3          ,378    ,3694  ,69      ,68       ,92     ,"|"            ,447     ,3762
343 ,5       ,1          ,1           ,2         ,71         ,4          ,474    ,3713  ,110     ,40       ,57     ,"4757"         ,584     ,3753
344 ,5       ,1          ,1           ,2         ,71         ,5          ,690    ,3713  ,26      ,18       ,90     ,""""           ,716     ,3731
346 ,5       ,1          ,1           ,2         ,72         ,1          ,88     ,3760  ,110     ,38       ,91     ,"3519"         ,198     ,3798
347 ,5       ,1          ,1           ,2         ,72         ,2          ,280    ,3758  ,80      ,41       ,90     ,"592"          ,360     ,3799
348 ,5       ,1          ,1           ,2         ,72         ,3          ,378    ,3740  ,58      ,68       ,92     ,"|"            ,436     ,3808
349 ,5       ,1          ,1           ,2         ,72         ,4          ,476    ,3757  ,107     ,42       ,80     ,"4601"         ,583     ,3799
350 ,5       ,1          ,1           ,2         ,72         ,5          ,640    ,3757  ,108     ,41       ,92     ,"1450"         ,748     ,3798
352 ,5       ,1          ,1           ,2         ,73         ,1          ,91     ,3808  ,106     ,38       ,83     ,"3523"         ,197     ,3846
353 ,5       ,1          ,1           ,2         ,73         ,2          ,281    ,3806  ,78      ,39       ,92     ,"596"          ,359     ,3845
354 ,5       ,1          ,1           ,2         ,73         ,3          ,370    ,3787  ,69      ,68       ,89     ,"|"            ,439     ,3855
355 ,5       ,1          ,1           ,2         ,73         ,4          ,479    ,3804  ,106     ,41       ,29     ,"41805"        ,585     ,3845
356 ,5       ,1          ,1           ,2         ,73         ,5          ,641    ,3805  ,106     ,39       ,31     ,"145½"         ,747     ,3844
358 ,5       ,1          ,1           ,2         ,74         ,1          ,89     ,3852  ,108     ,40       ,88     ,"3529"         ,197     ,3892
359 ,5       ,1          ,1           ,2         ,74         ,2          ,281    ,3852  ,80      ,39       ,91     ,"608"          ,361     ,3891
360 ,5       ,1          ,1           ,2         ,74         ,3          ,374    ,3832  ,68      ,68       ,91     ,"|"            ,442     ,3900
361 ,5       ,1          ,1           ,2         ,74         ,4          ,477    ,3851  ,106     ,40       ,92     ,"4809"         ,583     ,3891
362 ,5       ,1          ,1           ,2         ,74         ,5          ,641    ,3851  ,108     ,39       ,92     ,"1458"         ,749     ,3890
364 ,5       ,1          ,1           ,2         ,75         ,1          ,90     ,3899  ,106     ,39       ,75     ,"3615"         ,196     ,3938
365 ,5       ,1          ,1           ,2         ,75         ,2          ,280    ,3898  ,78      ,39       ,92     ,"656"          ,358     ,3937
366 ,5       ,1          ,1           ,2         ,75         ,3          ,369    ,3878  ,69      ,68       ,91     ,"|"            ,438     ,3946
367 ,5       ,1          ,1           ,2         ,75         ,4          ,476    ,3897  ,106     ,40       ,72     ,"4823"         ,582     ,3937
368 ,5       ,1          ,1           ,2         ,75         ,5          ,639    ,3897  ,109     ,40       ,80     ,"1470"         ,748     ,3937
370 ,5       ,1          ,1           ,2         ,76         ,1          ,90     ,3946  ,106     ,38       ,92     ,"3619"         ,196     ,3984
371 ,5       ,1          ,1           ,2         ,76         ,2          ,280    ,3946  ,82      ,39       ,86     ,"658"          ,362     ,3985
372 ,5       ,1          ,1           ,2         ,76         ,3          ,367    ,3925  ,70      ,69       ,90     ,"|"            ,437     ,3994
373 ,5       ,1          ,1           ,2         ,76         ,4          ,476    ,3943  ,106     ,40       ,27     ,"4833"         ,582     ,3983
374 ,5       ,1          ,1           ,2         ,76         ,5          ,640    ,3943  ,106     ,39       ,40     ,"1462"         ,746     ,3982
376 ,5       ,1          ,1           ,2         ,77         ,1          ,88     ,3992  ,107     ,38       ,90     ,"3621"         ,195     ,4030
377 ,5       ,1          ,1           ,2         ,77         ,2          ,280    ,3992  ,78      ,39       ,90     ,"662"          ,358     ,4031
378 ,5       ,1          ,1           ,2         ,77         ,3          ,373    ,3972  ,68      ,68       ,92     ,"|"            ,441     ,4040
379 ,5       ,1          ,1           ,2         ,77         ,4          ,476    ,3991  ,108     ,40       ,89     ,"4837"         ,584     ,4031
380 ,5       ,1          ,1           ,2         ,77         ,5          ,639    ,3991  ,107     ,39       ,90     ,"1486"         ,746     ,4030

The "line_num" that comes out of tesseract is generally pretty reliable for identifying rows, but note that it finds a few bad segments (including some of these vertical lines). So, for instance, it doesn't quite identify that lonely "to" as a row of its own, and sometimes it finds overlapping segments. The OCR also doesn't positively identify the "Old" and "New" columns; it only finds horizontal organization.
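Since the OCR only gives horizontal organization, one possible way to recover the columns is to bin each word by its `left` pixel position within a `line_num`. This is only a rough sketch: the boundary values are eyeballed from the sample above and would differ per page, and the names `COLUMN_BOUNDS`, `column_for` and `group_rows` are made up for illustration, not part of any existing script.

```python
from collections import defaultdict

# Assumed left-edge ranges (pixels) for the four columns, eyeballed
# from the sample above; they would need tuning for each page.
COLUMN_BOUNDS = [
    (0, 240, "odd_new"),
    (240, 440, "odd_old"),
    (440, 620, "even_new"),
    (620, 10_000, "even_old"),
]


def column_for(left):
    """Return the column name whose pixel range contains `left`."""
    for lo, hi, name in COLUMN_BOUNDS:
        if lo <= left < hi:
            return name
    return None


def group_rows(words):
    """Group (line_num, left, text) tuples into one dict per line_num.

    Stray vertical bars are dropped; if two words land in the same
    column they are joined with a space rather than overwritten.
    """
    rows = defaultdict(dict)
    for line_num, left, text in words:
        if text.strip() == "|":
            continue
        col = column_for(left)
        if col is None:
            continue
        row = rows[line_num]
        row[col] = f"{row[col]} {text}" if col in row else text
    return dict(sorted(rows.items()))


# A few words copied from the raw data above: (line_num, left, text).
sample = [
    (68, 91, "3345"), (68, 277, "492"),
    (70, 534, "o"),
    (71, 88, "3513"), (71, 276, "588"), (71, 378, "|"),
    (71, 474, "4757"), (71, 690, '"'),
    (72, 88, "3519"), (72, 280, "592"), (72, 378, "|"),
    (72, 476, "4601"), (72, 640, "1450"),
]

rows = group_rows(sample)
for line_num, row in rows.items():
    print(line_num, row)
```

Merged tokens such as `3445606Nwpt|` would still land in a single column with this approach, so they would need to be split by hand or by a separate rule.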

tewhalen commented 2 years ago

debug

More than 90% of the time, the text just contains four simple columns of numbers, and that is where the OCR works best. It's mostly the weird stuff that causes it to go wrong and need hand work. Let's look closely at the middle part of this, where it's weird and needs correction (for clarity I'll leave out the pixel locations of the boxes):

    ,level ,page_num ,block_num ,par_num ,line_num ,word_num ,conf ,text
331 ,5     ,1        ,1         ,2       ,68       ,1        ,92   ,3345
332 ,5     ,1        ,1         ,2       ,68       ,2        ,92   ,492
334 ,5     ,1        ,1         ,2       ,69       ,1        ,38   ,3445606Nwpt|
335 ,5     ,1        ,1         ,2       ,69       ,2        ,0    ,4797
336 ,5     ,1        ,1         ,2       ,69       ,3        ,93   ,1416
338 ,5     ,1        ,1         ,2       ,70       ,1        ,49   ,o
340 ,5     ,1        ,1         ,2       ,71       ,1        ,92   ,3513
341 ,5     ,1        ,1         ,2       ,71       ,2        ,90   ,588
342 ,5     ,1        ,1         ,2       ,71       ,3        ,92   ,|
343 ,5     ,1        ,1         ,2       ,71       ,4        ,57   ,4757
344 ,5     ,1        ,1         ,2       ,71       ,5        ,90   ,""""
346 ,5     ,1        ,1         ,2       ,72       ,1        ,91   ,3519
347 ,5     ,1        ,1         ,2       ,72       ,2        ,90   ,592
348 ,5     ,1        ,1         ,2       ,72       ,3        ,92   ,|
349 ,5     ,1        ,1         ,2       ,72       ,4        ,80   ,4601
350 ,5     ,1        ,1         ,2       ,72       ,5        ,92   ,1450

We can break this up into typographical rows. I assume the sporadically-detected vertical bar is not interesting and can be thrown away.

row ,odd_new ,odd_old ,even_new ,even_old
68  ,3345    ,492     ,         ,
69  ,3445    ,606Nwpt ,4747     ,1416
70  ,        ,        ,to       ,
71  ,3513    ,588     ,4757     ,""""
72  ,3519    ,592     ,4801     ,1450

I would suggest that we at least attempt to flag address pairs that aren't "normal" and will need an additional step before they can be processed into a clear mapping.

row ,odd_new ,odd_old ,odd_flag ,even_new ,even_old ,even_flag
68  ,3345    ,492     ,False    ,         ,         ,False
69  ,3445    ,606Nwpt ,True     ,4747     ,1416     ,False
70  ,        ,        ,False    ,to       ,         ,True
71  ,3513    ,588     ,False    ,4757     ,""""     ,True
72  ,3519    ,592     ,False    ,4801     ,1450     ,False
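The flags in this table could be produced by a simple rule. As a sketch, assume a pair is "normal" when both cells are empty or both are plain runs of digits, and everything else is flagged. That rule and the helper names `is_plain_number` and `pair_flag` are my assumptions; the `""""` cell is written here as the single `"` character it decodes to.

```python
import re

DIGITS = re.compile(r"[0-9]+")


def is_plain_number(value):
    """True if `value` is a non-empty string of ASCII digits."""
    return DIGITS.fullmatch(value) is not None


def pair_flag(new, old):
    """Flag a New/Old pair unless both cells are empty or both are plain numbers."""
    new, old = new.strip(), old.strip()
    if not new and not old:
        return False
    return not (is_plain_number(new) and is_plain_number(old))


rows = [
    # (row, odd_new, odd_old, even_new, even_old), as in the table above
    (68, "3345", "492", "", ""),
    (69, "3445", "606Nwpt", "4747", "1416"),
    (70, "", "", "to", ""),
    (71, "3513", "588", "4757", '"'),
    (72, "3519", "592", "4801", "1450"),
]

flags = [
    (row, pair_flag(odd_new, odd_old), pair_flag(even_new, even_old))
    for row, odd_new, odd_old, even_new, even_old in rows
]
for entry in flags:
    print(entry)
```

A digits-only rule would also flag fractional addresses like the `145½` in the raw data, which may or may not be what we want.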

It should be noted that already at this point, the "odd" and "even"-ness of the columns is screwed up and can't be used to automatically detect errors. Perhaps all the odd-but-correct addresses in the even column need to be flagged somehow?
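One way to flag those could be a separate parity marker that records the mismatch without treating it as an error. A minimal sketch, assuming rows are dicts keyed by column name; the function name `parity_mismatches` and the choice of which columns to check are hypothetical.

```python
def parity_mismatches(rows, column, expect_odd):
    """Return (row, value) pairs whose numeric value has the unexpected parity.

    Cells that are empty or not plain digits are skipped, since they
    need hand review anyway.
    """
    found = []
    for row in rows:
        value = row.get(column, "").strip()
        if not (value.isascii() and value.isdigit()):
            continue
        is_odd = int(value) % 2 == 1
        if is_odd != expect_odd:
            found.append((row["row"], value))
    return found


rows = [
    {"row": 68, "odd_new": "3345", "even_new": ""},
    {"row": 69, "odd_new": "3445", "even_new": "4747"},
    {"row": 70, "odd_new": "", "even_new": "to"},
    {"row": 71, "odd_new": "3513", "even_new": "4757"},
    {"row": 72, "odd_new": "3519", "even_new": "4801"},
]

even_side = parity_mismatches(rows, "even_new", expect_odd=False)
odd_side = parity_mismatches(rows, "odd_new", expect_odd=True)
print(even_side)
print(odd_side)
```

On these five rows, every numeric entry in the even "New" column comes back as a mismatch, which is the situation described above: the marker is a prompt for a human to look, not a verdict.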

tewhalen commented 2 years ago

This is all to leave aside the question of the street names, any header text, and editorial comments like "Odd Cont", which don't cleanly fit into an Odd/Even New/Old columnar format:

    ,level ,page_num ,block_num ,par_num ,line_num ,word_num ,conf ,text
4   ,5     ,1        ,1         ,1       ,1        ,1        ,92   ,West
5   ,5     ,1        ,1         ,1       ,1        ,2        ,59   ,Randolph
6   ,5     ,1        ,1         ,1       ,1        ,3        ,81   ,St.
8   ,5     ,1        ,1         ,1       ,2        ,1        ,81   ,CONTINUED
10  ,5     ,1        ,1         ,1       ,3        ,1        ,31   ,Odd
11  ,5     ,1        ,1         ,1       ,3        ,2        ,31   ,Nos.
12  ,5     ,1        ,1         ,1       ,3        ,3        ,88   ,|Even
13  ,5     ,1        ,1         ,1       ,3        ,4        ,61   ,Nos.
15  ,5     ,1        ,1         ,1       ,4        ,1        ,92   ,New
16  ,5     ,1        ,1         ,1       ,4        ,2        ,92   ,Old
17  ,5     ,1        ,1         ,1       ,4        ,3        ,91   ,New
18  ,5     ,1        ,1         ,1       ,4        ,4        ,92   ,Old
323 ,5     ,1        ,1         ,2       ,66       ,1        ,11   ,3255
324 ,5     ,1        ,1         ,2       ,66       ,2        ,92   ,406
325 ,5     ,1        ,1         ,2       ,66       ,3        ,92   ,|
326 ,5     ,1        ,1         ,2       ,66       ,4        ,22   ,0ddCo0nt
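As a rough first pass, lines like these could be separated from ordinary address rows by checking whether a line contains any numeric tokens, with the leftover text kept as a note for hand review. This is only a sketch under that assumption; `split_line` and the "address"/"text" labels are invented for illustration.

```python
def split_line(tokens):
    """Split one OCR line into numeric tokens and everything else.

    Vertical bars are stripped from token edges, and tokens that were
    nothing but bars are dropped.
    """
    numbers, notes = [], []
    for tok in tokens:
        tok = tok.strip("|")
        if not tok:
            continue
        if tok.isascii() and tok.isdigit():
            numbers.append(tok)
        else:
            notes.append(tok)
    return numbers, notes


words = [
    # (par_num, line_num, text), copied from the dump above
    (1, 1, "West"), (1, 1, "Randolph"), (1, 1, "St."),
    (1, 2, "CONTINUED"),
    (1, 3, "Odd"), (1, 3, "Nos."), (1, 3, "|Even"), (1, 3, "Nos."),
    (1, 4, "New"), (1, 4, "Old"), (1, 4, "New"), (1, 4, "Old"),
    (2, 66, "3255"), (2, 66, "406"), (2, 66, "|"), (2, 66, "0ddCo0nt"),
]

# Collect the words of each line, keyed by (par_num, line_num).
lines = {}
for par_num, line_num, text in words:
    lines.setdefault((par_num, line_num), []).append(text)

classified = {key: split_line(tokens) for key, tokens in lines.items()}
for key, (numbers, notes) in classified.items():
    kind = "address" if numbers else "text"
    print(key, kind, numbers, notes)
```

A line such as 66 comes out as an address row with a leftover note (`0ddCo0nt`), so the spreadsheet model would probably need a free-text column for comments of that kind.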