Open fgregg opened 3 years ago
having poked around on the OCR, here are some of the key issues/regularities i've noted:
all of this creates problems for efficient/effective OCR. The physical layout of the data on the page is super significant, and none of the OCR systems I've looked at does a great job of preserving it. I ended up breaking down the structure of the page as best I could using image analysis and applying OCR only to what I was able to identify as an "address mapping row" or a "street name", rather than trying to OCR the entire page and then extracting the structure from that. I do try to retain the bounding boxes of everything, whether successfully processed or rejected, so that any result can be traced back to the image if necessary.
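To make the "segment first, OCR second" idea concrete, here is a minimal sketch, not the code in the repo. It assumes the page has already been binarized into rows of 0/1 values, and it uses a simple horizontal ink profile as a stand-in for whatever image analysis is actually used. `Region`, `find_row_bands`, `segment_page`, and the `min_height` threshold are all hypothetical names chosen for illustration. The point is only that every band keeps its bounding box, including the rejected ones.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    """One candidate region on the page (hypothetical record layout)."""
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom), bottom exclusive
    kind: str                        # "mapping_row" or "rejected" in this sketch
    text: Optional[str] = None       # would be filled in only if OCR is run


def find_row_bands(page: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) ranges of consecutive pixel rows that contain ink.

    `page` is a binarized image as rows of 0/1 values (1 = ink).
    """
    bands = []
    start = None
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                  # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))   # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(page)))
    return bands


def segment_page(page: List[List[int]], min_height: int = 2) -> List[Region]:
    """Classify each ink band, keeping the bounding box even when rejected."""
    width = len(page[0]) if page else 0
    regions = []
    for top, bottom in find_row_bands(page):
        # Assumption: bands shorter than min_height are specks, not text rows.
        kind = "mapping_row" if bottom - top >= min_height else "rejected"
        regions.append(Region(bbox=(0, top, width, bottom), kind=kind))
    return regions
```

In a real pipeline, only the regions tagged `mapping_row` would be cropped and handed to the OCR engine, while the rejected ones stay in the output so they can be checked against the scan later.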
all that said, i haven't tried to estimate how many actual address mappings are in the entire document and compare that number to what I've been able to extract. but successful OCR of a mapping isn't the only measure of quality; ease of checking/correcting the transcription is probably just as important.
Adding in some additional issues I can see with the original PDF:
@tewhalen what OCR are you using for this? I am curious to see how it is working.
The code is in github.com/tewhalen/1909 - it's a huge mess, with lots of debugging and commented-out false starts, sorry. it uses tesseract for OCR. I toyed with trying to train some ML-based OCR system to do a better job reading the type, but I wasn't quite able to figure that out. The basic strategy I landed on was to:
- occasionally an old address number is just ", which I think means it's the same address as above.
this is almost certainly the case. I believe it usually occurs when a single old address is split into multiple new addresses (it may also happen when several old addresses are combined into a single new address, but I don't recall any examples of that).
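If that reading of the ditto mark holds, filling it in during post-processing is straightforward. The following is a rough sketch under that assumption: `resolve_dittos` and `DITTO_MARKS` are hypothetical names, rows are assumed to arrive as `(old, new)` string pairs in page order, and the set of characters treated as ditto marks is a guess at what OCR might emit.

```python
from typing import List, Tuple

# Characters that might come out of OCR for a ditto mark (an assumption).
DITTO_MARKS = {'"', "''", "\u201d", "\u3003"}


def resolve_dittos(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Replace a ditto mark in the old-address column with the value above it."""
    resolved = []
    previous_old = None
    for old, new in rows:
        if old.strip() in DITTO_MARKS:
            if previous_old is None:
                # A ditto with nothing above it cannot be resolved.
                raise ValueError("ditto mark on the first row")
            old = previous_old
        else:
            previous_old = old
        resolved.append((old, new))
    return resolved


rows = [("120", "1515"), ('"', "1517"), ('"', "1519"), ("122", "1521")]
print(resolve_dittos(rows))
```

Because a ditto only makes sense relative to the row above it, this has to run within a single street's block of rows, in page order, before any sorting or deduplication.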
do we have OCR that is good enough that it's more efficient to check the output than to do a fresh transcription?