fgregg / chicago-historical-addresses

Digitizing crosswalks of historical Chicago addresses

exceptional cases #5

Open tewhalen opened 2 years ago

tewhalen commented 2 years ago

There are known types of exceptions to the simple "mere renumbering" pattern (on Street A, the number of a building changes from X to Y) that makes up the vast majority of the document.

gneidhardt commented 2 years ago

I can ask my coworker, who has extensive experience with building history. I also would like to clarify the ditto marks - like Leyla, I assume they mean "same as above", but I'd like to make sure. Any answers we come up with I'd be happy to add to the pdf in the meantime (I'm so mad that it doesn't have a legend or explanatory text).

tewhalen commented 2 years ago
aoifefahey commented 2 years ago

I wonder if part of the reason for the lack of consistency was the manner in which this was typeset. I imagine they probably had several different people working on the typesetting, and each had their own way of indicating things such as addresses that were combined or split. Even if it was a single typesetter, their approach may have changed over time. Either way, I wouldn't necessarily assume consistency.

I do kind of wonder if the answer to a lot of these questions can be found by tracking down all the street names that have been changed. For instance, Avenue A became State Line Road, and there are a number of other roads that have changed names several times; understanding where the roads actually are could provide important context for answering these questions.

Another resource that might be helpful is historic maps with address info. I can't remember if there is a way to look up old 80-acre maps, but the contemporary ones tend to have a lot of information on them. The Sanborn fire maps have addresses on them, and the old block numbers are often denoted on old maps, such as Blanchard's Map (1906).

Regardless, it might help if we knew how many edge cases we were dealing with. It might be worth it to simply flag them and deal with them manually.

gneidhardt commented 2 years ago

@aoifefahey @tewhalen when I brought up some of these with my colleague (who does a ton of house history interpretation for folks) she had much the same to say - that often these weren't consistent and it was much more an art than science. She also said she frequently double checks in a Sanborn map. Dennis McClendon has a great site that links to all known (public) digitized copies, and I'm excited to report that we have a few additional digitized ones at Chicago History that we hope will soon be public (they were proprietary digitizations up til now for...reasons? I think perhaps because they were digitized through Gale, but we're working on getting them publicly accessible). Regardless, if there are Sanborns missing from these resources, we have a fairly complete physical set and I'm happy to look whatever we need up in them, or we can always get a list together and have a Sanborn-a-thon at some point. Some are offsite at the moment, but should be back early 2022.

aoifefahey commented 2 years ago

@gneidhardt wrote:

often these weren't consistent and it was much more an art than science

@tewhalen wrote:

  • The old address contains a fraction.

Because of all of the issues at hand, I think our best bet is to try to record the text from the book as accurately as possible, and then process it into whatever format we'd like afterwards. That lets us divide the project into two parts: 1) getting the data out of the book, and 2) manipulating the data into whichever format we find most useful.

Part 1 is much more straightforward than part 2 and has a clearer end goal (i.e., we won't get bogged down arguing about which exact lot we're talking about for some of the vaguer descriptions, or about where streets migrated over time as the street grid changed, etc.).

That said, one thing we could try in order to solve some of these problems in a more automated fashion is to associate addresses by a second method and then cross-reference, such as matching Sanborn maps (or the 80-acre maps) against current maps/GIS information and then using machine vision to record addresses for each lot.

One question I do have: is the quality of the scan part of the problem? I.e., how much time would it take to do a contemporary scan of the 1909 street renumbering plan with better lighting, higher resolution, and a better camera, and would it be worth it?

tewhalen commented 2 years ago


One does not simply "get the data out of the book"

I was mainly collecting these examples as an aid to developing a data model, either for hand-correcting "raw" data or for eventually storing the data in a useful way. For instance, we may decide the best way to capture the "raw" data is in three columns, but there's no obvious way to fit rows like these into that format:

STREET          NEW_ADDRESS OLD_ADDRESS
------          ----------- -----------
Berwyn Avenue   1349 1359   } 1003
Berwyn Avenue   1415        975
Berwyn Avenue   1417        "
Berwyn Avenue   1062        1308
Berwyn Avenue   1064        1306
Berwyn Avenue   1066        1302
Berwyn Avenue   1338        2389 Wayne av
Berwyn Avenue   1432        958-60

Unless we provide clear instructions on how to transcribe them, all of the exceptions will be completely free-form and will need additional manual intervention later. Indeed, the whole thing would need to be converted, by hand, into some other format, and it won't necessarily be obvious which rows need that intervention, or how to convert them to the new format, without referring back to the original book.
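To make that concrete, here is a minimal sketch (in Python, with made-up field names) of the kind of richer row model I mean, using the Berwyn Avenue rows above. The point is just that the exceptions need somewhere explicit to live rather than being squeezed into three free-form columns:

# Hypothetical row model; the field names are illustrative, not settled.
from dataclasses import dataclass

@dataclass
class RawEntry:
    street: str              # e.g. "Berwyn Avenue"
    new_text: str            # verbatim "new" column text, e.g. "1349 1359"
    old_text: str            # verbatim "old" column text, e.g. "} 1003"
    exception: bool = False  # True when the row isn't a plain "X becomes Y"
    note: str = ""           # free-form explanation of the exception

entries = [
    RawEntry("Berwyn Avenue", "1349 1359", "} 1003", exception=True,
             note="two new numbers share one old number"),
    RawEntry("Berwyn Avenue", "1415", "975"),
    RawEntry("Berwyn Avenue", "1417", '"', exception=True,
             note="ditto mark for the old number"),
    RawEntry("Berwyn Avenue", "1338", "2389 Wayne av", exception=True,
             note="old address was on a different street"),
]

Whether the notes stay free-form or become a controlled set of exception codes is exactly the kind of thing the transcription instructions would need to pin down.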

aoifefahey commented 2 years ago

Unless we transcribe the book letter for letter, 100% accurately, people are still going to need to go back and look at it, because it is the authoritative source for all of these records. I think an ELT process is more suitable here than an ETL process because it separates the work into two parts:

1.) accurately recording the information contained in the book
2.) taking that information and putting it in a format that people can use to answer questions

Each of these problems is difficult in its own right, but if we handle part 1 well we will at the very least remove the need to check the book manually.

For instance, given the following data (shown in the attached image):

it might make sense to record it as follows:

54
West 43d Street
CONTINUED
Odd Nos. | Even Nos. 
New Old | New Old
1929 2035 | 738 838
1931 " | 740 840
[deliberately omitted]
2633 2733 | 1744 cor
2645 2745 | 1746 Wood

Given that 95% of the addresses seem to be standardized, it might make sense to include a flag indicating whether a particular row appears to match the standard data format, and to do some processing on that row.

For instance, we might have a field list like this:

Page, Column, Row, Flag, Confidence, Field1, Field2, ..., FieldN

with the flags indicating the following:

Flag "s" (standard address change row): Field1 = new odd, Field2 = old odd, Field3 = new even, Field4 = old even
Flag "e" (standard change for even numbers only; can be combined with another flag to indicate what is in the other fields, blank, etc.): Field1 = new even, Field2 = old even
Flag "o" (standard change for odd numbers only; can be combined with another flag to indicate what is in the other fields, blank, etc.): Field1 = new odd, Field2 = old odd
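As a rough sketch of the processing I mean (Python; the parsing rules, the flag-to-field mapping, and the file name are assumptions, not a spec):

import csv  # used in the usage sketch at the bottom

# Assumed mapping from flag to the meaning of Field1..Field4.
FLAG_FIELDS = {
    "s": ("new_odd", "old_odd", "new_even", "old_even"),
    "o": ("new_odd", "old_odd"),
    "e": ("new_even", "old_even"),
}

def parse_row(row):
    """Turn one transcribed spreadsheet row into a dict keyed by meaning.

    Rows with an unrecognized flag are passed through untouched so they
    can be dealt with manually later.
    """
    flag = row["Flag"]
    fields = [row["Field1"], row["Field2"], row["Field3"], row["Field4"]]
    if flag not in FLAG_FIELDS:
        return {"page": row["Page"], "flag": flag, "raw_fields": fields}
    # zip drops the trailing blank fields for the "e" and "o" flags
    parsed = dict(zip(FLAG_FIELDS[flag], fields))
    parsed.update(page=row["Page"], column=row["Column"], row=row["Row"])
    return parsed

# Hypothetical usage against a CSV export of one transcribed page:
# with open("page_054.csv", newline="") as f:
#     for row in csv.DictReader(f):
#         print(parse_row(row))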

On the bright side, there are only 170-some-odd pages, so even though there appear to be different rules for how things were recorded on each page, there aren't that many variations. I do kind of wonder if we might be able to infer which typesetter did which page and group them. For instance, page 54 has the abbreviation "cor" all over the place, as well as "cr", which I assume means the same thing, and I believe that's how the typesetter indicated that the street name changed.

tewhalen commented 2 years ago

I like the idea of manually flagging whether a row seems to be non-standard, but I still question whether there's any utility in attempting to make an "accurate" ASCII representation of the layout of the book. It seems to me that if we're going to have a person manually check and hand-correct every page, we should have them enter the proper number in place of any ditto marks, for instance; I don't see any value in doing that in a second pass.

aoifefahey commented 2 years ago

The idea would be to have people do as little thinking as possible during the transcription process. This would hopefully result in a transcribed document that is a character-for-character recreation of the authoritative source, meaning that any questions about how to handle the data can then be answered solely by looking at the transcribed document, with no reason to doubt the transcription.

I'm not sure it's necessarily a desirable goal for a hobbyist work, but it's typically how the places I've worked handle data entry.

I think transcribing the ditto marks makes sense though, because they are effectively symlinks to another number, and by transcribing the ditto marks we preserve that symlink. It also means we eliminate the possibility of someone accidentally typing in the wrong number while translating the symlink/ditto mark to another number.
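For what it's worth, resolving the dittos later is a trivial second pass. A sketch in Python, assuming the old-number column comes in as a simple list of transcribed strings (the function name is made up):

def resolve_dittos(column):
    """Replace ditto marks with the most recent literal value above them.

    The transcription keeps the ditto mark exactly as printed; this pass
    produces a derived column with the marks resolved.
    """
    resolved = []
    last_literal = None
    for value in column:
        if value.strip() == '"':   # ditto mark: same as the entry above
            resolved.append(last_literal)
        else:
            last_literal = value
            resolved.append(value)
    return resolved

# e.g. resolve_dittos(["975", '"', "958-60"]) returns ["975", "975", "958-60"]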

I may be overly neurotic about this; I'm used to working in a field where all manually recorded information was double-entered by different people because errors were completely unacceptable. However, I do think it is important to consider whether we want this to be a copy of the authoritative record or simply a derived record that requires going back to the original scan whenever there are questions about accuracy.

tewhalen commented 2 years ago

I think that overstates how authoritative the source in this case is.

[screenshot of the Page 5 column discussed below]

Here's an issue I found just now on Page 5 - see how 2539 is in this column twice: once out of order and paired with 1296, and again a few rows later paired with a ditto mark. It seems to me that the first 2539 should most likely be 2529, and that the typesetters made an error. If we found this or any similar error at any point after manual correction of OCR, we'd have to go all the way back to the original scans to figure out what to do, since we could never be 100% certain it wasn't just an OCR error that got missed. Or we'd be instructing whoever is manually correcting the OCR to enter the erroneous number as is, in order to create an authoritative record. I don't know why that would be valuable; I can't imagine any end users of the data being interested in the original typographical errors.
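Slips like this could probably be surfaced automatically with a simple consistency check over each column. A rough sketch in Python (the function name, and the assumption that numbers run in increasing page order, are mine):

def flag_suspect_values(column):
    """Flag duplicate or out-of-sequence values in a column of house numbers.

    Expects the column in page order; entries that aren't purely numeric
    (ditto marks, "cor Wood", ranges like "958-60") are skipped.
    """
    suspects = []
    seen = set()
    previous = None
    for i, value in enumerate(column):
        if not value.isdigit():
            continue
        number = int(value)
        if value in seen:
            suspects.append((i, value, "duplicate"))
        if previous is not None and number < previous:
            suspects.append((i, value, "out of sequence"))
        seen.add(value)
        previous = number
    return suspects

# A column where 2539 appears twice, the first time out of place, would come
# back with an "out of sequence" flag near the misplaced value and a
# "duplicate" flag at the repeat.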

aoifefahey commented 2 years ago

If I'm dealing with flat files I often create a field that tracks corrections made to the text (because errors aren't at all uncommon in any text), but I simply prefer ELT workflows. My concern is always that correcting errors as part of the transcription introduces as many errors as it addresses.

I like the idea of having an authoritative copy that we then modify so we can track the changes and see when/why/how they were changed, since that would not only allow us to understand the context of the changes, but also to roll any incorrect ones back as our understanding changes.
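As a loose sketch of what that tracking could look like: a flat correction log kept alongside the literal transcription. The column names, file name, and row position here are invented for illustration; the 2539/2529 case is the one from the Page 5 screenshot above.

import csv
import os
from datetime import date

LOG_PATH = "corrections.csv"   # hypothetical log kept next to the transcription

correction = {
    "page": 5,
    "column": "odd",            # illustrative
    "row": 37,                  # illustrative
    "transcribed": "2539",      # what the book actually prints
    "corrected": "2529",        # what we believe it should be
    "reason": "out of sequence; 2539 reappears a few rows later",
    "corrected_by": "initials of whoever made the call",
    "corrected_on": date.today().isoformat(),
}

new_file = not os.path.exists(LOG_PATH)
with open(LOG_PATH, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(correction))
    if new_file:
        writer.writeheader()
    writer.writerow(correction)

That keeps the literal transcription untouched while still recording when, why, and by whom a value was changed, which is what makes rollback possible.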

Since we are all volunteers, though, I'm more than happy to go with whatever method we think would be easiest to work with. I do think one of the advantages of a literal copy is that literally anyone can contribute, because they don't need to understand any context beyond "type what you see" (which also opens the door to crowdsourcing, Mechanical Turk/reCAPTCHA style, to transcribe addresses that OCR has flagged as unreadable or nonstandard).

fgregg commented 2 years ago

I see a lot of benefits in doing this in stages as @aoifefahey proposes.

A key part of making this workable is having a data model that is rich enough to capture what is relevant, and having clear instructions that avoid the free-form problems @tewhalen identifies as risks.

I think the next step is to take a few pages of the OCR output and figure out a spreadsheet model that we can agree to.

fgregg commented 2 years ago

opening up a new issue for the spreadsheet model (#8)