datamade / chi-councilmatic

:eyes: keep tabs on Chicago city council
https://chicago.councilmatic.org/
MIT License

Chicago address regex misses these addresses #201

Open stevevance opened 7 years ago

stevevance commented 7 years ago

intersection

Ordinance: https://chicago.councilmatic.org/legislation/o-2017-2090/ Location: E 71st St and S Stony Island Ave

Interpretation: E 71st St and S St (this gets cut off at St in Stony), and thus it's geocoded wrong and the map shows the wrong location.

Regex pattern for addresses: (\S*[a-z]\S*\s){1,4}?

address without suffix/type

Ordinance: https://chicago.councilmatic.org/legislation/o-2017-2210/ Location: 6145-6149 N Broadway

Interpretation: No address found because it doesn't have a suffix/type ("St", "Ave", etc.)

If you change the address pattern to ((?<!-)\b\d{1,5}(-\d{1,5})?\s(\S*[a-z]\S*\s){1,4}(ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway)) it should capture both edge cases.

address with unlisted suffix/type

Zoning Reclassification Map No. 1-G at 1107 W Fulton Market - App No. 18139T1 - in this case, "Market" is the same as "Avenue", but the city doesn't abbreviate it.

Add "market" to the list of suffixes in the address pattern capture group.

Long addresses

Take this ordinance title: "Zoning Reclassification Map No. 16-E at 6311 S Calumet Ave, 6301-6335 S Calumet Ave, 343-365 E 63rd St and 6300-6334 S Dr. Martin Luther King Jr Dr".

The regex captures all but the last address (on King Drive) because it allows at most 4 words in the street name. The limit needs to be raised to 6 words so the match can reach as far as "Jr".

stevevance commented 7 years ago

This is what I'm using now:

stname_pattern = (\S*[a-z]\S*\s){1,6}
sttype_pattern = (ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway|market|o)

That "o" in there is to catch the street name "Avenue O". It's imperfect but it works for now. It won't catch the other Avenue [letter] street names, though.
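A minimal Python sketch of how these two fragments might be combined. It assumes they are joined to the house-number prefix from the earlier pattern and compiled case-insensitively. The names `NUMBER`, `ADDRESS_RE` and `find_address` are illustrative only and are not taken from the councilmatic code.

```python
import re

# Fragments quoted in the comment above.
stname_pattern = r"(\S*[a-z]\S*\s){1,6}"
sttype_pattern = (r"(ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way"
                  r"|broadway|market|o)")

# Assumption: the house-number prefix is the one from the earlier pattern.
NUMBER = r"(?<!-)\b\d{1,5}(-\d{1,5})?\s"

ADDRESS_RE = re.compile(NUMBER + stname_pattern + sttype_pattern, re.IGNORECASE)


def find_address(title):
    """Return the first address-like match in title, or None."""
    match = ADDRESS_RE.search(title)
    return match.group(0) if match else None


print(find_address("6145-6149 N Broadway"))   # 6145-6149 N Broadway
print(find_address("1107 W Fulton Market"))   # 1107 W Fulton Market
print(find_address("13333 S Avenue O"))       # 13333 S Avenue O
# Only "o" is listed, so other lettered avenues fall back to the "ave" suffix:
print(find_address("13333 S Avenue L"))       # 13333 S Ave
```

The last call shows the limitation mentioned above: the match is truncated rather than missed entirely, which would still geocode to the wrong place.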

fgregg commented 7 years ago

want to make a PR?

stevevance commented 7 years ago

Any regex should catch the following numbered addresses:

  1. "Sale of City-owned property at 1105-1111 E 95th St to 95th St Building LLC" -> "1105-1111 E 95th St". This is the toughest one in this group because the pattern matches "1105-1111 E 95th St to 95th St" instead.
  2. "Zoning Reclassification Map No. 9-F at 3817-3845 N Broadway and 731-735 W Sheridan Rd - App No 18505" -> (1) 3817-3845 N Broadway, (2) 731-735 W Sheridan Rd
  3. "Zoning Reclassification Map No. 9-K at 3911-3985 N Milwaukee Ave and 4671-4777 W Irving Park Rd - App No. 18266" -> (1) 3911-3985 N Milwaukee Ave, (2) 4671-4777 W Irving Park Rd
  4. "Zoning Reclassification Map No. 8-E at 3401-3453 S Dr. Martin L. King Dr. and 400-506 E 35th St - App No. 18604" -> (1) 3401-3453 S Dr. Martin L. King Dr. (this is another tough one), (2) 400-506 E 35th St
  5. "13333 S Avenue L"
  6. "Sale of City-owned property at 8906 S Lowe Ave to Alfred Wayne Daniels and Marcella Daniels under Adjacent Neighbors Land Acquisition Program" -> "8906 S Lowe Ave"
  7. "Sale of City-owned property at 437 N Monticello Ave to Terrance P. Klees" -> "437 N Monticello Ave", but the pattern captures "437 N Monticello Ave to Ter"
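The seven cases above could be kept as a small regression table, so that any candidate pattern can be checked against all of them at once. This is only a sketch: `CASES`, `extract` and `failing_cases` are hypothetical names, and it assumes the candidate is a compiled Python `re` pattern whose whole match is the address.

```python
import re

# (title, expected addresses) pairs transcribed from the list above.
CASES = [
    ("Sale of City-owned property at 1105-1111 E 95th St to 95th St Building LLC",
     ["1105-1111 E 95th St"]),
    ("Zoning Reclassification Map No. 9-F at 3817-3845 N Broadway and "
     "731-735 W Sheridan Rd - App No 18505",
     ["3817-3845 N Broadway", "731-735 W Sheridan Rd"]),
    ("Zoning Reclassification Map No. 9-K at 3911-3985 N Milwaukee Ave and "
     "4671-4777 W Irving Park Rd - App No. 18266",
     ["3911-3985 N Milwaukee Ave", "4671-4777 W Irving Park Rd"]),
    ("Zoning Reclassification Map No. 8-E at 3401-3453 S Dr. Martin L. King Dr. "
     "and 400-506 E 35th St - App No. 18604",
     ["3401-3453 S Dr. Martin L. King Dr.", "400-506 E 35th St"]),
    ("13333 S Avenue L", ["13333 S Avenue L"]),
    ("Sale of City-owned property at 8906 S Lowe Ave to Alfred Wayne Daniels and "
     "Marcella Daniels under Adjacent Neighbors Land Acquisition Program",
     ["8906 S Lowe Ave"]),
    ("Sale of City-owned property at 437 N Monticello Ave to Terrance P. Klees",
     ["437 N Monticello Ave"]),
]


def extract(pattern, title):
    """Return every whole match of pattern in title, without surrounding whitespace."""
    return [m.group(0).strip() for m in pattern.finditer(title)]


def failing_cases(pattern):
    """Return (title, expected, actual) for each case the pattern gets wrong."""
    results = []
    for title, expected in CASES:
        actual = extract(pattern, title)
        if actual != expected:
            results.append((title, expected, actual))
    return results
```

Calling `failing_cases(re.compile(candidate, re.IGNORECASE))` then returns an empty list only when every title yields exactly the expected addresses.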

This regex pattern captures addresses 2-5 only (although I believe it can be simplified):

((?<!-)\b\d{1,5}(-\d{1,5})?\s([.0-9a-z]*\s){1,5}(avenue [a-z]|ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway|market))

The intersection pattern fails on "Zoning Reclassification Map No. 22-F at W 87th St and S State St and E 88th St and S Lafayette Ave"

((?<=\sat\s)(\S*[a-z]\S*\s){1,6}(avenue [a-z]|ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway|market)([ ,-.]|\b)\s?and\s?(\S*[a-z]\S*\s){1,6}(avenue [a-z]|ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway|market)([ ,-.]|\b))

Here's the full pattern:

/(((?<!-)\b\d{1,5}(-\d{1,5})?\s(\S*[a-z]\S*\s){1,6}(avenue [a-z]|ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway|market)([ ,-.]|\b))|((?<=\sat\s)(\S*[a-z]\S*\s){1,6}(avenue [a-z]|ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway|market)([ ,-.]|\b)\s?and\s?(\S*[a-z]\S*\s){1,6}(avenue [a-z]|ave|blvd|cres|ct|dr|hwy|ln|pkwy|pl|plz|rd|row|sq|st|ter|way|broadway|market)([ ,-.]|\b)))/i

herbiemarkwort commented 7 years ago

This seems to work for all 7 tests given. Will fail in the following example (though perhaps irrelevant to Chicago): "42 Webster Dr. and 101 Forest Dr."

((?<!-)\b\d{1,5}(-\d{1,5})?\s([.0-9a-z]*\s){1,5}(avenue [a-z]|ave\.?|blvd\.?|cres\.?|ct\.?|dr\.?|hwy\.?|ln\.?|pkwy\.?|pl\.?|plz\.?|rd\.?|row\.?|sq\.?|st\.?|ter\.?|way\.?|broadway|market)(?:\s|$))

stevevance commented 7 years ago

@herbiemarkwort That pattern fails on the 1st address, "Sale of City-owned property at 1105-1111 E 95th St to 95th St Building LLC": it should match "1105-1111 E 95th St" but it matches "1105-1111 E 95th St to 95th St".

But it catches all the rest, including the 6th and 7th addresses, which my pattern couldn't match.

And, "42 Webster Dr. and 101 Forest Dr." doesn't contain valid addresses in Chicago because they don't have a cardinal direction (NSEW).
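The over-match on the 1st address looks like a consequence of the greedy `{1,5}` quantifier, which extends the street name to the last suffix it can reach. The Python sketch below rebuilds the pattern above from two pieces (equivalent apart from group numbering; `PREFIX`, `SUFFIX`, `GREEDY`, `LAZY` and `first` are illustrative names) and compares it with a lazy `{1,5}?` variant. The lazy variant is shown only to illustrate the trade-off, not as a proposed fix.

```python
import re

PREFIX = r"(?<!-)\b\d{1,5}(-\d{1,5})?\s"
SUFFIX = (r"(avenue [a-z]|ave\.?|blvd\.?|cres\.?|ct\.?|dr\.?|hwy\.?|ln\.?|pkwy\.?"
          r"|pl\.?|plz\.?|rd\.?|row\.?|sq\.?|st\.?|ter\.?|way\.?|broadway|market)"
          r"(?:\s|$)")

# Greedy: take as many street-name words as possible before the suffix.
GREEDY = re.compile(PREFIX + r"([.0-9a-z]*\s){1,5}" + SUFFIX, re.IGNORECASE)
# Lazy: stop at the first suffix that fits.
LAZY = re.compile(PREFIX + r"([.0-9a-z]*\s){1,5}?" + SUFFIX, re.IGNORECASE)


def first(pattern, title):
    """Return the first match with trailing whitespace removed, or None."""
    match = pattern.search(title)
    return match.group(0).strip() if match else None


sale = "Sale of City-owned property at 1105-1111 E 95th St to 95th St Building LLC"
king = "3401-3453 S Dr. Martin L. King Dr. and 400-506 E 35th St"

print(first(GREEDY, sale))  # 1105-1111 E 95th St to 95th St
print(first(LAZY, sale))    # 1105-1111 E 95th St
print(first(GREEDY, king))  # 3401-3453 S Dr. Martin L. King Dr.
print(first(LAZY, king))    # 3401-3453 S Dr.
```

Switching the quantifier fixes the 1st address but breaks the 4th, because "Dr." is both a title and a suffix. That suggests the fix has to constrain what may appear between the direction and the suffix rather than just change greediness.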

herbiemarkwort commented 7 years ago

((?<!-)\b\d{1,5}(?:-\d{1,5})?\s(?:(?:n|north|w|west|s|south|e|east)\s(?:[.0-9a-z]*\s)?(?:[.a-z]*\s){,4}(?:avenue [a-z]|ave\.?|blvd\.?|cres\.?|ct\.?|dr\.?|hwy\.?|ln\.?|pkwy\.?|pl\.?|plz\.?|rd\.?|row\.?|sq\.?|st\.?|ter\.?|way\.?|broadway|market))(?:\s|$))
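A quick way to try this pattern, assuming Python's `re` module (where `{,4}` means "zero to four"; some other engines treat it as literal text) and case-insensitive matching as in the earlier `/i` pattern. `ADDRESS_RE` and `addresses` are illustrative names.

```python
import re

# The pattern above, split across lines for readability only.
ADDRESS_RE = re.compile(
    r"((?<!-)\b\d{1,5}(?:-\d{1,5})?\s"
    r"(?:(?:n|north|w|west|s|south|e|east)\s"
    r"(?:[.0-9a-z]*\s)?(?:[.a-z]*\s){,4}"
    r"(?:avenue [a-z]|ave\.?|blvd\.?|cres\.?|ct\.?|dr\.?|hwy\.?|ln\.?|pkwy\.?"
    r"|pl\.?|plz\.?|rd\.?|row\.?|sq\.?|st\.?|ter\.?|way\.?|broadway|market))"
    r"(?:\s|$))",
    re.IGNORECASE,
)


def addresses(title):
    """Return every address found in title, without trailing whitespace."""
    return [m.group(0).strip() for m in ADDRESS_RE.finditer(title)]


print(addresses(
    "Sale of City-owned property at 1105-1111 E 95th St to 95th St Building LLC"))
# ['1105-1111 E 95th St']
print(addresses(
    "Sale of City-owned property at 437 N Monticello Ave to Terrance P. Klees"))
# ['437 N Monticello Ave']
print(addresses("42 Webster Dr. and 101 Forest Dr."))
# []
```

Only the first word after the direction may contain digits, so the match can no longer run through "to 95th" to reach the second "St". Requiring a direction is also what rejects the "Webster Dr." example.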