Noticed while tracing exactly what happens when a newsitem is geocoded: the base scraper class's geocode_if_needed passes (essentially) location_name or address_text into the parse_addresses function in ebdata.nlp.addresses, which corrupts some of them by truncating the street name.
Examples:
120 J D Murphy Ln --> 120 J
10763 James B White Hwy S --> 10763 James
10376 Rough N Ready Rd --> 10376 Rough N
3578 Old 74 --> 3578 Old
100 John L Riegel Rd --> 100 John
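One way to check for this kind of truncation is to run known-good addresses through the parser and flag any result that comes back different from the input. The sketch below takes the parser as an argument and assumes it returns a list of (address, city) tuples, which is my understanding of parse_addresses but not something I have verified. The fake_parse stand-in and the OBSERVED table are hypothetical: they only replay the results listed above so the example runs without OpenBlock installed.

```python
def find_truncated(addresses, parse):
    """Return (original, parsed) pairs where the parser changed the input.

    `parse` is ASSUMED to return a list of (address, city) tuples.
    """
    truncated = []
    for original in addresses:
        results = parse(original)
        # Take the first parsed address, or an empty string if nothing matched.
        parsed = results[0][0] if results else ""
        if parsed.strip().lower() != original.strip().lower():
            truncated.append((original, parsed))
    return truncated


# Hypothetical stand-in: replays the outputs observed in this ticket.
OBSERVED = {
    "120 J D Murphy Ln": "120 J",
    "10763 James B White Hwy S": "10763 James",
    "10376 Rough N Ready Rd": "10376 Rough N",
    "3578 Old 74": "3578 Old",
    "100 John L Riegel Rd": "100 John",
}


def fake_parse(text):
    # Anything not in OBSERVED is returned unchanged, as if parsed correctly.
    return [(OBSERVED.get(text, text), "")]


SAMPLES = list(OBSERVED) + ["123 Main St"]

for original, parsed in find_truncated(SAMPLES, fake_parse):
    print("%s --> %s" % (original, parsed))
```

To run this against the real parser, pass parse_addresses in place of fake_parse.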
It isn't immediately obvious to me how to correct the regular expression used by parse_addresses. What it does may well be "correct" for the case of extracting an address from a large block of text that may or may not contain one; I'm not sure. For our purposes, though, it seems things would work better if we passed only address_text through this NLP routine. I'm going to change geocode_if_needed in our DataDashboard scraper mixin to do that. Later we may decide to push these changes back to base OpenBlock, but right now I don't know the code in this area well enough to be sure of that.
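A minimal sketch of the idea, not the actual mixin code: location_name is kept as-is and only address_text goes through the parser. The function name geocode_candidates, the ordering (location_name first), and the demo_parse stand-in are all assumptions made for illustration; the parser is again assumed to return (address, city) tuples.

```python
def geocode_candidates(location_name, address_text, parse):
    """Build the ordered list of strings to hand to the geocoder.

    location_name is used verbatim (no NLP pass); only address_text
    is run through `parse`, ASSUMED to return (address, city) tuples.
    """
    candidates = []
    if location_name and location_name.strip():
        candidates.append(location_name.strip())
    if address_text:
        for address, _city in parse(address_text):
            if address not in candidates:
                candidates.append(address)
    return candidates


def demo_parse(text):
    # Hypothetical stand-in for parse_addresses, for demonstration only.
    return [("100 Main St", "")] if "100 Main St" in text else []


print(geocode_candidates("120 J D Murphy Ln", "", demo_parse))
print(geocode_candidates("", "Fire reported near 100 Main St.", demo_parse))
```

In the mixin, geocode_if_needed would then try each candidate in turn and stop at the first one that geocodes successfully.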