cfpb / grasshopper

CFPB's streaming batch geocoder
Creative Commons Zero v1.0 Universal

Difference Between Bulk and Single Line Geocoding #204

Open kgudel opened 8 years ago

kgudel commented 8 years ago

Using the addresses file in the test harness (~4350 rows), I ran the batch geocoder and found addresses that it does not find, even though the state address points geocoder finds them when the addresses are geocoded individually. Some examples:

7437 South 1550 East SOUTH WEBER UT 84405
7473 South 1550 East SOUTH WEBER UT 84405
9 East 750 South FARMINGTON UT 84025
871 West Brandon Drive KAYSVILLE UT 84037
857 West Brandon Drive KAYSVILLE UT 84037
521 Wharton Road Lowell AR 72745
2 Arrow Brook Court Little Rock AR 72227
4 Amherst Cove Little Rock AR 72205
43 Temple Court Northwest WASHINGTON DC 20001
45 K Street Northwest WASHINGTON DC 20001
2 Lupine Lane SOUTH BURLINGTON VT 05403
57 Munson Drive WILLISTON VT 05495
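One way to collect such cases systematically is to diff the two result sets. The following is only a sketch: `BatchDiff`, the `Result` type, and the idea of keying results by the raw address string are assumptions made for illustration, not part of grasshopper, and the coordinates in the usage example are placeholders.

```scala
object BatchDiff {
  // Hypothetical result type: Some((lat, lon)) when geocoded, None when not found.
  type Result = Option[(Double, Double)]

  // Addresses that the single-line geocoder resolved but the batch run did not.
  // An address missing from the batch map is treated the same as "not found".
  def missedByBatch(
      batch: Map[String, Result],
      single: Map[String, Result]
  ): Seq[String] =
    single.collect {
      case (address, Some(_)) if batch.getOrElse(address, None).isEmpty => address
    }.toSeq.sorted
}
```

For example, with placeholder coordinates:

```scala
val single = Map(
  "2 Lupine Lane SOUTH BURLINGTON VT 05403" -> Some((0.0, 0.0)),
  "57 Munson Drive WILLISTON VT 05495"      -> Some((0.0, 0.0))
)
val batch = Map(
  "2 Lupine Lane SOUTH BURLINGTON VT 05403" -> Option.empty[(Double, Double)],
  "57 Munson Drive WILLISTON VT 05495"      -> Some((0.0, 0.0))
)
// Only the Lupine Lane address is reported as missed by batch.
BatchDiff.missedByBatch(batch, single)
```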

In case it is relevant, these results are from after I made some changes to the geocoder: I added more synonyms to the loader, and I changed the file in grasshopper that builds the Elasticsearch query so that it uses proximity searching.

Specifically, I changed:

    private def searchAddressFields(client: Client, index: String, indexType: String, number: String, streetName: String, city: String, state: String, zipCode: String): Array[SearchHit] = {
      val numberQuery = QueryBuilders.matchQuery("number", number)
      val streetQuery = QueryBuilders.matchPhraseQuery("street", streetName)
      val cityQuery = QueryBuilders.matchQuery("city", city)
      val stateQuery = QueryBuilders.matchQuery("state", state)
      val zipQuery = QueryBuilders.matchQuery("zip", zipCode)

      val query = QueryBuilders
        .boolQuery()
        .must(numberQuery)
        .must(streetQuery)
        //.must(cityQuery) Removing for now, decreases response rate if data is not 100% accurate
        .must(stateQuery)
        .must(zipQuery)

to

    private def searchAddressFields(client: Client, index: String, indexType: String, number: String, streetName: String, city: String, state: String, zipCode: String): Array[SearchHit] = {
      val numberQuery = QueryBuilders.matchQuery("number", number)
      val streetQuery_strict = QueryBuilders.matchPhraseQuery("street", streetName)
      val cityQuery = QueryBuilders.matchQuery("city", city)
      val stateQuery = QueryBuilders.matchQuery("state", state)
      val zipQuery = QueryBuilders.matchQuery("zip", zipCode)
      val streetQuery_loose = QueryBuilders.matchQuery("street", streetName)

      val query = QueryBuilders
        .boolQuery()
        .must(numberQuery)
        //.must(streetQuery_strict)
        .must(streetQuery_loose)
        //.must(cityQuery) Removing for now, decreases response rate if data is not 100% accurate
        .must(stateQuery)
        .must(zipQuery)
        .should(streetQuery_strict)
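To illustrate why the second version returns more hits, here is a rough, self-contained model of the two street clauses. It is a simplification, not Elasticsearch itself: the whitespace tokenizer stands in for whatever analyzer the index really uses, `MatchSemantics` and its functions are hypothetical names, and the indexed value "West Brandon Dr" is an invented example rather than a record from the real index.

```scala
object MatchSemantics {
  // Simplified analyzer (an assumption): lowercase and split on whitespace.
  def tokens(s: String): Vector[String] =
    s.toLowerCase.split("\\s+").filter(_.nonEmpty).toVector

  // Rough model of a `match` query with the default OR operator:
  // the document matches if it shares at least one token with the query.
  def looseMatch(indexed: String, query: String): Boolean = {
    val docTokens = tokens(indexed)
    tokens(query).exists(docTokens.contains)
  }

  // Rough model of a `match_phrase` query with no slop:
  // the query tokens must appear contiguously and in order.
  def phraseMatch(indexed: String, query: String): Boolean =
    tokens(indexed).containsSlice(tokens(query))
}
```

Under this model, a query for "West Brandon Drive" against an indexed "West Brandon Dr" passes `looseMatch` but fails `phraseMatch`. In the changed query, the loose clause sits in `must` and so decides whether a document is returned, while the strict clause sits in `should`, where (because `must` clauses are present) it is optional and only raises the score of exact phrase matches.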

kgudel commented 8 years ago

There is also a difference when the number field is searched with a match_phrase query instead of a match query, that is, QueryBuilders.matchPhraseQuery("number", number) instead of QueryBuilders.matchQuery("number", number). This breaks batch geocoding, which then finds no addresses at all. A single line query, however, still works and finds addresses.