cfpb / grasshopper

CFPB's streaming batch geocoder
Creative Commons Zero v1.0 Universal
37 stars 13 forks source link

API - always return the input value #149

Open awolfe76 opened 9 years ago

awolfe76 commented 9 years ago

When running this query: /geocode/200 President St Arkansas City AR 71630

Everything is great and you see the input "input": "200 President St Arkansas City AR 71630",


But, If you query: /geocode/200 President St

You also get a result but the input is: "input": "",

The input in this case should be: "input": "200 President St",


I think that the input should be populated even if no results are returned as well.

hkeeler commented 9 years ago

I was going to take a stab at this one for this sprint, but I think its worth a bit of discussion first as there are several variables at play here. Below are samples of /geocode requests with with successful and failed address parsings. Thoughts on options for improving this coming soon...

Success

Request

GET /geocode/200 President St Arkansas City AR 71630

Response


{
  "status": "OK",
  "query": {
    "input": "200 President St Arkansas City AR 71630",
    "parts": {
      "city": "Arkansas City",
      "zip": "71630",
      "state": "AR",
      "streetName": "President St",
      "addressNumber": "200"
    }
  },
  "addressPointsService": {
    "status": "OK",
    "features": [{
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [-91.19978780015629, 33.608091616155995]
      },
      "properties": {
        "address": "200 President St Arkansas City AR 71630",
        "alt_address": "",
        "match": 1.0
      }
    }]
  },
  "censusService": {
    "status": "OK",
    "features": [{
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [-91.19960153268617, 33.60763673811005]
      },
      "properties": {
        "RFROMHN": "100",
        "RTOHN": "498",
        "ZIPL": null,
        "FULLNAME": "President St",
        "LFROMHN": null,
        "LTOHN": null,
        "ZIPR": "71630",
        "STATE": "AR"
      }
    }]
  }
}

Fail

Request

GET /geocode/200 President St

Response


{
  "status": "OK",
  "query": {
    "input": "",
    "parts": {
      "city": "",
      "zip": "",
      "state": "",
      "streetName": "",
      "addressNumber": ""
    }
  },
  "addressPointsService": {
    "status": "OK",
    "features": [{
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [-91.19978780015629, 33.608091616155995]
      },
      "properties": {
        "address": "200 President St Arkansas City AR 71630",
        "alt_address": "",
        "match": 0.41025641025641024
      }
    }]
  },
  "censusService": {
    "status": "ADDRESS_NOT_FOUND",
    "features": []
  }
}
awolfe76 commented 9 years ago

I think I get where you're coming from, but my thought was that no matter what, as long as we can return a response at all, we should return the query.input value.

Meaning that even if all services fail we should see something like:

GET /geocode/200 President St

{
  "status": "OK",
  "query": {
    "input": "200 President St",
    "parts": {
      "city": "",
      "zip": "",
      "state": "",
      "streetName": "",
      "addressNumber": ""
    }
  },
  "addressPointsService": {
    "status": "ADDRESS_NOT_FOUND",
    "features": []
  },
  "censusService": {
    "status": "ADDRESS_NOT_FOUND",
    "features": []
  }
}

It might be useful to somehow show the parser status as well, rather than, or in addition to, the parts returning empty strings.

Maybe break out the parts as its own object? That is the parser, right?

If I'm right, maybe:

GET /geocode/200 President St

{
    "status": "OK",
    "input": "200 President St",
    "parserService": {
        "status": "ADDRESS_NOT_PARSED",
        "parts": {}
    },
    "addressPointsService": {
        "status": "ADDRESS_NOT_FOUND",
        "features": []
    },
    "censusService": {
        "status": "ADDRESS_NOT_FOUND",
        "features": []
    }
}

Basically as long as the geocode is "status": "OK", we should, at a minimum return the input.

Maybe that's what you meant by

there are several variables at play here.

:smile:

hkeeler commented 9 years ago

@awolfe76, I actually wasn't proposing anything yet. :) The examples above are how the service works today. I just put those up as reference for further discussion. Now that I read my comment, that isn't particularly clear.

I tend to agree that we should move input outside of the parser. It will probably make the code simpler on the geocoder if that's the case as well.

More thoughts on this to come...

awolfe76 commented 9 years ago

Understood. Just wanted to throw out a few ideas too.

hkeeler commented 9 years ago

I took a few stabs at this a couple weeks ago, but each approach has some snags, and requires a fair amount of refactoring.

Take 1: Make input top-level attribute

My first stab was actually refactor the format as we discussed

{
    "status": "OK",
    "input": "200 President St",
    "parserService": {
        "status": "ADDRESS_NOT_PARSED",
        "parts": {}
    },
    "addressPointsService": {
    ...
}

...but there is already quite a lot of code that is dependent on the input coming along for the ride with the parser's response. I started to refactor accordingly, but then decided to hold off as it was starting to be quite a lot of change.

Take 2: Keep current format, manually set input on failed parse

I got this working pretty easily for the standard GET /geocode request, but I ran into snags with GeocodeFlow. In the following function...

  def parseFlow: Flow[String, ParsedAddress, Unit] = {
    Flow[String]
      .mapAsync(4)(a => AddressParserClient.standardize(a))
      .map { x =>
        if (x.isRight) {
          x.right.getOrElse(ParsedAddress.empty)
        } else {
          ParsedAddress.empty
        }
      }
  }

...AddressParserClient.standardize(a) returns either a ResponseError (the Left) or a ParsedAddress (the Right). Unfortunately, ResponseError is very generic, and does not contain the input address string, so you're out of luck.

I'm not going to make any changes at this point. I think the better option is to do the full refactoring necessary for moving input up to a top-level attribute, and refactoring the Flow, which is necessary in either scenario.

awolfe76 commented 9 years ago

[GHE]/HMDA-Operations/hmda-devops/issues/79