Ironholds / poster

Address parsing and normalisation through libpostal
MIT License
59 stars 9 forks source link

parse_addr returns NAs #5

Closed dantonnoriega closed 7 years ago

dantonnoriega commented 8 years ago

To start off, the normalise_addr works fine.

library(poster)
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin14.5.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] poster_0.2.0   colorout_1.1-2
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5       formatR_1.4        tools_3.3.0       
##  [4] htmltools_0.3.5    yaml_2.1.13        Rcpp_0.12.7       
##  [7] stringi_1.1.1      rmarkdown_1.0.9002 knitr_1.14        
## [10] stringr_1.1.0      digest_0.6.10      evaluate_0.9
str(normalise_addr("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"))
##  chr "781 franklin avenue crown heights brooklyn nyc new york 11216 usa"

Its parse_addr that is funky.

str(poster::parse_addr("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"))
## 'data.frame':    1 obs. of  10 variables:
##  $ house         : chr NA
##  $ house_number  : chr NA
##  $ road          : chr NA
##  $ suburb        : chr NA
##  $ city_district : chr NA
##  $ city          : chr NA
##  $ state_district: chr NA
##  $ state         : chr NA
##  $ postal_code   : chr NA
##  $ country       : chr NA

Anyhow, sad it isn't working. Not the end of the word as I think the normalizer will suffice. Just thought I'd inform!

Here is what I've done already:

❯ make check
Making check in src
Making check in sparkey
make[2]: Nothing to be done for `check'.
./libpostal_data download all /Users/dnoriega/libpostal
Checking for new libpostal data file...
libpostal data file up to date
Checking for new libpostal geodb data file...
libpostal geodb data file up to date
Checking for new libpostal parser data file...
libpostal parser data file up to date
Checking for new libpostal language classifier data file...
libpostal language classifier data file up to date
Making check in test
/Library/Developer/CommandLineTools/usr/bin/make  check-TESTS
PASS: test_libpostal
============================================================================
Testsuite summary for libpostal 0.3
============================================================================
# TOTAL: 1
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[1]: Nothing to be done for `check-am'.
❯ ./libpostal "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"
781 franklin avenue crown heights brooklyn nyc new york 11216 usa
781 franklin avenue crown heights brooklyn nyc ny 11216 usa

and

❯ ./address_parser
Loading models...

Welcome to libpostal's address parser.

Type in any address to parse and print the result.

Special commands:

.language [code] to specify a language
.country [code] to specify a country
.exit to quit the program

> "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"

Result:

{
  "house": "\"",
  "house_number": "781",
  "road": "franklin ave",
  "suburb": "crown heights",
  "city_district": "brooklyn",
  "city": "nyc",
  "state": "ny",
  "postcode": "11216",
  "country": "usa",
  "house": "\""
}

Here is maybe relevant computer info:

❯ sw_vers
ProductName:     Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G1004
❯ brew info gcc
gcc: stable 6.2.0 (bottled), HEAD
GNU compiler collection
https://gcc.gnu.org
/usr/local/Cellar/gcc/6.2.0 (1,435 files, 281.9M) *
lukeholman commented 8 years ago

Hi there,

Just adding a note to say that I've replicated this issue in its entirety - I've done all the same checks, and normalise_addr works but not parse_addr.

Cheers!

Ironholds commented 8 years ago

Weird! Thanks; will be looking into it over the next couple of days.

On Tuesday, 20 September 2016, Luke Holman notifications@github.com wrote:

Hi there,

Just adding a note to say that I've replicated this issue in its entirety

  • I've done all the same checks, and normalise_addr works but not parse_addr.

Cheers!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Ironholds/poster/issues/5#issuecomment-248469212, or mute the thread https://github.com/notifications/unsubscribe-auth/ACXz3naLOY0t-tOiAdKk83OUBBNKl7noks5qsG_cgaJpZM4J3g7O .

Ironholds commented 8 years ago

Welp, looking into it now and I...can't get libpostal to install on my macbook :|

Looking through the code I can basically think of two possible sources of the bug; either (1) libpostal is returning empty strings (in which case they get turned into NAs) or (2) the API changed and so it isn't actually fetching any of the entries (and is thus returning NAs). I'll be able to eliminate one or both possibilities once I've got it installed (I assume the CLI works fine?)

lukeholman commented 8 years ago

Yeah, on my computer the CLI works fine (I guess you mean the interactive shell for the parser that you get by typing .address_parser or whatever it is into Terminal). But maybe it returns the address elements in a format that the R-C interface doesn't recognise? Seems like the R library has some internal functions that get the elements (house, city, etc) of the address from the parser output, and pass them to the R parser function, and these might not be working ok, but I couldn't figure it out.

Cheers

Ironholds commented 8 years ago

Yeah - say, could you paste some example output for the interactive shell?

lukeholman commented 8 years ago

Sure, here you go!

> 18 Smith Road, London, UK

Result:

{
  "house_number": "18",
  "road": "smith road",
  "city": "london",
  "country": "uk"
}

> 1600 Amphitheatre Parkway Mountain View, CA 94043 Phone: +1 650-253-0000

Result:

{
  "house_number": "1600",
  "road": "amphitheatre parkway",
  "city": "mountain view",
  "state": "ca",
  "postcode": "94043",
  "house": "phone +1 650-253-0000"
}
Ironholds commented 8 years ago

Thiiis is super weird. The only explanation I have right now is that the function that actually does the retrieval is...somehow not being called? Which, what the fuck?

lukeholman commented 8 years ago

Somedays R just doesn't want to cooperate! Is it possible to call the internal 'get elements' function (I forget its exact name) on its own and test that? I couldn't get it to work (it actually crashes RStudio and gives the little bomb icon when run :) ).

Ironholds commented 8 years ago

I can't see any Rcpp changes that would've caused this to suddenly fall over. @thatdatabaseguy did you make any API changes or whatnot in the last release? (I don't see a changelog or I'd check that, and I know you were doing a big rewrite)

Ironholds commented 8 years ago

I can probably export that function and see what'll happen, yeah - that's my next port of call! How did you export it that it blew up?

Ironholds commented 8 years ago

Okay, I called it internally, it doesn't blow up, so the libpostal side of things is working so what in the hell is going on?!

Ironholds commented 8 years ago

Actually if you called the getters and setters they would've crashed on account of not working (hence the lack of export)

Ironholds commented 8 years ago

Wait, I know what it bloody is. The last Rcpp release makes get_cstring() vanish. This was, of course, not documented.

Ironholds commented 8 years ago

Well, that's a problem but it doesn't fix it. hrrn.

Ironholds commented 8 years ago

Okay, I should've fixed it. If someone else seeing the bug can replicate the fix on the latest commit, I'd be most grateful. I still have no idea what caused it.

dantonnoriega commented 8 years ago
> library(poster)
> 
> str(normalise_addr("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"))
 chr "781 franklin avenue crown heights brooklyn nyc new york 11216 usa"
> str(poster::parse_addr("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"))
'data.frame':   1 obs. of  10 variables:
 $ house         : chr NA
 $ house_number  : chr "781"
 $ road          : chr "franklin ave"
 $ suburb        : chr "crown heights"
 $ city_district : chr "brooklyn"
 $ city          : chr "nyc"
 $ state_district: chr NA
 $ state         : chr "ny"
 $ postal_code   : chr NA
 $ country       : chr "usa"

Quick rerun and seems to work better, but a little disconcerting that postal_code doesn't seem to work (I would think that would be one of the easier ones for the wee algorithm).

albarrentine commented 8 years ago

Ah, R binding was using the wrong field name (the C library calls it "postcode"). Pull latest and that address should parse to the same result as in the command-line client output above.

dantonnoriega commented 8 years ago

Nice!

> poster::parse_addr("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA")
  house house_number         road        suburb city_district city
1  <NA>          781 franklin ave crown heights      brooklyn  nyc
  state_district state postal_code country
1           <NA>    ny       11216     usa
Ironholds commented 8 years ago

Nicely done, Al! Any idea why the code stopped working randomly? It's the weirdest thing - if you nest the call 4 functions deep, it never runs. 3 functions deep, it does. wut.

albarrentine commented 8 years ago

Like if you call foo, which calls bar, which calls blah which calls poster::parse_addr, it returns NAs?

Ironholds commented 8 years ago

More precisely it simply doesn't appear to run; it returns NAs because the output is preallocated with NAs a call above.

On Sunday, 25 September 2016, Al Barrentine notifications@github.com wrote:

Like if you call foo, which calls bar, which calls blah which calls poster::parse_addr, it returns NAs?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Ironholds/poster/issues/5#issuecomment-249431587, or mute the thread https://github.com/notifications/unsubscribe-auth/ACXz3gtW2mR1LG5oOuX2o59MN7dg7Inrks5qtqLSgaJpZM4J3g7O .

albarrentine commented 8 years ago

An example would be helpful. If the inputs to parse_addr are NA, it will do nothing, but allocation of the outputs happens within the function's scope so ought to be independent of any previous calls.

lukeholman commented 8 years ago

Hi all, I just got around to installing the new version and testing it out. Works great for me too! Thanks for the help - now to test it on a big mass of data!