Ironholds / poster

Address parsing and normalisation through libpostal
MIT License

updates for libpostal 1.0 #6

Closed albarrentine closed 7 years ago

albarrentine commented 7 years ago

Ahoy Oliver!

We have a 1.0 for libpostal as of earlier today. Conditional Random Fields trained on 1 billion addresses, 99.45% full-parse accuracy on held-out, and a saucy new GIF.

Since I was bumping the major version anyway, I took the time to add a "libpostal_" prefix to everything in our header, which is generally good practice but something I neglected to do before releasing the initial version.

This PR implements all the necessary API changes for 1.0, and adds the new parser labels, which number 20 now. I don't have an R environment set up to test it on, but think I got everything.

Also, some possibly relevant details for CRAN purposes: 1.0 removes the snappy, sparkey and mmap dependencies (compilation on Windows is maybe-sorta possible). In this version I trained the language classifier with the FTRL-Proximal method instead of L2 regularization. The former combines L1 and L2 penalties to encourage sparsity. Suffice to say, the new language classifier is 1/10th the size (only 70ish MB now instead of 700), and actually a bit more accurate than its predecessor. The new address parser is larger, around 1.8GB on-disk/in-memory, but that's mostly because it trains on 10x more data, though there are some sparsity tricks there as well to keep its size under control. That is to say, the size is now under 2GB (unzipped; download size is around 750MB) if that number magically changes anything.

Cheers! ./al

Ironholds commented 7 years ago

Sweet; thanks! I'll open a thread to see if it makes a difference to CRAN, although tbh I suspect the answer is likely to be 'nope'. Generally they'd almost certainly wanna see a really, really streamlined form :(. But I'll find out!

BenK10 commented 7 years ago

Great!

We are interested in using libpostal on Windows. Now that the snappy and sparkey dependencies have been removed, this should be easier. Any suggestions on how to go about this? We don't want to build from MinGW since we use Visual Studio.

libpostal is not thread-safe but is there any way we can make it so by adding mutexes and such to the code? Would this be a major rewrite? We have a use case of worker threads parsing several addresses each (not a lot) so we would like to be able to load the models once and share them across threads.

Ironholds commented 7 years ago

These questions seem to be for libpostal proper, rather than the R bindings, no?

albarrentine commented 7 years ago

@BenK10 Windows is maybe possible now. As I've said before, I have zero Windows dev machines and frankly have neither used nor thought about Windows in years, but happy to accept patches if anyone gets a build working.

Saying libpostal is not thread-safe doesn't mean there's no possible way to use it from a threaded program; it just means "it's up to you, the caller, to wrap your calls to libpostal in a mutex or else the behavior is undefined." The Java binding, for instance, can be used within a threaded server because all the calls to libpostal are synchronized methods.

The C library itself shouldn't have to rope in pthread, add the overhead of holding a mutex per call, and make all of that work cross-platform just for the few people calling it from threaded environments; that would hurt performance in the single-threaded case (e.g. Python or NodeJS). As for making libpostal itself multi-threaded, the only good reason to do that would be performance, and the library can already do > 10k addresses per second. In anything other than optimized C/C++, the calling code is almost certainly slower than that. If libpostal ever were the bottleneck, the workload would be large enough to peg the CPU (not ideal in most multithreaded environments), which would probably call for sharding across multiple processes/machines anyway.

BenK10 commented 7 years ago

@thatdatabaseguy Let's say I use the C API. In a threaded program, would each thread have to do its own setup/teardown of libpostal? Is it possible to instead have a master thread that does the setup once, then spawns worker threads that do mutexed calls to parse_address()? Put another way, is there a way for the child threads to inherit their parent's context so they don't need to each load their own copy of libpostal?

albarrentine commented 7 years ago

@BenK10 definitely, that's how it's intended to work. The setup functions should be called once per process at the beginning of the program and the teardown functions should be called once at the end. In between it's fine to create multiple threads and call libpostal_parse_address from them as long as they implement their own locking.
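For readers who want to see the shape of that pattern, here is a minimal sketch in Python. It is an illustration only: `fake_setup`, `fake_parse_address` and `fake_teardown` are hypothetical stand-ins invented for this example, not libpostal functions, and the "parser" just splits on whitespace. The point is the structure: load once, guard every parse call with one shared lock, tear down once after the workers have joined.

```python
import threading

# Counters so the example can show that setup/teardown happen exactly once.
state = {"setup_calls": 0, "teardown_calls": 0}


def fake_setup():
    """Hypothetical stand-in for the once-per-process model load."""
    state["setup_calls"] += 1


def fake_teardown():
    """Hypothetical stand-in for the once-per-process cleanup."""
    state["teardown_calls"] += 1


def fake_parse_address(address):
    """Hypothetical stand-in for a parse call that is NOT thread-safe.

    It only splits on whitespace and tags every token "unlabeled".
    """
    return [(token, "unlabeled") for token in address.split()]


# One lock shared by every thread in the process.
_parser_lock = threading.Lock()


def locked_parse(address):
    """Caller-side locking: only one thread is inside the parser at a time."""
    with _parser_lock:
        return fake_parse_address(address)


def parse_in_workers(addresses, n_workers=4):
    """Set up once, parse from several threads, tear down once."""
    fake_setup()
    results = [None] * len(addresses)

    def worker(start):
        # Each worker handles every n_workers-th address, so no two
        # workers ever write to the same slot in `results`.
        for i in range(start, len(addresses), n_workers):
            results[i] = locked_parse(addresses[i])

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    fake_teardown()
    return results
```

Calling `parse_in_workers(["781 Franklin Ave", "10 Downing St London"], n_workers=2)` returns one token list per input address, in input order. In a C program the same structure would presumably use a mutex from whatever threading library the caller already depends on, which is the reason given above for keeping the locking outside the library.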

BenK10 commented 7 years ago

I see. Thanks for clarifying that.