NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0
14 stars 3 forks source link

postal_parse_address() performance #48

Closed lmores closed 4 months ago

lmores commented 4 months ago

Not ready for merge, just opening to share the code with @NickCrews and @jstammers following discussion in #47 (is there a better way?).

To run the benchmarks just execute python -m benchmarks.parse_address. This will print the timings on the screen and store the DuckDB files inside benchmarks/db/<TIMESTAMP>.

Here are my results:

$ python -m benchmarks.parse_address
_parse_fn__noop           took  14.6711 seconds
_parse_fn__python_only    took  15.4183 seconds
_parse_fn__postal_only    took  87.2642 seconds
_parse_fn__complete       took  88.1358 seconds
postal_parse_address      took 107.9780 seconds

My laptop specs:

OS: Ubuntu 21.04 x86_64 
Host: Latitude 5500 
Kernel: 5.11.0-49-generic 
CPU: Intel i7-8665U (8) @ 4.800GHz 
GPU: Intel WhiskeyLake-U GT2 [UHD Graphics 620] 
Memory: 8067MiB / 15813MiB 

Note: _parse_fn__complete has (at least) one bug. When the result dictionary is converted to an ibis/DuckDB data structure, keys and values are shuffled (maybe the code that makes the conversion relies on the dictionary entries to have the same order as the fields declared in ADDRESS_SCHEMA?). Here are the first rows of the table I get:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ _parse_fn__complete_0(full_address)                                              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ struct<street1: string, street2: string, city: string, state: string, postal_co… │
├──────────────────────────────────────────────────────────────────────────────────┤
│ {'street1': 'anchorage', 'street2': 'alaska', ... +4}                            │
│ {'street1': 'juneau', 'street2': 'alaska', ... +4}                               │
│ {'street1': 'anchorage', 'street2': 'alaska', ... +4}                            │
│ {'street1': 'anchorage', 'street2': 'alaska', ... +4}                            │
│ {'street1': 'anchorage', 'street2': 'alaska', ... +4}                            │
└──────────────────────────────────────────────────────────────────────────────────┘
jstammers commented 4 months ago

This looks very promising, thanks for starting this @lmores .

I wonder if we can avoid the problem you've highlighted by explicitly returning them in order at the end

{
    "street1": result["street1"],
    "street2": result["street2"],
    ...
}

although this feels unnecessary as I would expect the fields to be inferred correctly from the schema. This could be an upstream bug with ibis - here's what I see with a toy example where I reverse the order of the fields in the address

import ibis
from ibis.expr import datatypes as dt
ibis.options.interactive = True

import pandas as pd

ADDRESS_SCHEMA = dt.Struct(
    {
        "street1": "string",
        "street2": "string",
        "city": "string",
        "state": "string",
        "postal_code": "string",
        "country": "string",
    }
)

@ibis.udf.scalar.python
def reversed_address() -> ADDRESS_SCHEMA:
    return {
        "country": "country",
        "postal_code": "postal_code",
        "state": "state",
        "city": "city",
        "street2": "street2",
        "street1": "street1",
    }

@ibis.udf.scalar.python
def address() -> ADDRESS_SCHEMA:
    return {
        "street1": "street1",
        "street2": "street2",
        "city": "city",
        "state": "state",
        "postal_code": "postal_code",
        "country": "country",
    }

table = ibis.memtable(pd.DataFrame({"ix":[1,2,3,4]}))
table.mutate(address=address(), rev_address=reversed_address())

┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ix    ┃ address                                                                          ┃ rev_address                                                                      ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ struct<street1: string, street2: string, city: string, state: string, postal_co… │ struct<street1: string, street2: string, city: string, state: string, postal_co… │
├───────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│     1 │ {'street1': 'street1', 'street2': 'street2', ... +4}                             │ {'street1': 'country', 'street2': 'postal_code', ... +4}                         │
│     2 │ {'street1': 'street1', 'street2': 'street2', ... +4}                             │ {'street1': 'country', 'street2': 'postal_code', ... +4}                         │
│     3 │ {'street1': 'street1', 'street2': 'street2', ... +4}                             │ {'street1': 'country', 'street2': 'postal_code', ... +4}                         │
│     4 │ {'street1': 'street1', 'street2': 'street2', ... +4}                             │ {'street1': 'country', 'street2': 'postal_code', ... +4}                         │
└───────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┘

Maybe for now, the solution is to explicitly enforce the order of the fields in the returned object.

Another edge-case to test would be one in which multiple values for the same field are returned. Here's an example for an address that looks correct, but because the city and county are swapped, the parsed result contains multiple city components

parse_address("123 Elm Street, Los Angeles, Burbank, CA 91501")
[('123', 'house_number'),
 ('elm street', 'road'),
 ('los', 'city'),
 ('angeles', 'suburb'),
 ('burbank', 'city'),
 ('ca', 'state'),
 ('91501', 'postcode')]

I expect these are more likely to be invalid parses than correct ones, but worth noting.

NickCrews commented 4 months ago

Thanks for this @lmores , this looks awesome!

That is a bug with ibis, I filed an issue at https://github.com/ibis-project/ibis/issues/9613. If possible, can we keep this PR/discussion just related to perf/benchmarking as much as possible?

Here is my interpretation of the benchmarks:

so what this says to me is the time of this UDF can be accounted as the sum of

What do you think of this interpretation?

If this is correct, then trying to optimize the python is going to be fruitless, and we should just go with the most idiomatic version we can.

Even if we don't change our implementation, I still think this benchmarking harness code is useful and we should add it

lmores commented 4 months ago

@NickCrews: I agree with your interpretation of the results, but not on the conclusions XD; and here is why: all functions named _parse_fn__* have been obtained by modifying my implementation of the UDF, not the current one! (my bad, should have mentioned it).

Timings must be read as follows:

$ python -m benchmarks.parse_address
_parse_fn__noop           took  14.6711 seconds
_parse_fn__python_only    took  15.4183 seconds    # my implementation, not calling pypostal
_parse_fn__postal_only    took  87.2642 seconds
_parse_fn__complete       took  88.1358 seconds    # my implementation
postal_parse_address      took 107.9780 seconds    # current implementation

Since current implementation (postal_parse_address) takes 107 seconds and my implementation (_parse_fn__complete) takes 88 seconds, we would gain ~20 seconds switching to the new one (and your analysis proves that we cannot improve anymore with plain python as _parse_fn__complete - _parse_fn__postal_only ~= .9s).

Does it make sense?

NickCrews commented 4 months ago

Ahh, thank you, yes that makes sense. Let's add some tests, fixup your implementation, and get it merged! Exciting!

NickCrews commented 4 months ago

FYI I just pushed https://github.com/NickCrews/mismo/commit/ac10b6c60f0e8069f882f3cdfbe0d3f59023c784 that adjusts the current implementation a bit, but shouldn't affect runtime

I am looking back at my old notes from https://github.com/NickCrews/mismo/issues/19, and I think I want to take this opportunity to actually use pytest-benchmark for this. So I'm tweaking this PR right now locally to use this framework, so hold off on any edits you're making. Thanks!

NickCrews commented 4 months ago

lol, actually we already have pytest-benchark set up. grep for benchmark and you can find examples. Still, I'm migrating this PR to use that.

NickCrews commented 4 months ago

The failing tests are unrelated, I will fix those up separately.

@lmores this is ready to merge whenever! You need to mark this PR as not WIP for me to be able to merge it.

NickCrews commented 4 months ago

nevermind I was able to remove the WIP.

If you want, I'd love to see you actually change the implementation to the faster version in a followup PR!