NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Address parsing performance #47

Open lmores opened 2 months ago

lmores commented 2 months ago

I made a very rough benchmark on a dataset of 2,192,071 Italian street addresses (which, unfortunately, I am not allowed to share).

I replaced the current implementation of postal_parse_address() with:

import ibis

# `ADDRESS_SCHEMA` is the result struct dtype from mismo's existing implementation.
from postal.parser import parse_address as _parse_address

@ibis.udf.scalar.python
def postal_parse_address(address_string: str) -> ADDRESS_SCHEMA:
    # Initially, the keys match the names of pypostal fields we need.
    # Later, this dict is modified to match the shape of an `ADDRESS_SCHEMA`.
    result: dict[str, str | None] = {
        "house_number": None, "road": None, "unit": None,
        "city": None, "state": None, "postcode": None, "country": None
    }

    parsed_fields = _parse_address(address_string)
    for value, label in parsed_fields:
        # Pypostal returns more fields than the ones we actually need.
        # Here `False` is used as a placeholder, under the assumption that
        # pypostal never returns `False` as a field value.
        current = result.get(label, False)

        # Keep only the fields declared when `result` is initialized.
        # Pypostal fields can be repeated, in such case we concat their values.
        if current is not False:
            result[label] = value if current is None else f"{current} {value}"

    # Hack to prepend "house_number" to "road"
    house_number = result.pop("house_number")
    if house_number is not None:
        road = result["road"]
        if road is None:
            result["road"] = house_number
        else:
            result["road"] = f"{house_number} {road}"

    # Modify `result` in-place to match the shape of an `ADDRESS_SCHEMA`.
    result["street1"] = result.pop("road")
    result["street2"] = result.pop("unit")
    result["postal_code"] = result.pop("postcode")

    return result

I am starting from an ibis table t, backed by duckdb, with the following columns: ['record_id', 'administrative_area', 'locality', 'postal_code', 'address_lines', 'recipients', 'region_code', 'language_code', 'sorting_code', 'sublocality', 'organization'].

I add a new column containing the whole address:

from ibis import _

t = t.mutate(
    full_address=_.address_lines + ", " + _.postal_code + ", " + _.locality + ", " + _.administrative_area + ", " + _.region_code
)

and then run the following snippet for both versions of postal_parse_address():

from time import time

parse_start = time()
t = t.mutate(
    libpostal_address=postal_parse_address(_.full_address)
)
print(t)
print(time() - parse_start)

The current implementation takes about 4.29 seconds; the one above takes just about 0.21 seconds.

My laptop specs:

OS: Ubuntu 21.04 x86_64 
Host: Latitude 5500 
Kernel: 5.11.0-49-generic 
CPU: Intel i7-8665U (8) @ 4.800GHz 
GPU: Intel WhiskeyLake-U GT2 [UHD Graphics 620] 
Memory: 8067MiB / 15813MiB 

@NickCrews @jstammers

jstammers commented 2 months ago

Hi @lmores, thanks for looking into this - that looks pretty promising.

I'm not able to replicate this using a test dataset of my own with ~130k records. I've made a slight change to your evaluation code to call .cache() and force the function to be evaluated on the entire table:

parse_start = time()
t = t.mutate(
  libpostal_address=postal_parse_address(_.full_address)
).cache()
print(t)
print(time() - parse_start)

When I do that I see a runtime of around 32s for the current version and 28s for your version.

I've also tried profiling the execution time of postal.parser.parse_address:

from postal.parser import parse_address
addresses = t.full_address.execute()
parse_start = time()
s = [parse_address(p) for p in addresses.to_list()]
print(time() - parse_start)

which gives me a time of ~21s. Given that we're calling this as a non-vectorized Python UDF that evaluates row by row, I don't expect we would do any better than this without some parallelization.
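
For what it's worth, here is a rough sketch of what that parallelization could look like outside of ibis, using a process pool. This is illustrative only: the helper names are made up, and each worker process has to initialize libpostal's data model, which is not free.

from concurrent.futures import ProcessPoolExecutor

def _parse_one(address: str):
    # Import inside the worker so libpostal initializes once per process.
    from postal.parser import parse_address
    return parse_address(address)

def parse_parallel(addresses: list[str], workers: int = 8) -> list:
    # Large chunks amortize the cost of shipping strings between processes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_parse_one, addresses, chunksize=10_000))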

lmores commented 2 months ago

Thanks for noting that I should call .cache() to actually execute the query on all rows; I had never used ibis before (those times looked suspicious...).

I will experiment a bit more and let you know if I manage to achieve some more gains.

NickCrews commented 2 months ago

Ideally I would like the dataset to be ~1M records, since duckdb parallelizes in chunks of ~122k rows. If we use at least this much data then we will actually see a difference if we can get parallelization to work. More than 1M records seems unneeded to check for parallelization, and I bet 1M is still gonna be really slow. It might even be so slow it's unusable, IDK, if 130k takes 30 seconds...

Here are 1M addresses from the Alaska Public Offices Commission; how about we try making your script reproducible with this? These are public records from campaign contributions: political campaigns are required to disclose who they are getting their money from. This is actually a parquet file, but github won't let me upload that, so I had to lie and change the suffix to .zip. Just change it back to .parquet and it should work. apoc_addresses_1M.zip
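
For instance, loading it could look like this (a minimal sketch, assuming you saved the file as apoc_addresses_1M.parquet):

import ibis

con = ibis.duckdb.connect()
inp = con.read_parquet("apoc_addresses_1M.parquet")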

Yes, you will need to call .cache() just before the benchmark to ensure all the data is loaded into the database, and then .cache() right after to force the result to actually persist. To ensure we don't get the overhead from copying the 1M source rows, we should only persist the output:

import time

inp = inp.cache()  # materialize the input so loading isn't part of the timing
start = time.time()
res = func(inp.oneline)
persisted = res.as_table().cache()  # force full evaluation of the UDF
end = time.time()

A good way to do this would be to try to assign blame to different parts of the UDF, by making several different versions that mock various parts out:
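
For example, something like this (a sketch only; the variant names and bodies are illustrative, not actual mismo code):

import ibis
from postal.parser import parse_address as _parse_address

@ibis.udf.scalar.python
def udf_noop(address_string: str) -> str:
    # Variant 1: measures UDF dispatch overhead only; no parsing, no reshaping.
    return address_string

@ibis.udf.scalar.python
def udf_parse_only(address_string: str) -> str:
    # Variant 2: pays for the libpostal call but skips building the result struct.
    _parse_address(address_string)
    return address_string

# Variant 3: the full postal_parse_address() UDF from above.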

The deltas between these should give us a sense of how much each part contributes to the timing. Or, more likely, the numbers won't appear to make any sense, and then we figure out what to do :)

> My laptop specs:

This is good: we have different machines to compare between. I have an M1 MacBook.

@jstammers if you run your benchmark on the above data, can you share your specs?

> I don't expect we would do any better than this without some parallelization

IDK how duckdb/other backends handle parallelization with UDFs. My impression is: not well.

I think my first hunch would be to pursue caching. At the python level this would be trivial using functools.cache(); or, more involved, we could even do it at the duckdb level and actually reduce the number of times we call the UDF.
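
Something like this, as a minimal sketch of the python-level version (the helper name is illustrative):

import functools

from postal.parser import parse_address as _parse_address

@functools.cache
def _cached_parse(address_string: str):
    # parse_address returns a list of (value, label) pairs; memoize per input string.
    # The cache grows without bound, so keep an eye on memory for big datasets.
    return _parse_address(address_string)

The UDF body would then call _cached_parse() instead of _parse_address().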

jstammers commented 2 months ago

On my personal computer with the following specs

OS: Manjaro Linux x86_64 
Kernel: 6.9.5-1-MANJARO 
CPU: AMD Ryzen 5 3600 (12) @ 3.600GHz 
GPU: NVIDIA GeForce GTX 1080 Ti 
Memory: 10827MiB / 32026MiB 

I can parse the addresses for that dataset in 89s.

On my work VM which I used for the previous test, with the following specs

OS: Ubuntu 22.04.4 LTS x86_64 
Host: Virtual Machine 7.0 
Kernel: 5.15.0-1061-azure 
CPU: Intel Xeon Platinum 8171M (8) @ 2.095GHz 
GPU: 00:08.0 Microsoft Corporation Hyper-V virtual VGA 
Memory: 11187MiB / 23012MiB 

I see a runtime of 153s.

I like the idea of caching the address parsing, particularly as there's likely to be a lot of redundancy in the addresses, although that would probably require some further digging into the inner workings of postal or libpostal.

functools.cache() will certainly be simpler to implement, but I'm not sure there'll be significant improvements if the addresses are mostly distinct.
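
One cheap way to check that assumption before building anything (a sketch against the t table from the earlier snippets):

# How much redundancy is there to exploit?
n_total = t.count().execute()
n_distinct = t.full_address.nunique().execute()
print(f"{n_distinct:_} distinct of {n_total:_} total ({n_distinct / n_total:.1%})")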