lmores opened 4 months ago
Hi @lmores, thanks for looking into this - that looks pretty promising.
I'm not able to replicate this using a test dataset of my own with ~130k records.
I've made a slight change to your evaluation code to call .cache() and force the function to be evaluated on the entire table:
from time import time
from ibis import _

parse_start = time()
t = t.mutate(
    libpostal_address=postal_parse_address(_.full_address)
).cache()  # .cache() forces the query to actually run over the whole table
print(t)
print(time() - parse_start)
When I do that I see a runtime of around 32s for the current version and 28s for your version.
I've also tried profiling the execution time of postal.parser.parse_address:
from time import time
from postal.parser import parse_address

addresses = t.full_address.execute()

parse_start = time()
s = [parse_address(p) for p in addresses.to_list()]  # plain Python loop, no ibis/duckdb involved
print(time() - parse_start)
which gives me a time of ~21s. Given that we're calling this as a non-vectorized Python UDF that is evaluated row by row, I don't expect we would do any better than this without some parallelization.
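For reference, a rough sketch (not from the thread; the example addresses are made up) of what parallelizing that pure-Python loop with a process pool could look like. Each worker has to load the libpostal data on first use, so this only pays off on large inputs:

from concurrent.futures import ProcessPoolExecutor
from time import time

from postal.parser import parse_address


def parse_one(address: str):
    # thin wrapper so the work submitted to the pool is picklable
    return parse_address(address)


if __name__ == "__main__":
    # stand-in data; in the benchmark this would be t.full_address.execute().to_list()
    addresses = ["123 Main St, Anchorage, AK 99501"] * 100_000

    parse_start = time()
    with ProcessPoolExecutor() as pool:
        parsed = list(pool.map(parse_one, addresses, chunksize=1_000))
    print(time() - parse_start)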
Thanks for noting that I should call .cache() to actually execute the query on all rows; I have never used ibis before (those times looked suspicious...).
I will experiment a bit more and let you know if I manage to achieve some more gains.
Ideally I would like the dataset to be ~1M records, since duckdb parallelizes in chunks of ~122k rows. If we use at least that much data, then we will actually see a difference if we can get parallelization to work. More than 1M records seems unneeded to check for parallelization, and I bet 1M is still gonna be really slow. It might even be so slow it's unusable, IDK, if 130k already takes 30 seconds...
Here are 1M addresses from the Alaska Public Offices Commission; how about we try making your script reproducible with this? These are public records from campaign contributions: political campaigns are required to disclose who they are getting their money from. This is actually a parquet file, but GitHub won't let me upload that, so I had to lie and change the suffix to .zip. Just change it back to .parquet and it should work. apoc_addresses_1M.zip
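Assuming the renamed file sits in the working directory, loading it with the default DuckDB-backed ibis session should be roughly:

import ibis

# read the renamed file with the default (duckdb) backend
inp = ibis.read_parquet("apoc_addresses_1M.parquet")
print(inp.count().execute())  # should print roughly 1_000_000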
Yes, you will need to call .cache() just before the benchmark to ensure all the data is loaded into the database, and then .cache() right afterwards to test persisting the result. To ensure we don't get the overhead from copying the 1M source rows, we should only persist the output:
import time

inp = inp.cache()  # load the source rows into the database before timing anything

start = time.time()
res = func(inp.oneline)  # func is whichever version of the UDF we're benchmarking
persisted = res.as_table().cache()  # only the UDF output gets persisted
end = time.time()
print(end - start)
A good way to do this would be to try to assign blame to different parts of the UDF, by making several different versions that mock various parts out:
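For example (just a sketch using ibis's plain-Python scalar UDF decorator; the names are made up), the variants could range from a no-op up to a full parse:

import ibis
from postal.parser import parse_address


# 1. No-op: measures pure UDF/row-transfer overhead
@ibis.udf.scalar.python
def addr_noop(address: str) -> str:
    return address


# 2. Parse but throw the result away: adds libpostal's cost on top of 1
@ibis.udf.scalar.python
def addr_parse_only(address: str) -> str:
    parse_address(address)
    return address


# 3. The real postal_parse_address, whatever it currently does end to end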
The deltas between these should give us a sense of how much each part is contributing to timing. Or more likely, it doesn't appear to make any sense, and then we figure out what to do :)
> My laptop specs:
This is good, we have different machines to compare between; I have an M1 MacBook.
@jstammers if you run your benchmark on the above data, can you share your specs?
> I don't expect we would do any better than this without some parallelization
IDK how duckdb/other backends handle parallelization with UDFs. My impression is: not well.
I think my first hunch would be to pursue caching. At the Python level this would be trivial using functools.cache(); or, more complicated, we could even do it at the duckdb level and actually reduce the number of times we call the UDF.
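A minimal sketch of the functools.cache idea, assuming the UDF ends up calling postal.parser.parse_address on a plain string:

import functools

from postal.parser import parse_address


@functools.cache
def cached_parse_address(address: str):
    # repeated address strings skip libpostal entirely after the first call
    return parse_address(address)

Memory grows with the number of distinct addresses, so functools.lru_cache(maxsize=...) may be the safer variant.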
On my personal computer with the following specs:
OS: Manjaro Linux x86_64
Kernel: 6.9.5-1-MANJARO
CPU: AMD Ryzen 5 3600 (12) @ 3.600GHz
GPU: NVIDIA GeForce GTX 1080 Ti
Memory: 10827MiB / 32026MiB
I can parse the addresses for that dataset in 89s.
On my work VM, which I used for the previous test, with the following specs:
OS: Ubuntu 22.04.4 LTS x86_64
Host: Virtual Machine 7.0
Kernel: 5.15.0-1061-azure
CPU: Intel Xeon Platinum 8171M (8) @ 2.095GHz
GPU: 00:08.0 Microsoft Corporation Hyper-V virtual VGA
Memory: 11187MiB / 23012MiB
I see a runtime of 153s.
I like the idea of caching the address parsing, particularly as there's likely to be a lot of redundancy in the addresses, although that would probably require some further digging into the inner workings of postal or libpostal.
functools.cache() will certainly be simpler to implement, but I'm not sure there'll be significant improvements if the addresses are mostly distinct.
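A quick way to check how much redundancy there actually is (assuming the column is called full_address, as in the earlier snippets):

n_total = t.count().execute()
n_distinct = t.full_address.nunique().execute()
print(f"{n_distinct:_} distinct out of {n_total:_} ({n_distinct / n_total:.1%})")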
Made a very rough benchmark on a dataset with 2_192_071 Italian street addresses (that unfortunately I am not allowed to share). I replaced the current implementation of postal_parse_address() with a different one.
I am starting from an ibis table t backed by duckdb with the following columns:
['record_id', 'administrative_area', 'locality', 'postal_code', 'address_lines', 'recipients', 'region_code', 'language_code', 'sorting_code', 'sublocality', 'organization']
I add a new column containing the whole address:
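Something along these lines builds that column; the components used and the ", " separator here are guesses, not the original snippet:

from ibis import _

# concatenate a few of the schema's string columns into one address string
# (nulls would propagate through +; the real code may handle them differently)
t = t.mutate(
    full_address=_.address_lines + ", " + _.postal_code + ", " + _.locality
)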
and then run the following snippet for both versions of postal_parse_address():
The current implementation takes 4.291484832763672 seconds, the new one just 0.206512451171875 seconds.
My laptop specs:
@NickCrews @jstammers: