Closed lmores closed 4 months ago
This looks very promising, thanks for starting this @lmores .
I wonder if we can avoid the problem you've highlighted by explicitly returning them in order at the end
{
"street1": result["street1"],
"street2": result["street2"],
...
}
although this feels unnecessary as I would expect the fields to be inferred correctly from the schema. This could be an upstream bug with ibis - here's what I see with a toy example where I reverse the order of the fields in the address
import ibis
from ibis.expr import datatypes as dt
ibis.options.interactive = True
import pandas as pd
ADDRESS_SCHEMA = dt.Struct(
{
"street1": "string",
"street2": "string",
"city": "string",
"state": "string",
"postal_code": "string",
"country": "string",
}
)
@ibis.udf.scalar.python
def reversed_address() -> ADDRESS_SCHEMA:
return {
"country": "country",
"postal_code": "postal_code",
"state": "state",
"city": "city",
"street2": "street2",
"street1": "street1",
}
@ibis.udf.scalar.python
def address() -> ADDRESS_SCHEMA:
return {
"street1": "street1",
"street2": "street2",
"city": "city",
"state": "state",
"postal_code": "postal_code",
"country": "country",
}
table = ibis.memtable(pd.DataFrame({"ix":[1,2,3,4]}))
table.mutate(address=address(), rev_address=reversed_address())
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ix ┃ address ┃ rev_address ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64 │ struct<street1: string, street2: string, city: string, state: string, postal_co… │ struct<street1: string, street2: string, city: string, state: string, postal_co… │
├───────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ {'street1': 'street1', 'street2': 'street2', ... +4} │ {'street1': 'country', 'street2': 'postal_code', ... +4} │
│ 2 │ {'street1': 'street1', 'street2': 'street2', ... +4} │ {'street1': 'country', 'street2': 'postal_code', ... +4} │
│ 3 │ {'street1': 'street1', 'street2': 'street2', ... +4} │ {'street1': 'country', 'street2': 'postal_code', ... +4} │
│ 4 │ {'street1': 'street1', 'street2': 'street2', ... +4} │ {'street1': 'country', 'street2': 'postal_code', ... +4} │
└───────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┘
Maybe for now, the solution is to explicitly enforce the order of the fields in the returned object.
Another edge-case to test would be one in which multiple values for the same field are returned. Here's an example for an address that looks correct, but because the city
and county
are swapped, the parsed result contains multiple city
components
parse_address("123 Elm Street, Los Angeles, Burbank, CA 91501")
[('123', 'house_number'),
('elm street', 'road'),
('los', 'city'),
('angeles', 'suburb'),
('burbank', 'city'),
('ca', 'state'),
('91501', 'postcode')]
I expect these are more likely to be invalid parses than correct ones, but worth noting.
Thanks for this @lmores , this looks awesome!
That is a bug with ibis, I filed an issue at https://github.com/ibis-project/ibis/issues/9613. If possible, can we keep this PR/discussion just related to perf/benchmarking as much as possible?
Here is my interpretation of the benchmarks:
so what this says to me is the time of this UDF can be accounted as the sum of
What do you think of this interpretation?
If this is correct, then trying to optimize the python is going to be fruitless, and we should just go with the most idiomatic version we can.
Even if we don't change our implementation, I still think this benchmarking harness code is useful and we should add it
@NickCrews: I agree with your interpretation of the results, but not on the conclusions XD; and here is why: all functions named _parse_fn__*
have been obtained by modifying my implementation of the UDF, not the current one! (my bad, should have mentioned it).
Timings must be read as follows:
$ python -m benchmarks.parse_address
_parse_fn__noop took 14.6711 seconds
_parse_fn__python_only took 15.4183 seconds # my implementation, not calling pypostal
_parse_fn__postal_only took 87.2642 seconds
_parse_fn__complete took 88.1358 seconds # my implementation
postal_parse_address took 107.9780 seconds # current implementation
Since current implementation (postal_parse_address
) takes 107
seconds and my implementation (_parse_fn__complete
) takes 88
seconds, we would gain ~20
seconds switching to the new one (and your analysis proves that we cannot improve anymore with plain python as _parse_fn__complete - _parse_fn__postal_only ~= .9s
).
Does it make sense?
Ahh, thank you, yes that makes sense. Let's add some tests, fixup your implementation, and get it merged! Exciting!
FYI I just pushed https://github.com/NickCrews/mismo/commit/ac10b6c60f0e8069f882f3cdfbe0d3f59023c784 that adjusts the current implementation a bit, but shouldn't affect runtime
I am looking back at my old notes from https://github.com/NickCrews/mismo/issues/19, and I think I want to take this opportunity to actually use pytest-benchmark for this. So I'm tweaking this PR right now locally to use this framework, so hold off on any edits you're making. Thanks!
lol, actually we already have pytest-benchark set up. grep for benchmark
and you can find examples. Still, I'm migrating this PR to use that.
The failing tests are unrelated, I will fix those up separately.
@lmores this is ready to merge whenever! You need to mark this PR as not WIP for me to be able to merge it.
nevermind I was able to remove the WIP.
If you want, I'd love to see you actually change the implementation to the faster version in a followup PR!
Not ready for merge, just opening to share the code with @NickCrews and @jstammers following discussion in #47 (is there a better way?).
To run the benchmarks just execute
python -m benchmarks.parse_address
. This will print the timings on the screen and store the DuckDB files insidebenchmarks/db/<TIMESTAMP>
.Here are my results:
My laptop specs:
Note:
_parse_fn__complete
has (at least) one bug. When theresult
dictionary is converted to an ibis/DuckDB data structure, keys and values are shuffled (maybe the code that makes the conversion relies on the dictionary entries to have the same order as the fields declared inADDRESS_SCHEMA
?). Here are the first rows of the table I get: