abcsys / libem

Compound AI toolchain for fast and accurate entity matching, powered by LLMs.
https://libem.org
Apache License 2.0

feat: prompt-level batching for match #34

Closed daiwaid closed 2 months ago

daiwaid commented 2 months ago

The normal (single-pair) matching code in libem/match.py and benchmark/util.py is left as is in anticipation of rework in PR #32. The TODOs in benchmark/util.py require changes from PR #33.

zenodflow commented 2 months ago

Nice! This is a key feature. @daiwaid Could you add an example in examples/optimize/batch.py? I will also use it to run some testing on your branch to help the review. Thanks!

In the example, perhaps do just 10 pairs to avoid burning tokens. Also, print out the stats so that we know about the latency etc.
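A minimal example in the spirit of the one requested here might look like the sketch below: match about 10 pairs in one batch and print latency stats. Note that `batch_match` and the pair format are illustrative placeholders, not libem's actual API.

```python
# Hypothetical sketch of a small batch-matching example (not libem's API):
# run ~10 pairs through a batched matcher and report latency stats.
import time

def batch_match(pairs):
    # Placeholder: a real implementation would pack all pairs into one
    # prompt, send a single LLM request, and parse one answer per pair.
    return [left == right for left, right in pairs]

# 10 toy pairs; odd-indexed ones are exact matches, even-indexed ones differ.
pairs = [(f"Product {i}", f"Product {i if i % 2 else i + 1}")
         for i in range(10)]

start = time.perf_counter()
results = batch_match(pairs)
latency = time.perf_counter() - start

print(f"Matched {len(pairs)} pairs in {latency:.3f}s; "
      f"{sum(results)} predicted matches")
```

Keeping the batch small keeps token spend negligible while still exercising the batched code path end to end.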

daiwaid commented 2 months ago

Added the example, and handled the case where the model outputs only a single answer for an entire batch.
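The single-answer edge case mentioned above can be sketched roughly as follows. This is a generic illustration, assuming a numbered-pair prompt format and a yes/no answer per line; the function names are hypothetical and do not reflect libem's implementation.

```python
# Hypothetical sketch of prompt-level batching with a fallback for the
# case where the model returns one answer for the whole batch.
# Names and prompt format are illustrative, not libem's actual API.

def build_batch_prompt(pairs):
    """Pack several candidate pairs into one prompt, numbering each
    so the model can answer per pair."""
    lines = ["For each numbered pair, answer 'yes' if the two entities "
             "refer to the same real-world entity, otherwise 'no'. "
             "Answer one line per pair as '<n>. yes/no'."]
    for i, (left, right) in enumerate(pairs, start=1):
        lines.append(f"{i}. Entity A: {left} | Entity B: {right}")
    return "\n".join(lines)

def parse_batch_answer(text, n_pairs):
    """Extract one yes/no per pair; if the model returned a single
    answer for the entire batch, broadcast it to every pair."""
    answers = []
    for line in text.strip().splitlines():
        token = line.split(".")[-1].strip().lower()
        if token in ("yes", "no"):
            answers.append(token == "yes")
    if len(answers) == 1 and n_pairs > 1:
        answers = answers * n_pairs  # single-answer fallback
    if len(answers) != n_pairs:
        raise ValueError("could not align model answers with pairs")
    return answers
```

Broadcasting the lone answer keeps the batch aligned instead of failing outright, at the cost of assuming the model meant its answer to apply uniformly.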

zenodflow commented 2 months ago

Please check #35 (note that #33 was merged earlier).

daiwaid commented 2 months ago

Merged new changes from #33, #35 and cleaned up commit history to align with main branch.

zenodflow commented 2 months ago

This should now be good to go.

Keeping a note on the performance improvements:

Benchmark: Matching done in 107.63s.
Benchmark: Precision     66.57
Benchmark: Recall    93.59
Benchmark: F1 score  77.8
Benchmark: Cost      $0.53134

This is on the amazon-google dataset; we see roughly 7x faster completion and 3% higher F1, at half the cost.