Closed DGolubets closed 1 month ago
This is a great addition!
Just one thing: do we really need to pass that boolean column to Rust? E.g. say we have our features here like [f1, f2, ...]. Maybe we can do
[f1.filter(condition), ... , fn.filter(condition)]
I have not tested this. In theory, all the conditions will run in parallel, and if used in lazy mode, the condition is only evaluated once. It's obviously a simpler approach, but putting it in Rust isn't bad either.
I am planning to run this experiment in the next few days and compare results. If you can test it before I do, feel free to post the results!
Thanks a lot!
I misunderstood your situation. Please disregard the test in my previous comment.
For your use case, the simplest "hack" is to try the following
import polars as pl
import polars_ds as pds

size = 20
df = pl.DataFrame({
"id": range(size),
}).with_columns(
pds.random().alias("var1"),
pds.random().alias("var2"),
pds.random().alias("var3"),
pds.random_int(0, 2).alias("can_select"),
pl.col("id").cast(pl.UInt32)
)
df.select(
"id",
"can_select",
pl.when(
pl.col("can_select") == 1
).then(
pds.query_knn_ptwise(
pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in n-d space
index = "id", # pl.col("id"), str | pl.Expr
k = 3,
dist = "l2", # squared l2
parallel = True
)
).otherwise(
None
)
.alias("best friends")
)
Notice that some indices are best friends of other indices even though they shouldn't be selected in the first place. (I believe this is the behavior you want?) The reason has to do with Polars' when/then implementation and I don't want to dive too much into the details.
Can you give this a try and see if this reduces memory?
I used a random dataset with M=300k, N=60k, 10 features to decide distance, and set k = 5000. The query finished in 5 min on my local machine with peak memory usage at ~9 GB. I would say this memory usage is very reasonable for this amount of data.
Here is my version info:
--------Version info---------
Polars: 0.20.23
Index type: UInt32
Platform: Linux-6.8.9-arch1-2-x86_64-with-glibc2.39
Python: 3.12.3 (main, Apr 23 2024, 09:16:07) [GCC 13.2.1 20240417]
Hi,
Neat trick with pl.when
I didn't think that would work.
The performance however is not great. This little test:
import polars as pl
import polars_ds as pds
from time import time
from polars.testing import assert_frame_equal
N = 100_000
K = 5000
print("Preparing test data..")
df = pl.DataFrame(
{
"id": [i for i in range(N)],
"can_select": [i % 6 == 0 for i in range(N)],
"lat": [50.1 + i % 10 / 100 for i in range(N)],
"lon": [0.4 + i % 10 / 100 for i in range(N)],
}
)
print("Running KNN with Rust filter..")
t0 = time()
knn_df1 = df.with_columns(
pds.query_knn_ptwise(
pl.col("lat"),
pl.col("lon"),
index="id",
skip_index=pl.col("can_select").not_(),
k=K,
dist="h",
parallel=True,
return_dist=True,
).alias("nearby")
)
t1 = time()
print(f"Finished in {t1-t0} s..")
print("Running KNN with when/otherwise..")
t0 = time()
knn_df2 = df.with_columns(
pl.when(pl.col("can_select"))
.then(
pds.query_knn_ptwise(
pl.col("lat"),
pl.col("lon"),
index="id",
k=K,
dist="h",
parallel=True,
return_dist=True,
)
)
.otherwise(None)
.alias("nearby")
)
t1 = time()
print(f"Finished in {t1-t0} s..")
print("Comparing..")
assert_frame_equal(knn_df1, knn_df2)
produces
Preparing test data..
Running KNN with Rust filter..
Finished in 7.47095513343811 s..
Running KNN with when/otherwise..
Finished in 44.36558175086975 s..
Comparing..
If I run them separately to check memory consumption (with /usr/bin/time --verbose), I get:
Maximum resident set size (kbytes): 2276540
and
Maximum resident set size (kbytes): 11973668
respectively.
With M=300000 the pl.when approach fails to complete on my machine (I have 32 GB RAM).
It appears to me that when/otherwise only performs post-filtering in this case: the plugin is called with the same inputs and has to do the full work.
That's right. My bad, I should have realized that earlier.
I will approve for now, and I will likely come back to this at a later time to make this generally available to all similar knn queries.
Thanks
I have spent some time rethinking KNN queries and wrote my own kd-tree implementation. You can see my implementation here: https://github.com/abstractqqq/arkadia
I am trying to integrate it into polars_ds. This might take weeks if not months, but I think it is worth it. I am expecting at least a 10% speed boost.
Nice!
I have a use case where I need to find K nearest neighbors for a subset of N rows among M total rows. My K and M are quite large: K = 5000, M = 300000, N = 50000 (just 1/6 of M). Running KNN on all M rows uses too much memory.
I suggest adding an optional boolean column to indicate whether rows should be excluded from the calculation. When the column is not there, there is just an additional Option check for every row. When it is there, I allocate builders with 0 capacity to conserve memory and push nulls for skipped rows.
A side change: the value capacity for pl_knn_ptwise_w_dist in the parallel branch was 8, which I changed to be in line with the other code branches.