abstractqqq / polars_ds_extension

Polars extension for general data science use cases

Skip some rows in KNN #157

Closed: DGolubets closed this 1 month ago

DGolubets commented 1 month ago

I have a use case where I need to find K nearest neighbors for a subset of N rows among M total rows. My K and M are quite large: K = 5000, M = 300000, N = 50000 (just 1/6 of M). Running KNN on all M rows uses too much memory.

I suggest adding an optional boolean column that indicates whether a row should be excluded from the calculation. When the column is not there, there is just an additional Option check for every row. When it is there, I allocate the builders with 0 capacity to conserve memory and push nulls for the skipped rows.
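For illustration, a minimal sketch of what this could look like from the Python side (the skip_index parameter name is the one used in the test further down this thread): rows flagged for skipping get a null neighbor list, but they can still appear in other rows' neighbor lists.

import polars as pl
import polars_ds as pds

df = pl.DataFrame({
    "id": pl.Series(range(5), dtype=pl.UInt32),
    "x": [0.0, 0.1, 0.2, 0.3, 0.4],
    "skip": [False, True, False, True, False],
})

df.with_columns(
    pds.query_knn_ptwise(
        pl.col("x"),
        index="id",
        skip_index=pl.col("skip"),  # True = exclude this row as a query point
        k=2,
        dist="l2",
        parallel=True,
    ).alias("nearest")
)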

A side change: the value capacity in the parallel branch of pl_knn_ptwise_w_dist was 8; I changed it to be in line with the other code branches.

abstractqqq commented 1 month ago

This is a great addition!

Just one thing: do we really need to pass that boolean column to Rust? E.g. say we have our features here like [f1, f2, ...]. Maybe we can do

[f1.filter(condition), ... , fn.filter(condition)] 

I have not tested this. In theory, all the filter conditions will run in parallel, and if used in lazy mode, the condition is only evaluated once. It's obviously a simpler approach, but putting it in Rust isn't bad either.
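For concreteness, a rough sketch of that idea (untested, with placeholder data and column names): filter each feature column and the index with the same condition before calling the query, so the plugin only ever sees the eligible rows.

import polars as pl
import polars_ds as pds

df = pl.DataFrame({
    "id": pl.Series(range(6), dtype=pl.UInt32),
    "f1": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "f2": [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    "can_select": [1, 0, 1, 0, 1, 0],
})

cond = pl.col("can_select") == 1

# Note: filtering also removes the excluded rows from the neighbor pool,
# which (as the next comment concedes) is not what this issue needs
df.select(
    pds.query_knn_ptwise(
        pl.col("f1").filter(cond),
        pl.col("f2").filter(cond),
        index=pl.col("id").filter(cond),
        k=2,
        dist="l2",
    ).alias("best friends")
)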

I am planning to run this experiment in the next few days and compare results. If you can test it before I do, feel free to post the results!

Thanks a lot!

abstractqqq commented 1 month ago

I misunderstood your situation. Please disregard the test in my previous comment.

For your use case, the simplest "hack" is to try the following:

import polars as pl
import polars_ds as pds

size = 20
df = pl.DataFrame({
    "id": range(size),
}).with_columns(
    pds.random().alias("var1"),
    pds.random().alias("var2"),
    pds.random().alias("var3"),
    pds.random_int(0, 2).alias("can_select"),
    pl.col("id").cast(pl.UInt32)  # match the UInt32 index type
)

df.select(
    "id",
    "can_select",
    pl.when(
        pl.col("can_select") == 1
    ).then(
        pds.query_knn_ptwise(
            pl.col("var1"), pl.col("var2"), pl.col("var3"),  # columns used as the coordinates in n-d space
            index="id",  # str | pl.Expr, e.g. pl.col("id")
            k=3,
            dist="l2",  # squared l2
            parallel=True
        )
    ).otherwise(
        None
    ).alias("best friends")
)

Notice that some indices show up as best friends of other indices even though they shouldn't be selected in the first place. (I believe this is the behavior you want?) The reason has to do with Polars' when/then implementation, and I won't dive too much into the details.

Can you give this a try and see if this reduces memory?

I used a random dataset with M = 300k, N = 60k, 10 features for the distance, and set k = 5k. The query finished in 5 minutes on my local machine with peak memory usage at ~9 GB. I would say this memory usage is very reasonable for this amount of data.
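For reference, a rough reconstruction of that benchmark (an assumed script, not the exact one used; feature names and the selection probability are made up):

import polars as pl
import polars_ds as pds

M, K = 300_000, 5_000

df = pl.DataFrame({"id": pl.Series(range(M), dtype=pl.UInt32)}).with_columns(
    *[pds.random().alias(f"f{i}") for i in range(10)],  # 10 random features
    (pds.random() < 0.2).alias("can_select"),  # roughly N = 60k selectable rows
)

result = df.select(
    "id",
    pl.when(pl.col("can_select"))
    .then(
        pds.query_knn_ptwise(
            *[pl.col(f"f{i}") for i in range(10)],
            index="id",
            k=K,
            dist="l2",
            parallel=True,
        )
    )
    .otherwise(None)
    .alias("knn"),
)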

Here is my version info:

--------Version info---------
Polars: 0.20.23
Index type: UInt32
Platform: Linux-6.8.9-arch1-2-x86_64-with-glibc2.39
Python: 3.12.3 (main, Apr 23 2024, 09:16:07) [GCC 13.2.1 20240417]

DGolubets commented 1 month ago

Hi, neat trick with pl.when, I didn't think that would work.

The performance, however, is not great. This little test:

import polars as pl
import polars_ds as pds
from time import time
from polars.testing import assert_frame_equal

N = 100_000
K = 5000

print("Preparing test data..")
df = pl.DataFrame(
    {
        "id": [i for i in range(N)],
        "can_select": [i % 6 == 0 for i in range(N)],
        "lat": [50.1 + i % 10 / 100 for i in range(N)],
        "lon": [0.4 + i % 10 / 100 for i in range(N)],
    }
)

print("Running KNN with Rust filter..")
t0 = time()
knn_df1 = df.with_columns(
    pds.query_knn_ptwise(
        pl.col("lat"),
        pl.col("lon"),
        index="id",
        skip_index=pl.col("can_select").not_(),  # exclude these rows as query points
        k=K,
        dist="h",  # haversine
        parallel=True,
        return_dist=True,
    ).alias("nearby")
)
t1 = time()
print(f"Finished in {t1-t0} s..")

print("Running KNN with when/otherwise..")
t0 = time()
knn_df2 = df.with_columns(
    pl.when(pl.col("can_select"))
    .then(
        pds.query_knn_ptwise(
            pl.col("lat"),
            pl.col("lon"),
            index="id",
            k=K,
            dist="h",
            parallel=True,
            return_dist=True,
        )
    )
    .otherwise(None)
    .alias("nearby")
)
t1 = time()
print(f"Finished in {t1-t0} s..")

print("Comparing..")
assert_frame_equal(knn_df1, knn_df2)

produces

Preparing test data..
Running KNN with Rust filter..
Finished in 7.47095513343811 s..
Running KNN with when/otherwise..
Finished in 44.36558175086975 s..
Comparing..

If I run them separately to check memory consumption (with /usr/bin/time --verbose), I get:

Maximum resident set size (kbytes): 2276540

and

Maximum resident set size (kbytes): 11973668

respectively.

With M=300000 the pl.when approach fails to complete on my machine (I have 32 GB of RAM).

DGolubets commented 1 month ago

It appears to me that when/otherwise only does post-filtering in this case: the plugin is called with the same inputs and has to do the full work.
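One way to check this (a small illustrative probe, not from the original discussion): attach map_batches to the then branch and look at the length of the series it receives. It sees all rows, not just the ones where the condition holds, so the result is computed in full and only masked afterwards.

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3, 4], "keep": [True, False, True, False]})

def probe(s: pl.Series) -> pl.Series:
    # Prints 4, not 2: the branch is evaluated on the full column
    print(f"then() evaluated on {s.len()} rows")
    return s * 10

df.select(
    pl.when(pl.col("keep"))
    .then(pl.col("x").map_batches(probe))
    .otherwise(None)
    .alias("masked")
)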

abstractqqq commented 1 month ago

That's right. My bad, I should have realized that earlier.

I will approve for now, and I will likely come back to this later to make this generally available to all similar KNN queries.

Thanks

abstractqqq commented 2 weeks ago

I have spent some time rethinking KNN queries and wrote my own kd-tree implementation. You can see it here: https://github.com/abstractqqq/arkadia

I am trying to integrate it into polars_ds. This might take weeks, if not months, but I think it is worth it. I am expecting at least a 10% speed boost.

DGolubets commented 2 weeks ago

> I have spent some time rethinking KNN queries and wrote my own kd-tree implementation. You can see it here: https://github.com/abstractqqq/arkadia
>
> I am trying to integrate it into polars_ds. This might take weeks, if not months, but I think it is worth it. I am expecting at least a 10% speed boost.

Nice!