Closed flekschas closed 1 year ago
I like the idea. However, this still assumes that we want only subsets of the data. What if, for example, I was generating batches of points and wanted to add them to an existing plot as the computation runs? In addition, what are the selection semantics when we have filtered the data? Does `selection=[0]` correspond to the original data or the new filtered subset? We then need to keep track of indices. What if you filter again? Is that a filter on the existing filter or on the original data?
I think a more general API would be to allow the dataframe to be "reactive" like the encodings, keeping all other state and updating the plot on reassignment:
```python
scatter = Scatter(data=df, x='x', y='y')
display(scatter.show())

for new_data in random_data_generator():
    time.sleep(1)
    scatter.data(new_data)  # clears selection (if there was one), keeps all existing encodings
```
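To make the proposed semantics concrete, here's a minimal pure-Python sketch of that behavior. The class and attribute names are illustrative stand-ins, not jupyter-scatter's actual API; the point is just that reassigning the data resets the selection while the encodings survive:

```python
class ScatterModel:
    """Toy stand-in for the proposed reactive-data behavior (hypothetical)."""

    def __init__(self, data, **encodings):
        self._data = data
        self.encodings = dict(encodings)  # e.g. {'x': 'x', 'y': 'y'}
        self.selection = []

    def data(self, new_data):
        # Reassigning the data clears any selection but keeps all encodings
        self._data = new_data
        self.selection = []
```

A `ScatterModel([1, 2, 3], x='x', y='y')` with `selection = [0, 2]` would, after `data([4, 5, 6])`, have an empty selection while `encodings` is still `{'x': 'x', 'y': 'y'}`.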
This way you could just filter your dataframe by whatever means make sense. For example, two scatter plots where the second just displays the selection of the first:
```python
s1 = Scatter(data=df, x='x', y='y')
s2 = Scatter(data=df, x='x', y='y')

def on_selection_change(change):
    subset = df.iloc[change.new]
    s2.data(subset)

s1.widget.observe(on_selection_change, names='selection')

ipywidgets.HBox([s1.show(), s2.show()])
```
I think we're talking about two different ideas: filtering and updating the data. The difference is that filtering is extremely cheap because it only ever results in rendering a subset of the data that was already uploaded to the GPU. Changing the data is more expensive: since there is no guarantee that any of the existing data can be reused, one has to flush the existing data and upload the new data to the GPU on every data change.
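As a rough illustration of that cost difference, here's a hypothetical toy model (not regl-scatterplot's actual internals) where "uploading" stands in for the expensive GPU transfer. Filtering only swaps the list of indices to draw, while changing the data forces a re-upload:

```python
class FakeScatter:
    """Toy cost model for filter vs. data updates (hypothetical names)."""

    def __init__(self, points):
        self.uploads = 0  # count of simulated GPU uploads
        self._upload(points)
        self.visible = list(range(len(points)))

    def _upload(self, points):
        self.points = list(points)
        self.uploads += 1

    def filter(self, indices):
        # Cheap: only the index list changes; nothing is re-uploaded
        self.visible = list(indices)

    def data(self, points):
        # Expensive: the old data is flushed and everything is re-uploaded
        self._upload(points)
        self.visible = list(range(len(points)))
```

Calling `filter()` any number of times leaves `uploads` unchanged, whereas every `data()` call increments it.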
I think there are valid use cases for both. However, in your example, changing the data is unnecessarily expensive, as you could simply render a subset.
Implementing `scatter.data()` should be fairly simple. What's trickier is `scatter.filter()`, as it'll require changes to regl-scatterplot.
> What are the selection semantics when we have filtered the data? Does `selection=[0]` correspond to the original data or the new filtered subset? We then need to keep track of indices. What if you filter again? Is that a filter on the existing filter or on the original data?
The semantics are fairly simple: your filter always operates on the bound data and only ever affects the rendering (not the underlying data itself). E.g., `scatter.filter([0, 1, 2])` will cause the plot to render only the first three points. If you then call `scatter.filter([3, 4, 5])`, the plot will render the fourth to sixth points (of your bound dataframe). The selection semantics remain the same: the point with index `10` is always going to reference the same point because the data itself isn't filtered.
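Those semantics can be captured in a small hypothetical model (illustrative names, not jupyter-scatter's internals): each `filter` call indexes the originally bound data and replaces any previous filter, so point indices stay stable:

```python
class FilterModel:
    """Toy model of the filter semantics described above (hypothetical)."""

    def __init__(self, n_points):
        self.n = n_points
        self.filtered = None  # None means all points are rendered

    def filter(self, indices):
        # Always interpreted against the original data, never the prior filter
        self.filtered = list(indices)

    def rendered(self):
        return list(range(self.n)) if self.filtered is None else self.filtered
```

Calling `filter([0, 1, 2])` and then `filter([3, 4, 5])` leaves `[3, 4, 5]` rendered: the second call replaces the first rather than composing with it.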
Ok, that makes a lot of sense to me (and I see the motivation for `filter`!). I think both are valuable and handle two separate ideas, as you noted. There are certainly performance benefits to filtering.
Here's a demo of visually filtering out points. It's super snappy as the only thing that changes is the buffer that indexes into the texture object :)
https://user-images.githubusercontent.com/932103/222612287-1dbce5e1-d0a2-4cb3-8076-a2d7effa0bb5.mp4
I imagine the `data()` function will be fairly simple. I think all we need is a single argument:

```python
scatter.data(df)
```
Closing this as the two functions have been added with #63 and released in v0.11.0.
As pointed out by @manzt, it'd be nice to allow filtering out points. Currently, this requires re-initializing the scatter plot instance. However, re-initialization resets the current camera position etc., which is annoying.
I am proposing a new method for dynamic filtering:
@manzt What do you think of this?