facebookresearch / hiplot

HiPlot makes understanding high dimensional data easy
https://facebookresearch.github.io/hiplot/

Displaying a lot of rows in Streamlit / Wasteful use of bandwidth #158

Closed: F1nnM closed this issue 3 years ago

F1nnM commented 3 years ago

Streamlit currently has a hard limit of 50Mb for a single component. I have a dataset that is only 16Mb as a .csv, but since HiPlot transfers it as JSON, the data suddenly grows to over 200Mb. Transferring the data like that, with every datapoint containing all column names, seems a rather wasteful use of bandwidth.

Not only does this affect the Streamlit component, it also considerably slows down loading large datasets in the standalone HiPlot application (assuming the data is transferred the same way there).

My suggestion would be to transfer the column names as one array and the datapoints as arrays as well. On the client, they can then be matched up by their position in the array.
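For illustration, here is a minimal sketch of what such a columnar payload could look like (the "columns"/"rows" field names are just placeholders for this example, not HiPlot's actual wire format):

import json

# Row-oriented: every datapoint repeats all column names
rows = [{"lr": 0.1, "dropout": 0.2, "loss": 1.3},
        {"lr": 0.01, "dropout": 0.5, "loss": 0.9}]

# Column-oriented: names are sent once, values are matched by position on the client
columnar = {"columns": ["lr", "dropout", "loss"],
            "rows": [[0.1, 0.2, 1.3],
                     [0.01, 0.5, 0.9]]}

print(len(json.dumps(rows)), len(json.dumps(columnar)))  # the second payload is noticeably smaller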

danthe3rd commented 3 years ago

Transferring the data like that, with every datapoint containing all column names, seems a rather wasteful use of bandwidth.

Actually, if the data is compressed (gzip or otherwise), it should end up very close to the 16Mb of your dataset - which is the case for the hiplot server, but not for Streamlit. There is an additional problem with Streamlit: it sends the whole data back to the client on every refresh (i.e. there is no way to tell it that the data hasn't changed, other than checking string equality of the JSON representation).
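As a rough illustration of the compression point (purely synthetic data, not your actual dataset): the repeated column names in row-oriented JSON compress away almost entirely.

import gzip, json

# Synthetic row-oriented data: column names are repeated in every datapoint
rows = [{"uid": i, "lr": i * 1e-4, "dropout": 0.5, "loss": 1.0 / (i + 1)} for i in range(10_000)]
raw = json.dumps(rows).encode()

print("raw JSON:    ", len(raw), "bytes")
print("gzipped JSON:", len(gzip.compress(raw)), "bytes")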

Nevertheless, the data format HiPlot uses to transmit data could be made much more efficient - I'm hacking something together in https://github.com/facebookresearch/hiplot/pull/159. I'll let you know once it's ready for testing :)

danthe3rd commented 3 years ago

You should be able to test now with the rc version of the next release (pip install hiplot==0.1.23rc116). The feature is opt-in for now, but you can enable it with experiment._compress = True. (I might remove this property in the future and make everything compressed by default)
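For anyone following along, a minimal sketch of trying this out inside a Streamlit app (run with streamlit run; the DataFrame here is just placeholder data):

import hiplot as hip
import pandas as pd

df = pd.DataFrame({"lr": [0.1, 0.01], "dropout": [0.2, 0.5], "loss": [1.3, 0.9]})  # placeholder data
exp = hip.Experiment.from_dataframe(df)
exp._compress = True  # EXPERIMENTAL opt-in to the more compact wire format
exp.display_st()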

F1nnM commented 3 years ago

That was super quick... Impressive! Definitely going to try that now!

F1nnM commented 3 years ago

Seems to be working great! 👍 Thank you!

There is an additional problem with Streamlit - it will send the whole data back to the client at every refresh

That's just how Streamlit works, yeah. Can't do much about that; maybe one day.

However, something you might be able to add (I'm not sure if it applies here) is Streamlit caching. You can't cache the transmission itself, but you can cache the processing of the data before it's sent out. I don't know if you've already looked into it, but every function annotated with @st.cache is cached: if the parameters and all other variables used in the function don't change, Streamlit won't execute it again and instead returns the cached value. From my experience, hashing even big lists is surprisingly fast, so it might be worth it. Just an idea though.
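A generic sketch of that pattern (the function and data here are made up, just to show the caching semantics):

import streamlit as st

@st.cache(show_spinner=False)
def preprocess(rows):
    # Re-executed only when the hash of `rows` changes; otherwise the cached result is returned
    return [{**r, "score": r["a"] + r["b"]} for r in rows]

rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
processed = preprocess(rows)  # identical input on a later rerun skips the computation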

I use that to cache the creation of the experiment, so that's fine, but I think exp.display_st() still does quite a bit of processing internally? I can't cache that, because the method is also responsible for displaying the final result.

danthe3rd commented 3 years ago

display_st indeed does some stuff (mostly converting everything to JSON). I would love to be able to send None as the data if I could know that the dataset didn't change. If you generate your dataset using st.cache, I guess I can just compare the objects directly without even hashing. But I would need to store the previous dataset's id somewhere - I'll check tomorrow; maybe Streamlit's cache can be the solution to that :)
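Roughly the idea (a hypothetical sketch, not hiplot's actual code): remember the id of the last dataset object that was sent, and skip re-serializing when the same cached object comes back.

# Hypothetical sketch of the "send None if unchanged" idea
_last_dataset_id = None

def payload_for(dataset, serialize):
    # serialize() stands in for whatever converts the dataset to JSON
    global _last_dataset_id
    if _last_dataset_id == id(dataset):
        return None  # same cached object as last run: the client already has this data
    _last_dataset_id = id(dataset)
    return serialize(dataset)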

F1nnM commented 3 years ago

I've found a problem with the compress option. It seems to shuffle the order of my columns. No big problem in the parallel plot, as I can supply the order there. But I would also like to control the order of columns in the table.

danthe3rd commented 3 years ago

Hey :)

I've got a solution using streamlit's caching: https://facebookresearch.github.io/hiplot/tuto_streamlit.html#improving-performance-with-streamlit-caching-experimental (you'll need the latest RC: pip install hiplot==0.1.23rc118)

Caching a full hiplot.Experiment takes too much time (I measured ~5s; I assume Streamlit wants to watch for mutations of the cached copy), but I got a solution working.

F1nnM commented 3 years ago

In my code I was actually caching an entire experiment, but I allowed output mutations with @st.cache(allow_output_mutation=True). This tells Streamlit not to try to hash the output of the function, so it only has to hash the input parameters, which makes it a lot faster (I think, at least).

I basically had the code:

@st.cache(allow_output_mutation=True, show_spinner=False)
def generate_hiplot_experiment(data):
    exp = hip.Experiment.from_dataframe(data)
    # couple of settings
    return exp

data = ....
exp = generate_hiplot_experiment(data)
exp.display_st()

and this ran pretty fast.

generate_hiplot_experiment took 7-10 seconds for ~80k rows (with or without caching, no measurable difference) and only 200ms on subsequent runs with caching. display_st, however, always took 3-5 seconds to run. With that new option it's now cut down to 300ms on subsequent runs. That's a great improvement!

Thanks a lot!

danthe3rd commented 3 years ago

Glad to hear :) Closing the issue then

F1nnM commented 3 years ago

Well, except for the issue that compression shuffles the order of columns in the table. Unless that should be a new issue, or a feature request for being able to supply the column order to the table.

F1nnM commented 3 years ago

Oh, and I do need the return values of the brush extents. Is it impossible to combine this with the frozen copy?

danthe3rd commented 3 years ago

Oh, and I do need the return values of the brush extents. Is it impossible to combine this with the frozen copy?

Right. I'll try to get that next week - might change the API a bit.

F1nnM commented 3 years ago

Great, thank you!

danthe3rd commented 3 years ago

I updated the API to allow specifying return values. I've also updated the docs, but in short, you should now use to_streamlit instead of frozen_copy:

@st.cache
def get_experiment():
    big_exp = # <insert code to create your experiment>
    # EXPERIMENTAL: Reduces bandwidth at first load
    big_exp._compress = True
    # ... convert it to streamlit and cache that (`@st.cache` decorator)
    return big_exp.to_streamlit(key="hiplot", ret="brush_extents")

xp = get_experiment()  # This will be cached the second time
brush_extents = xp.display()
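
Just as an illustration, continuing the snippet above: while developing, st.write can dump the returned value so you can see what the brush selection looks like.

st.write(brush_extents)  # inspect the selection value returned by the frontend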

You can test it now with the RC version:

pip install hiplot==0.1.23rc121

F1nnM commented 3 years ago

Seems to be working fine! Thank you very much!