Closed — F1nnM closed this issue 3 years ago
> Transferring the data like that, with every datapoint containing all column names, seems a rather wasteful use of bandwidth.
Actually, if the data is compressed (gzip or other), the transferred size should be very close to the 16 MB of your dataset once compressed. That is the case for the HiPlot server, but not for Streamlit. There is an additional problem with Streamlit: it sends the whole data back to the client on every refresh (i.e. there is no way to tell it the data is unchanged, short of checking string equality of the JSON representation).
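To get an intuition for why gzip gets so close to the original size: the repeated column names are exactly the kind of redundancy gzip removes well. A quick self-contained check with synthetic data (the column names and row count here are made up, and exact numbers will vary):

```python
import gzip
import json

# Synthetic row-oriented records: every datapoint repeats all column names,
# mimicking the JSON format discussed in this issue.
rows = [{"learning_rate": 0.01 * i, "dropout": 0.5, "optimizer": "adam"}
        for i in range(10_000)]
raw = json.dumps(rows).encode("utf-8")
packed = gzip.compress(raw)

# The repeated keys compress away almost entirely, so the gzipped size
# is far closer to the size of the values alone.
print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
```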
Nevertheless, the data format HiPlot uses to transmit data could be made much more efficient. I'm hacking something together in https://github.com/facebookresearch/hiplot/pull/159. I'll let you know once it's ready for testing :)
You should be able to test now with the RC version of the next release (`pip install hiplot==0.1.23rc116`).
The feature is opt-in for now, but you can enable it with `experiment._compress = True`.
(I might remove this property in the future and make everything compressed by default)
That was super quick... Impressive! Definitely going to try that now!
Seems to be working great! 👍 Thank you!
> There is an additional problem with Streamlit - it will send the whole data back to the client at every refresh
That's just how Streamlit works, yeah. Can't do much about that, maybe one day.
However, something you might be able to add (I'm not sure if it works here) is Streamlit caching. You can't cache the transmission, but you can cache the processing of the data before it's sent out.
I don't know if you've already looked into that, but every function annotated with `@st.cache` will be cached. Meaning that if the parameters and all the other variables used in that function don't change, Streamlit will not execute the function again, but return the cached value instead. From my experience the hashing of even big lists is surprisingly fast, so it might be worth it.
Just an idea though.
I use that to cache the creation of the experiment, so that's fine, but I think `exp.display_st()` still does quite some processing internally? Which I can't cache, because that method is also responsible for displaying the final result.
`display_st` indeed does some stuff (mostly converting everything to JSON).
I would love to be able to send `None` as the data if I could know that the dataset didn't change. If you generate your dataset using `st.cache`, I guess I can just compare the objects directly without even hashing. But I would need to store the previous dataset id somewhere - I'll check tomorrow, maybe Streamlit's cache can be the solution to that :)
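The identity comparison described above could look roughly like this. This is a hypothetical sketch, not HiPlot's actual implementation; `maybe_payload` and the module-level id cache are made up for illustration:

```python
# Hypothetical sketch: skip retransmission when the dataset returned by an
# @st.cache'd function is the exact same object as on the previous run.
_last_dataset_id = None

def maybe_payload(data):
    """Return the data to send, or None if the client already has it."""
    global _last_dataset_id
    if id(data) == _last_dataset_id:
        return None  # same cached object, nothing new to transmit
    _last_dataset_id = id(data)
    return data
```

Since `st.cache` returns the same object on a cache hit, comparing `id()` would be enough; a cache miss produces a fresh object with a new id, which gets sent in full.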
I've found a problem with the compress option: it seems to shuffle the order of my columns. That's no big problem in the parallel plot, since I can supply the order there, but I would also like to control the order of columns in the table.
Hey :)
I've got a solution using Streamlit's caching:
https://facebookresearch.github.io/hiplot/tuto_streamlit.html#improving-performance-with-streamlit-caching-experimental
(you'll need the latest RC: `pip install hiplot==0.1.23rc118`)
Caching a full `hiplot.Experiment` takes too much time (I measured ~5s, I assume because Streamlit wants to watch for mutations of the cached copy), however I got a solution working.
In my code I was actually caching an entire experiment, but I allowed output mutations with `@st.cache(allow_output_mutation=True)`. This tells Streamlit not to hash the output of the function, so it only has to hash the input parameters, which makes it a lot faster (I think, at least).
I basically had this code:

```python
@st.cache(allow_output_mutation=True, show_spinner=False)
def generate_hiplot_experiment(data):
    exp = hip.Experiment.from_dataframe(data)
    # couple of settings
    return exp

data = ....
exp = generate_hiplot_experiment(data)
exp.display_st()
```

and this ran pretty fast.
`generate_hiplot_experiment` took 7-10 seconds for 80k rows on the first run (with caching or without, no measurable difference) and only 200ms on subsequent runs with caching. `display_st`, however, always took 3-5 seconds to run. With that new option it's now cut down to 300ms on subsequent runs. That's a great improvement!
Thanks a lot!
Glad to hear :) Closing the issue then
Well, except for the issue that compression shuffles the order of columns in the table. Unless that would be a new issue. Or a feature request to be able to supply the order to the table.
Oh and I do need the return values of the brush extends. Is it impossible to combine this with the frozen copy?
> Oh and I do need the return values of the brush extends. Is it impossible to combine this with the frozen copy?
Right. I'll try to get that next week - might change the API a bit.
Great, thank you!
I updated the API to allow specifying return values. I updated the docs, but in short you should now use `to_streamlit` instead of `frozen_copy`:
```python
@st.cache
def get_experiment():
    big_exp = # <insert code to create your experiment>
    # EXPERIMENTAL: Reduces bandwidth at first load
    big_exp._compress = True
    # ... convert it to streamlit and cache that (`@st.cache` decorator)
    return big_exp.to_streamlit(key="hiplot", ret="brush_extents")

xp = get_experiment()  # This will be cached the second time
brush_extents = xp.display()
```
You can test it now with the RC version: `pip install hiplot==0.1.23rc121`
Seems to be working fine! Thank you very much!
Streamlit currently has a hard limit of 50 MB for a single component. I have a dataset that is only 16 MB as a .csv, but since HiPlot transfers it as JSON, the data suddenly exceeds 200 MB. Transferring the data like that, with every datapoint containing all column names, seems a rather wasteful use of bandwidth.
Not only does this affect the Streamlit component, it also considerably slows down loading large datasets in the standalone HiPlot application (assuming the data is transferred the same way there).
My suggestion would be to transmit the column names once as an array, and the datapoints as arrays as well. On the client they can then be matched to the columns by position in the array.
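The suggested row-to-columnar conversion could be sketched as follows. This is a hypothetical illustration of the idea, not HiPlot's actual wire format; `to_columnar` and the sample rows are made up:

```python
import json

def to_columnar(rows):
    """Convert row-oriented records to a compact columnar payload.
    Column names are sent once; each datapoint becomes a plain array
    matched to the columns by position on the client."""
    columns = list(rows[0].keys()) if rows else []
    return {"columns": columns,
            "data": [[row[c] for c in columns] for row in rows]}

rows = [{"lr": 0.1, "loss": 0.9}, {"lr": 0.2, "loss": 0.7}]
payload = to_columnar(rows)
print(json.dumps(payload))
# {"columns": ["lr", "loss"], "data": [[0.1, 0.9], [0.2, 0.7]]}
```

The serialized size then grows with the values only, instead of repeating every column name per row.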