Kanaries / pygwalker

PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis
https://kanaries.net/pygwalker
Apache License 2.0

[BUG] Memory growth when using PyGWalker with Streamlit #618

Open ChrnyaevEK opened 2 months ago

ChrnyaevEK commented 2 months ago

Describe the bug I observe RAM growth when using PyGWalker with the Streamlit framework. RAM usage grows constantly on page reload (on every app run). When using Streamlit without PyGWalker, RAM usage remains constant (flat, does not grow). It seems memory is never released; this was observed indirectly (we tracked the growth locally, see the reproduction below, but we also observe the same issue in an Azure web app, where RAM usage never declines).

To Reproduce We tracked down the issue with an isolated Streamlit app using PyGWalker and memory_profiler (run with python -m streamlit run app.py):

# app.py
import numpy as np
np.random.seed(seed=1)
import pandas as pd
from memory_profiler import profile
from pygwalker.api.streamlit import StreamlitRenderer

@profile
def app():
    # Create random dataframe
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    render = StreamlitRenderer(df)
    render.explorer()
app()

Observed output for a few consecutive reloads from the browser (press R to rerun):

Line #    Mem usage    Increment  Occurrences   Line Contents
    13    302.6 MiB     23.3 MiB           1       render.explorer()
    13    315.4 MiB     23.3 MiB           1       render.explorer()
    13    325.8 MiB     23.3 MiB           1       render.explorer()

Expected behavior RAM usage remains at a constant level between app reruns.

Screenshots The screenshot shows user activity peaks (causing CPU usage) and growing RAM usage (memory working set). Metrics from Azure

This screenshot shows memory profiling of the debug app. Debug app memory profile

Versions
streamlit 1.38.0
pygwalker 0.4.9.3
memory_profiler (latest)
python 3.9.10
browser: Chrome 128.0.6613.138 (Official Build) (64-bit)
Tested locally on Windows 11

Thanks for your support!

ChrnyaevEK commented 2 months ago

Update

It seems I may have misinterpreted my observations. I continued to track the production app and did some more testing, and the results point away from PyGWalker (potentially toward the Azure web app or other issues in our production code), contrary to what I originally thought. I will do local tests with the memory profiler to see how it behaves over time, to rule out this observation as well.

I'm sorry for the disturbance; I will continue debugging as new evidence comes in.

Production app observations

A health endpoint has been added to our production version, and now we observe strange memory behaviour even without opening the PyGWalker explorer (PyGWalker was still imported as a package). The health check opens an empty Streamlit page every 5 minutes, and over the last 24 hours RAM usage grew gradually (in the image you can see used memory approaching 500 MB, with no spikes, at a constant rate correlated with the health calls).

RAM usage in production

Sample app deployment

I also tested a sample app deployment on Azure to exclude Azure resource virtualization issues, but the results did not confirm the original hypothesis.

Without PyGWalker

Sample app without PyGWalker on Azure

# app.py
import numpy as np
np.random.seed(seed=1)
import pandas as pd
import streamlit as st

def app():
    # Create random dataframe
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    st.table(df)
app()

With PyGWalker

A sample app with PyGWalker was also deployed to Azure (it has been running for a few hours now). However, it behaves as expected and releases memory when objects are destroyed, which makes me think that the problem with our production version lies somewhere else.

Sample app with PyGWalker on Azure

import numpy as np
np.random.seed(seed=1)
import pandas as pd
from pygwalker.api.streamlit import StreamlitRenderer

def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    render = StreamlitRenderer(df)
    render.explorer()
app()

longxiaofei commented 2 months ago

Hi @ChrnyaevEK, thanks for your feedback.

Use the latest pygwalker version and try caching the StreamlitRenderer; this may avoid the memory growth.

from pygwalker.api.streamlit import StreamlitRenderer
import pandas as pd
import streamlit as st

@st.cache_resource
def get_pyg_renderer() -> "StreamlitRenderer":
    df = pd.read_csv("xxx")
    return StreamlitRenderer(df)

renderer = get_pyg_renderer()

renderer.explorer()

There are several reasons why pygwalker memory grows:

  1. StreamlitRenderer(df) parses the dataframe and infers the data types.
  2. render.explorer() renders the UI in an HTML iframe (version 0.4.9.8 uses a Streamlit custom component to render the pygwalker UI; the Streamlit component has optimized this part of the memory overhead).
  3. For data-calculation communication, the computed data has to go over HTTP through the customized Tornado endpoint. (This will also be optimized in future versions.)

Over the coming period, pygwalker will optimize the user experience of the Streamlit component. Thank you again for your feedback.

ChrnyaevEK commented 2 months ago

Hi @longxiaofei! Thanks for your attention.

Caching

I'm afraid caching is not an option in this case; our data change with every request, so the cached function would have to look more like this:

@st.cache_resource
def get_pyg_renderer(key: str) -> "StreamlitRenderer":
    df = pd.read_csv(key)
    ...

which is basically equivalent to no cache at all. ttl and max_entries will not help either.
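That said, if some requests happen to carry identical data, one middle ground would be to key the cache on a cheap fingerprint of the dataframe rather than on the request itself, so identical data reuses a renderer while max_entries bounds how many renderers stay alive. This is only a sketch I have not verified against the leak; load_current_data is a hypothetical placeholder for however our per-request data is obtained:

import hashlib

import pandas as pd
import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

def load_current_data() -> pd.DataFrame:
    # Hypothetical stand-in for however the per-request data is produced.
    return pd.read_csv("xxx")

def fingerprint(df: pd.DataFrame) -> str:
    # Cheap content-based key; only this string is hashed by Streamlit.
    return hashlib.sha1(
        pd.util.hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()

@st.cache_resource(max_entries=3)
def get_pyg_renderer(key: str, _df: pd.DataFrame) -> StreamlitRenderer:
    # The leading underscore tells Streamlit not to hash _df itself,
    # so only `key` determines cache hits.
    return StreamlitRenderer(_df)

def app():
    df = load_current_data()
    renderer = get_pyg_renderer(fingerprint(df), df)
    renderer.explorer()

app()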

I did, however, test the cache_resource approach (keyed on a random value, below) and I am still facing the same strange behavior.

import numpy as np
import pandas as pd

import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

@st.cache_resource(max_entries=3, ttl=20)
def get_render(key: int):
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )

    return StreamlitRenderer(df)

def app():
    render = get_render(np.random.randint(1, 100))
    render.explorer()

app()

Running this app locally (Windows, as described in the first message, with pygwalker 0.4.9.3, since this is our production version) results in constantly growing memory (it seems to occasionally release an insignificant amount of memory, but it does not return to the initial values). RAM used by the python process running the Streamlit server with the cached pygwalker renderer
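To check whether this growth is simply garbage waiting to be collected, I may also force a full collection at the end of each rerun. This is a small sketch on top of the cached variant above; if the process RSS keeps growing after gc.collect(), the memory is held by live references (or by the allocator) rather than by uncollected garbage:

import gc

import numpy as np
import pandas as pd
import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

@st.cache_resource(max_entries=3, ttl=20)
def get_render(key: int) -> StreamlitRenderer:
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    return StreamlitRenderer(df)

def app():
    render = get_render(np.random.randint(1, 100))
    render.explorer()
    # Force a full collection at the end of each rerun and report how many
    # unreachable objects were found.
    freed = gc.collect()
    print(f"gc.collect() found {freed} unreachable objects")

app()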

Other local tests

I also tested a few other code snippets locally to confirm that memory would eventually be released, but it seems it is not.

Bare Streamlit

Code

import numpy as np
import pandas as pd
import streamlit as st

def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    st.dataframe(df)
app()

Debug sequence

streamlit server start (python -m streamlit run ...) - 12:25 (memory increase due to initial object initialization)
restart (R) - 12:27 (memory increased)
restart (R) - 12:28 (memory increased)
restart (R) - 12:29 (memory increased)
restart (R) - 12:30 (memory increased)
restart (R) - 12:31 (memory did not react)
page close - 12:32 (memory decreased, but not to initial level)
stop - 12:58 (before stop a few slight memory decreases were observed without any external trigger)
Total test time: ~30min

Graph

See attached PDF debug.pdf

Streamlit with PyGWalker

Code

import numpy as np
import pandas as pd
from pygwalker.api.streamlit import StreamlitRenderer

def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    render = StreamlitRenderer(df)
    render.explorer()
app()

Debug sequence

start - 13:09
restart - 13:11 (significant memory increase)
restart - 13:12 (memory increase)
restart - 13:13 (memory increase)
restart - 13:14 (memory increase)
restart - 13:15 (memory increase)
page close - 13:16 (memory decrease, not to initial values)
stop - 13:40 (no memory decrease observed)

Graph

See attached PDF debug.pdf, same as above

Conclusions so far

Apps both with and without PyGWalker hold on to memory. PyGWalker allocates memory on every rerun; bare Streamlit seems to eventually saturate (it may not allocate a noticeable amount of memory).

There is no issue opening multiple Streamlit apps without PyGWalker, but as soon as PyGWalker is used we run out of memory (even with the cache). This seems to be confirmed both locally and on Azure.

I still suspect some issue with PyGWalker on Streamlit (maybe PyGWalker just misuses Streamlit's caching mechanisms). Can you please check for steady memory growth when running a minimal PyGWalker app locally?
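If it helps, this is roughly how I plan to instrument the minimal app to get comparable numbers. It is only a sketch: it assumes psutil is installed, and tracemalloc only sees Python-level allocations, not memory held by native code or retained by the allocator:

import os
import tracemalloc

import numpy as np
import pandas as pd
import psutil
import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

# Start tracing once per server process (the guard keeps reruns idempotent).
if not tracemalloc.is_tracing():
    tracemalloc.start()

def log_rss(tag: str) -> None:
    # Resident set size of the Streamlit server process, printed to the
    # terminal running `streamlit run`.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    print(f"[{tag}] RSS: {rss_mb:.1f} MiB")

def app():
    log_rss("before render")
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    StreamlitRenderer(df).explorer()
    log_rss("after render")

    # Diff Python-level allocations against the previous rerun to see which
    # call sites keep growing; the previous snapshot is kept in session_state.
    snapshot = tracemalloc.take_snapshot()
    previous = st.session_state.get("_mem_snapshot")
    if previous is not None:
        for stat in snapshot.compare_to(previous, "lineno")[:10]:
            print(stat)
    st.session_state["_mem_snapshot"] = snapshot

app()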

Thanks!