Flask monitoring dashboard not monitoring when the flask application is running through supervisor

monumahali commented 4 years ago

I am using the Flask monitoring dashboard for monitoring purposes. The issue comes when I run the flask app through Gunicorn, Flask monitoring dashboard works fine and normal but when I run it with the supervisor Flask monitoring dashboard is not updating i.e it's not monitoring anymore.

mircealungu commented 4 years ago

Hi @monumahali !

Can you provide more information about your setup? Is your code open source? Do you see anything in the logs? Are you sure that the FMD can find the configuration file at startup?

Cheers, M.

P.S. Did you see the survey? We'd be thankful if you filled it in :) https://docs.google.com/forms/d/10sOStXH13FoH45_xZaEwHEFWzRCDsr5gTgPWCdsdxp0/edit

monumahali commented 4 years ago

Hi @mircealungu

No, it's not open source. But think of it as a simple Flask application. The only problem is that it doesn't update it when the Flask app is started using the supervisor but works fine if the app starts with gunicorn. I have not checked the supervisor logs but will check and revert. No, I am not sure if FMD finds the configuration file, also I don't understand how and where to check this but you guys can definitely help me on this.

Thanks, Monu

mircealungu commented 4 years ago

Dear @monumahali ,

One thing that might be relevant is that the FMD tries to load it's configuration from the file that's specified in the environment variable FLASK_MONITORING_DASHBOARD_CONFIG.

However, unless you provide steps to help replicate your problem on my machine I can't help (I don't know what server you;re using instead of gunicorn, nor how is that configured).

Maybe you can create a toy GitHub project that illustrates the problem? Then I'll be happy to help!

Cheers, And good luck, M.

monumahali commented 4 years ago

Hi @mircealungu,

You can just create a simple flask app with FMD and try to run it with supervisor on top of gunicorn. supervisor is essentially used to start the server whenever it is down automatically.

If your flask app runs on supervisor and FMD is also updating and working fine then I can understand that supervisor is not the culprit.

mircealungu commented 4 years ago

I've never used Gunicorn, nor Supervisor. So unless you can share step by step what should I do to replicate the problem (e.g. what config should I use with the two of them, etc.), I don't think I can replicate your error .

RohitShende commented 4 years ago

Same issue is being faced by me. I am also running flask app using supervisor and gunicorn.

RohitShende commented 4 years ago

Steps to replicate the issue with gunicorn and supervisor:

Create a file wsgi.py

from app import create_app

application = create_app('production')

if __name__ == "__main__":
    application.run(debug=True)

Then install gunicorn using - pip install gunicorn

and try to run the app with gunicorn using command -

gunicorn -b 0.0.0.0:5000 wsgi:application

After that you use some of the endpoints and then go to http://localhost:5000/dashboard/login

Ater you login to the FMD you click on the "Median request duration (ms)"

And it shows 0 for all.

Attaching screenshot bug in fmd

RohitShende commented 4 years ago

it is not capturing the duration of the request. Though it captures the number of requests

RohitShende commented 4 years ago

@FlyingBird95 - Congratulations on the new release of 3.1.0 Please add support for gunicorn too. We are not able to monitor the apis since the production code uses gunicorn. And this is the case with most of the flask production apps

RohitShende commented 4 years ago

@mircealungu Please check my comments.

mircealungu commented 4 years ago

@RohitShende -- Thanks for the headsd up Rohit. We're aware of the gunicorn issues, but momentarily we don't have the engineering power to move quickly in that direction. I'll ping you when it's solved. If there was a paid version of FMD maybe we could afford to move faster. (BTW, would {you | your company} {be interested | afford} paying a small {monthly | yearly} fee for a service like FMD?

alexander34ro commented 4 years ago

I created a basic Flask + gunicorn + Flask Monitoring Dashboard application: https://github.com/alexander34ro/flask-monitoring-dashboard-gunicorn

I deployed it on Heroku: https://flask-monitoring-dash-gunicorn.herokuapp.com/

Checking the dashboard, the Median Request Duration is definitely not updating, even though the Number of Hits tab functions correctly.

Median Request Duration

Number of Hits

bogdanp05 commented 4 years ago

Hi everyone! I finally got the chance to look into the bug. The duration is recorded in the database, but the frontend doesn't display it when it's a too small of a number. Quick ways to verify this:

Create a new endpoint with a time.sleep(1) in it. The median duration will be displayed correctly.
Run the following query against the FMD database: select e.name, r.time_requested, r.duration from Request as r, Endpoint as e where r.endpoint_id=e.id and e.name="your_endpoint_name" This should show some very small non-zero values for the duration.

So the bug is real, but it seems to be a frontend issue and not a gunicorn/supervisor/data collection issue.

Disclaimer: I didn't use supervisor in my setup, but I'd be very surprised if that causes any issues, since supervisor is only a Unix tool that makes sure the process is restarted if it crashes.

mircealungu commented 4 years ago

@bogdanp05

We looked yesterday a bit with @alexander34ro at this problem. One observation is that, if you enable Monitoring level 3 for the hello-world endpoint this is what you get:

It seems to me that gunicorn does some program transformation magic and moves the entire body of the method in a separate thread. And since FMD doesn't handle parallel threads, this is why the times are always 0 or close to zero?

bogdanp05 commented 4 years ago

@mircealungu I see. My previous comment was strictly about the overview page.

For the profiler page, are we sure gunicorn is the cause? Does running the app via the Flask development server produce a different output? Looking at the screenshots of the hellow_world endpoints, it looks like the total duration is close to 0 ms, which is less than the profiler's sampling time, so nothing would be displayed regardless of the web server used.

RohitShende commented 4 years ago

I do not feel that the low duration is the cause. In my case it is taking around 6 sec i.e. 6000 ms for the endpoints even then it is showing 0 when using app with gunicornand if we use python run.py then it shows the durations. Check the images attached in my previous comments.

RohitShende commented 4 years ago

Attaching images of production vs local app

Local

python run.py

local

Production

gunicorn -k gevent -w 10 --worker-connections 1000 --certfile=/etc/ssl/certs/xyz.cer --keyfile=/etc/ssl/certs/xyz.key -b localhost:8080 --log-file gunicorn.log --log-level debug wsgi directory=/data/flask-app

production

bogdanp05 commented 4 years ago

Thanks for the extra screenshots and especially for the gunicorn command @RohitShende! I was running gunicorn -w 4 localhost:5000 main:app and I just couldn't replicate it. I'll try again later using your command.

alexander34ro commented 4 years ago

I updated the toy app with a 5 seconds delay and I tried to use a command as similar as possible to what @RohitShende is using.

web: gunicorn -k gevent -w 3 --worker-connections 1000 --log-file gunicorn.log --log-level debug wsgi:app

I redeployed the app and the median time is correct.

Here you can find the repo for the toy app: https://github.com/alexander34ro/flask-monitoring-dashboard-gunicorn And here is the dashboard running on Heroku: https://flask-monitoring-dash-gunicorn.herokuapp.com/dashboard

RohitShende commented 4 years ago

Hi,

After the above comment. I too tried to replicate the issue using this app https://github.com/alexander34ro/flask-monitoring-dashboard-gunicorn

But I failed to replicate the issue

Then I made this app same as my app using the same config file

The config file

flask-monitoring-config.cfg

[dashboard]
APP_VERSION=1.0
CUSTOM_LINK=dashboard
MONITOR_LEVEL=3
OUTLIER_DETECTION_CONSTANT=2.0
ENABLE_LOGGING=True

[authentication]
USERNAME=admin
PASSWORD=admin
GUEST_USERNAME=guest
GUEST_PASSWORD=['dashboardguest!', 'second_pw!']
SECURITY_TOKEN=cc83733cb0af8b884ff6577086b87909

[database]
DATABASE=sqlite:///flask_monitoringdashboard.db

[visualization]
COLORS={'main':'[0,97,255]', 'static':'[255,153,0]'}

I modified the code in app.py like below

app = Flask(__name__)
dashboard.config.init_from(file='flask-monitoring-config.cfg')
dashboard.bind(app)

And I was able to replicate the issue in hello_world app

With logging enabled I see these logs as well

Can't get the stacktrace of the main thread. Stopping StacktraceProfiler
Thread to monitor: 139829990007368
Running threads: dict_keys([139830678750976])
Can't get the stacktrace of the main thread.

The command I used to run the app using gunicorn was : gunicorn -k gevent -w 2 --worker-connections 10 -b 0.0.0.0:4000 --log-file gunicorn.log --log-level debug wsgi:app

Surprisingly when I use the below flask command to run the same hello app with same config file, it works fine and captures the result. It means there is some issue with Gunicorn + Some config that I am passing in the above config

export FLASK_APP=wsgi.py
flask run -h 0.0.0.0 -p 4444

RohitShende commented 4 years ago

On debugging more I found out the combination is

MONITOR_LEVEL=3 + Gunicorn

I changed it to MONITOR_LEVEL=2 and I was able to see the median time getting captured correctly.

I still saw the error log Can't get the stacktrace of the main thread.

Also observed that MONITOR_LEVEL=1 + Gunicorn works nicely and there is no error in the logs.

This confirms why it was working for @alexander34ro since the default MONITOR_LEVEL is 1 and it is also seen in the screenshot that he attached

FlyingBird95 commented 4 years ago

Hey all,

First of all, thank RohitStende for your analysis. This has helped me to better identify the exact problem. Let me first expain what we do in the FMD. We start by wrapping the endpoint based on monitoring-level specific functions. We use the following classes for this:

The first two profilers, are not actually profilers, because they only run to update the data in the database. The last two (level 2 and 3) are started before the request is handled, and are monitoring the stracktrace of the endpoint. Thus, when processing a request (that is profiled by either level 2 or 3), we're always dealing with two threads. And this is exactly what is going wrong using gunicorn/supervisor.

I've taken a look at the architecture of Gunicoorn. They mention a "pre-fork worker model". Since I didn't exactly know what this is, I looked at stack overflow. One of the answers explains that A fork is a completely separate *nix process.

Thus, the exact problem is that the our profilers cannot get the stacktrace of the main thread. They do this, using the following snippet:

try:
    frame = sys._current_frames()[self._thread_to_monitor]
except KeyError:
    log('Can\'t get the stacktrace of the main thread. Stopping StacktraceProfiler')
    log('Thread to monitor: %s' % self._thread_to_monitor)
    log('Running threads: %s' % sys._current_frames().keys())
traceback.extract_stack(frame)

My guess is that the sys._current_frames() can only get the frames (i.e. threads) in the same *nix process. Operating Systems do this for security reasons, otherwise you a process can monitoring any process on the same computer. N.B. I'm not 100% about the claims in this paragraph, but is seems reasonable.

My second claim would be that when a thread is spanned in gunicorn, it is handled in a different *nix process. I'm actually not entirely sure about this claim, but it would connect all the info mentioned above.

I would like to ask someone to verify my two claims that I've made above. When this is indeed the case, we can think about how to overcome this issue (maybe by using a different profiler, that doesn't run into a different thread).

Kind regards,

Patrick

dMAC95 commented 1 year ago

I'm trying to implement a patch, here's what chatGPT said about implementation

"When Gunicorn starts, it initializes the main process and then forks worker processes. Because of the nature of process forking, the worker processes don't share memory with the parent process or with each other, so anything initialized in the main process isn't available in the worker processes.

Gunicorn provides hooks that you can use to run code during certain parts of the server's lifecycle, and these hooks can be useful for solving problems related to worker initialization. The post_fork hook in particular is run after a worker process has been forked, and you can use it to reinitialize your module in each worker process.

Here's an example of how to do this. Suppose the module you want to initialize is named mymodule and has an initialize() function:

def post_fork(server, worker):
    import mymodule
    mymodule.initialize()

You would add this function to your Gunicorn configuration file. If you don't already have a configuration file, you can create one. For example, you could create a file named gunicorn_config.py:

def post_fork(server, worker):
    import mymodule
    mymodule.initialize()

bind = '127.0.0.1:8000'
workers = 4

In this configuration file, bind specifies the IP address and port that Gunicorn should bind to, and workers specifies the number of worker processes.

Then you would start Gunicorn with the --config or -c option to specify the configuration file:

gunicorn -c gunicorn_config.py myapp:app

In this command, myapp:app would be replaced with the import path to your actual WSGI application.

This way, each worker process will run the post_fork function after it's forked, and the initialize() function in your module will be called in each worker process, allowing the module to function correctly in the worker processes."

dMAC95 commented 1 year ago

More chatGPT insight

"When Gunicorn forks worker processes, each worker gets its own separate instance of the Flask application. These application instances are not created until after the workers have been forked, so they aren't directly available in the post_fork function. However, you can access the Flask application in each worker process once it starts handling requests.

If you need to perform some initialization on the Flask application in each worker process, one way to do this would be to use a middleware that performs the initialization the first time each worker handles a request.

Here's an example of what this could look like:

class InitializeAppMiddleware:
    def __init__(self, app):
        self.app = app
        self.initialized = False

    def __call__(self, environ, start_response):
        if not self.initialized:
            # Perform your initialization here.
            # This could involve calling a method on self.app,
            # or modifying self.app in some way.
            self.initialized = True
        return self.app(environ, start_response)

You would then wrap your Flask application with this middleware:

from flask import Flask
from mymodule import InitializeAppMiddleware  # Import the middleware

app = Flask(__name__)
app.wsgi_app = InitializeAppMiddleware(app.wsgi_app)

With this setup, the first time each worker process handles a request, it will run the initialization code in the middleware. This allows you to perform per-worker initialization that involves the Flask application.

Please remember to take thread-safety into account if you're using the gthread worker type, as the middleware could be accessed by multiple threads at the same time. In this case, you might need to use a lock to ensure that the initialization only happens once."

flask-dashboard / Flask-MonitoringDashboard

Flask monitoring dashboard not monitoring when the flask application is running through supervisor #296

Local

Production