Netflix / metaflow-service

:rocket: Metadata tracking and UI service for Metaflow!
http://www.metaflow.org
Apache License 2.0

Timeout on endpoint `/flows/{flow_id}` #448

Closed Joseda8 closed 2 weeks ago

Joseda8 commented 2 weeks ago

I'm using Metaflow in Python to get data from a specific run. The way to do this is very explicit and clear in the Metaflow documentation:

from metaflow import Flow
Flow(flow_name)[run_id]

This accesses the endpoint /flows/{flow_id}. This operation very often fails with a 504 Gateway Timeout, and the same problem affects any request sent to the configured METAFLOW_SERVICE_URL. Is there a way to extend the timeout or implement a retry mechanism?

Update: it seems there is already a TODO comment about this in services/data/postgres_async_db.py.
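
In the meantime, a purely client-side mitigation for the retry question above would be to wrap the lookup in a retry loop. This is only a minimal sketch: the helper name and the retry/backoff parameters are made up, and the broad except Exception should be narrowed to the exact error Metaflow raises when the service returns a 504.

    import time

    from metaflow import Flow

    def get_run_with_retry(flow_name, run_id, retries=3, backoff=5):
        # Retry the metadata lookup a few times with linear backoff between attempts.
        for attempt in range(retries):
            try:
                return Flow(flow_name)[run_id]
            except Exception:
                # On the last attempt, re-raise so the caller sees the real error.
                if attempt == retries - 1:
                    raise
                time.sleep(backoff * (attempt + 1))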

Joseda8 commented 2 weeks ago

I believe a more general solution would be to make the timeout configurable in services/metadata_service/server.py, as follows:

    # Let the user override the DB timeout via an environment variable
    db_config_params = {
        # Values read from the environment arrive as strings, so cast to int
        "timeout": int(os.environ.get("MF_METADATA_DB_TIMEOUT", 60)),
    }
    the_app = app(loop, DBConfiguration(**db_config_params), path_prefix=PATH_PREFIX)

I'd like to open a pull request with this change.

saikonen commented 2 weeks ago

Based on https://github.com/Netflix/metaflow-service/blob/master/services/utils/__init__.py#L264, the environment variable should already be supported. Is this not working?

As I noted on Slack as well, there are some scaling issues with certain API routes that a simple timeout will not resolve, but a thorough fix for these is coming.

In the meantime, you should be able to access a specific run directly with

from metaflow import Run
run = Run("FlowName/run_id")

which skips the problematic endpoint.
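
For completeness, once the Run object is constructed this way, the usual client API works on it. A small example (the flow name and run id are placeholders, and run.data is only available once the run has finished):

    from metaflow import Run

    run = Run("FlowName/run_id")
    print(run.finished)    # True once the run has completed
    print(run.successful)  # True if the run reached its end step successfully
    for step in run:
        print(step.id)     # iterate over the run's steps
    print(run.data)        # artifacts of the end task; only available for finished runs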

Joseda8 commented 2 weeks ago

@saikonen, thanks a lot for the explanation. I can confirm that the MF_METADATA_DB_TIMEOUT variable exists; thanks for pointing it out! Thanks also for the workaround using Run("FlowName/run_id").