Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

Heartbeat daemon creates a run out of production namespace #2029

dkarlo2 closed 1 week ago

dkarlo2 commented 1 week ago

When the heartbeat daemon is enabled, running a flow deployed in the production namespace sometimes results in a run executed under the user namespace. This often causes Object not in the current namespace errors, and the runs often fail. This seems to happen when the heartbeat daemon starts before the start step. I identified a probable cause: some Metaflow environment variables (e.g. METAFLOW_PRODUCTION_TOKEN) are missing when the daemon is deployed here. I assume that the daemon here then registers the run in the user namespace instead of the production namespace.
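To illustrate the suspected mechanism, here is a toy model of it. This is not Metaflow's actual code: `resolve_namespace`, the `production:`/`user:` string formats, the `argo` username, and the `example-token` value are all hypothetical, chosen only to show how a process that lacks METAFLOW_PRODUCTION_TOKEN could fall back to a user namespace while the steps resolve a production one.

```python
def resolve_namespace(env, username):
    """Toy model (hypothetical, not Metaflow's implementation):
    choose the namespace a process would register a run under."""
    token = env.get("METAFLOW_PRODUCTION_TOKEN")
    if token:
        # Token present: the run is registered as a production run.
        return "production:" + token
    # Token missing: fall back to the namespace of the current user.
    return "user:" + username


# Environment of a regular step: the token is passed through.
step_env = {"METAFLOW_PRODUCTION_TOKEN": "example-token"}
# Environment of the daemon: the token was not passed through.
daemon_env = {}

step_ns = resolve_namespace(step_env, "argo")
daemon_ns = resolve_namespace(daemon_env, "argo")

print(step_ns)
print(daemon_ns)
```

If the daemon is the first process to register the run, the run ends up with the second namespace, and later lookups from the production namespace would no longer match it.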

saikonen commented 1 week ago

Opened a PR with a possible fix for this if you want to give it a spin. I was unable to get my test flows to outright fail, but did notice that the runs sometimes register with incorrect project info as reported.

Because the issue is so timing-dependent, it is hard to test reliably, but from what I observed, adding the missing environment variables did seem to fix it.

saikonen commented 1 week ago

Looping back on this: the complete fix was introduced in https://github.com/Netflix/metaflow/releases/tag/2.12.22, where the Argo Workflows daemon now correctly registers all project-related system tags for a flow.
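For anyone who wants to check whether their environment is at or past that release, a minimal sketch follows. It assumes the package is installed under the distribution name `metaflow` and that releases after 2.12.22 keep the fix. The helper names (`parse_release`, `includes_fix`, `installed_metaflow_has_fix`) are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Release named in this thread as containing the complete fix.
FIX_RELEASE = (2, 12, 22)


def parse_release(text):
    """Turn a version string such as '2.12.22' into a tuple of three ints.
    Non-numeric suffixes (e.g. 'rc1') on a component are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def includes_fix(installed):
    """Compare numerically, so that '2.12.9' sorts before '2.12.22'."""
    return parse_release(installed) >= FIX_RELEASE


def installed_metaflow_has_fix():
    """Return True/False for the installed package, or None if it is absent."""
    try:
        return includes_fix(version("metaflow"))
    except PackageNotFoundError:
        return None


print(installed_metaflow_has_fix())
```

The comparison is done on integer tuples rather than strings because a plain string comparison would rank 2.12.9 above 2.12.22.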