Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

Race condition in local flow launcher #1237

Open jseppanen opened 1 year ago

jseppanen commented 1 year ago

Hello,

here's a small repro that demonstrates a race condition when launching parallel local runs with Metaflow 2.7.15.

race_flow.py:

from metaflow import FlowSpec, Parameter, step

class RaceFlow(FlowSpec):

    message = Parameter("message")

    @step
    def start(self):
        self.message2 = self.message
        self.next(self.print_message)

    @step
    def print_message(self):
        print(self.message2)
        self.next(self.end)

    @step
    def end(self):
        print(f"printed message: {self.message2}")

if __name__ == "__main__":
    RaceFlow()

run_race_flows.sh:

#!/bin/sh

status="OK"

while [ "$status" == "OK" ]; do
    (
        python race_flow.py run --message AAA &
        python race_flow.py run --message BBB &
        wait
    ) > run_race_flows.log 2>&1

    status=$(grep -q AAA run_race_flows.log && grep -q BBB run_race_flows.log && echo OK)
done

output:

❯ sh run_race_flows.sh
❯ cat run_race_flows.log
Metaflow 2.7.15Metaflow 2.7.15 executing RaceFlow executing RaceFlow for user:jarno.seppanen for 
user:jarno.seppanen
Validating your flow...
Validating your flow...
    The graph looks good!
Running pylint...
    The graph looks good!
Running pylint...
    Pylint is happy!    Pylint is happy!

2023-01-18 11:27:35.695 Workflow starting (run-id 1674034055691361):
2023-01-18 11:27:35.695 Workflow starting (run-id 1674034055691361):
2023-01-18 11:27:35.699 [1674034055691361/start/1 (pid 12699)] Task is starting.
2023-01-18 11:27:35.699 [1674034055691361/start/1 (pid 12700)] Task is starting.
2023-01-18 11:27:36.098 [1674034055691361/start/1 (pid 12699)] Task finished successfully.
2023-01-18 11:27:36.098 [1674034055691361/start/1 (pid 12700)] Task finished successfully.
2023-01-18 11:27:36.102 [1674034055691361/print_message/2 (pid 12706)] Task is starting.
2023-01-18 11:27:36.102 [1674034055691361/print_message/2 (pid 12705)] Task is starting.
2023-01-18 11:27:36.424 [1674034055691361/print_message/2 (pid 12706)] 2023-01-18 11:27:36.424 BBB
[1674034055691361/print_message/2 (pid 12705)] BBB
2023-01-18 11:27:36.488 [1674034055691361/print_message/2 (pid 12705)] Task finished successfully.
2023-01-18 11:27:36.489 [1674034055691361/print_message/2 (pid 12706)] Task finished successfully.
2023-01-18 11:27:36.492 [1674034055691361/end/3 (pid 12711)] 2023-01-18 11:27:36.492 Task is starting.
[1674034055691361/end/3 (pid 12712)] Task is starting.
2023-01-18 11:27:36.815 [1674034055691361/end/3 (pid 12711)] printed message: BBB
2023-01-18 11:27:36.815 [1674034055691361/end/3 (pid 12712)] printed message: BBB
2023-01-18 11:27:36.878 2023-01-18 11:27:36.878 [1674034055691361/end/3 (pid 12711)] [1674034055691361/end/3 (pid 12712)] Task finished successfully.
Task finished successfully.
2023-01-18 11:27:36.879 2023-01-18 11:27:36.879 Done!
Done!

Expected output: the parallel runs should print both AAA and BBB every time, but sometimes both print AAA or both print BBB, as in the log above where both runs ended up with the same run ID (1674034055691361) and printed BBB.

romain-intel commented 1 year ago

Hi. Yes, this is unfortunately "expected". In the local case, the run ID is basically the epoch timestamp, so if two flows of the same type (and therefore the same flow ID) are launched within the same microsecond and get the same timestamp, they end up with the same run ID and there is a possibility of a race.

This design choice is by no means perfect; it balances the trade-off between guaranteeing unique IDs, the portability and complexity of generating a truly unique ID, and the ugliness of the resulting run ID. Open to other suggestions. It is true that this is kind of a "hidden" bug in the sense that it is hard to detect.
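
For illustration, here is a minimal sketch of the ID scheme described above; it is an assumption about the general approach, not Metaflow's actual implementation:

import time

def epoch_run_id() -> str:
    # A run ID derived from the epoch time in microseconds,
    # e.g. 1674034055.691361 -> "1674034055691361". Two runs of the same
    # flow that start in the same microsecond compute the same ID and then
    # race on the shared local metadata.
    return str(int(time.time() * 1_000_000))

print(epoch_run_id())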

jseppanen commented 1 year ago

Yeah, for me it was definitely unexpected: I'm launching a dozen runs with different parameters, and only at some point did I realize that they don't actually start reliably. For such a use case the collision chance is way too high; I'm hitting collisions every couple of days. Maybe this is a known antipattern for Metaflow?

As for potential solutions, I'm trying out whether a workaround like this would help:

if __name__ == "__main__":
    import time, random
    # add random start-up jitter so parallel launches are unlikely to hit
    # the same epoch microsecond and therefore the same run ID
    time.sleep(random.random())
    TrainCnnFlow()
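
A staggered-launch variant of this workaround is sketched below; the script path and the one-second spacing are assumptions for illustration, not something Metaflow itself provides:

import subprocess
import time

# Launch the runs one second apart so no two can start in the same
# microsecond and therefore no two can compute the same epoch-based run ID.
procs = []
for message in ["AAA", "BBB"]:
    procs.append(subprocess.Popen(
        ["python", "race_flow.py", "run", "--message", message]))
    time.sleep(1)

for proc in procs:
    proc.wait()
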
savingoyal commented 1 year ago

@jseppanen - another pattern here would be to use a foreach to run different tasks for different parameters (which can be passed to the flow as a list) within a single execution.
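
A minimal sketch of that foreach pattern follows; foreach, Parameter, and JSONType are standard Metaflow features, but the flow and file names here are hypothetical, not taken from this thread.

fan_out_flow.py:

from metaflow import FlowSpec, JSONType, Parameter, step

class FanOutFlow(FlowSpec):

    # A JSON-encoded list of messages, e.g. '["AAA", "BBB"]'
    messages = Parameter("messages", type=JSONType, default='["AAA", "BBB"]')

    @step
    def start(self):
        # Fan out: one print_message task per element of the list.
        self.message_list = list(self.messages)
        self.next(self.print_message, foreach="message_list")

    @step
    def print_message(self):
        # self.input holds this task's element of the foreach list.
        self.message = self.input
        print(self.message)
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the per-task messages back into a single artifact.
        self.printed = [task.message for task in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(f"printed messages: {self.printed}")

if __name__ == "__main__":
    FanOutFlow()

A single execution such as python fan_out_flow.py run --messages '["AAA", "BBB"]' then handles all parameters under one run ID, so there is no cross-run collision to worry about.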