jseppanen opened this issue 1 year ago
Hi. Yes, this is unfortunately "expected". In the local case, the workflow ID is basically the epoch and, if two flows of the same type (so same flow id) are launched within the same microsecond (and get the same epoch), there is a possibility of a race.
This design choice is by no means perfect, but it was meant to balance the trade-off between "unique IDs", the "portability and complexity of generating a unique ID", and the "ugliness of the run ID". Open to other suggestions. It is true that this is something of a "hidden" bug in the sense that it is hard to detect.
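To illustrate the mechanism (a rough sketch of the idea only, not Metaflow's actual implementation):

import time

def make_local_run_id():
    # hypothetical scheme: derive the local run ID from the current epoch time
    return int(time.time() * 1_000_000)

# Two flows of the same type launched at (nearly) the same instant can observe
# the same microsecond and therefore be assigned the same run ID - that is the
# race described above.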
Yeah, for me it was definitely unexpected: I'm launching a dozen runs with different parameters and only realized at some point that they don't actually start reliably. In such a use case the collision chance is way too high; I'm hitting collisions every couple of days. Maybe this is a known antipattern for Metaflow?
As for potential solution ideas, I'm trying out whether a workaround like this helps:
if __name__ == "__main__":
    import random
    import time

    # sleep for a random fraction of a second so parallel launches are less
    # likely to grab the same epoch-based run ID
    time.sleep(random.random())
    TrainCnnFlow()
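Note that the random sleep only narrows the window, though: two launches can still draw nearly identical delays, so this reduces the collision probability rather than eliminating it.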
@jseppanen - another pattern here would be to use a foreach to run different tasks for different parameters (which can be passed in as a list to the flow) within a single execution; see the sketch below.
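A minimal sketch of that foreach pattern, in case it helps (the flow name ParamSweepFlow and the configs parameter are made up for illustration; the parameter list is passed as a JSON string and fanned out within one run, so only a single run ID is allocated):

from metaflow import FlowSpec, Parameter, step
import json

class ParamSweepFlow(FlowSpec):
    # JSON-encoded list of parameter settings, e.g. '["AAA", "BBB"]'
    configs = Parameter("configs", default='["AAA", "BBB"]')

    @step
    def start(self):
        self.settings = json.loads(self.configs)
        # fan out: one task per setting, all under the same run ID
        self.next(self.train, foreach="settings")

    @step
    def train(self):
        # self.input holds this branch's setting
        print("training with", self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ParamSweepFlow()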
Hello,
here's a small repro that demonstrates a race condition when launching parallel local runs with Metaflow 2.7.15:
race_flow.py:
run_race_flows.sh:
output:
Expected behavior: the parallel runs should print both AAA and BBB every time. Actual behavior: sometimes they print AAA twice or BBB twice.
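For reference, the attached files didn't come through above; based on the description, a minimal reconstruction of race_flow.py might look something like this (flow and parameter names are guesses, not the original code):

from metaflow import FlowSpec, Parameter, step

class RaceFlow(FlowSpec):
    # string printed by the run, e.g. AAA or BBB (parameter name is a guess)
    label = Parameter("label", default="AAA")

    @step
    def start(self):
        print(self.label)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RaceFlow()

with run_race_flows.sh presumably launching two runs in parallel along the lines of: python race_flow.py run --label AAA & python race_flow.py run --label BBB & wait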