flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

Flexflow attempts to register duplicate sharding functors #1376

Open rohany opened 5 months ago

rohany commented 5 months ago

Whatever deduplication is occurring here (https://github.com/flexflow/FlexFlow/blob/inference/src/mapper/mapper.cc#L165) is not sufficient to ensure that a unique set of sharding functor IDs is created. At 32 GPUs, I get errors from the runtime that a duplicate sharding functor has been registered. This could be fixed either with a better hash function (the "hacky" solution), or by generating guaranteed-fresh IDs from the Legion runtime (https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h#L8977) and then storing a map of machine view -> sharding ID.
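A minimal sketch of the second option, to make the "map of machine view -> sharding ID" idea concrete. Everything here is illustrative rather than FlexFlow's actual code: `ViewKey` is a hypothetical stand-in for whatever fields identify a machine view, `ShardingIdRegistry` is a made-up name, and the ID source is passed in as a callable so the example compiles without Legion. In the mapper, that callable would presumably wrap the Legion call linked above (I believe this is `Runtime::generate_dynamic_sharding_id()`, but check the header).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for the fields that identify a machine view.
// FlexFlow's real MachineView type has its own layout.
struct ViewKey {
  int start_device_id;
  std::vector<int> dims;
  std::vector<int> strides;

  // Exact field-by-field ordering, so two distinct views never compare equal
  // (unlike a hash, which can collide).
  bool operator<(ViewKey const &other) const {
    return std::tie(start_device_id, dims, strides) <
           std::tie(other.start_device_id, other.dims, other.strides);
  }
};

// Assumed to match the width of Legion's ShardingID; illustrative only.
using ShardingID = std::uint32_t;

class ShardingIdRegistry {
public:
  // fresh_id must return an ID that has never been handed out before.
  explicit ShardingIdRegistry(std::function<ShardingID()> fresh_id)
      : fresh_id_(std::move(fresh_id)) {
    assert(fresh_id_ && "an ID source is required");
  }

  // Returns {id, newly_created}. The caller registers a sharding functor
  // only when newly_created is true, so no ID is ever registered twice.
  std::pair<ShardingID, bool> get_or_create(ViewKey const &view) {
    auto it = ids_.find(view);
    if (it != ids_.end()) {
      return {it->second, false};
    }
    ShardingID id = fresh_id_();
    ids_.emplace(view, id);
    return {id, true};
  }

private:
  std::function<ShardingID()> fresh_id_;
  std::map<ViewKey, ShardingID> ids_;
};
```

With this shape, the mapper would look up each machine view through `get_or_create` and register a functor only on first sight of that view. Uniqueness then comes from the ID source and the exact key comparison, not from the quality of a hash.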

jiazhihao commented 5 months ago

@lockshaw @suranap do we have the new hash function integrated in the inference branch? If so, I think it won't be an issue for 32 GPUs.

lockshaw commented 5 months ago

@jiazhihao Currently it's only merged into master (https://github.com/flexflow/FlexFlow/pull/1021), though it should be easy to port to inference.