Open bpatient78 opened 3 months ago
Question 1: Is the network for inter-DC AI training in scope for HP-WAN?
The discussion thread can be found at https://mailarchive.ietf.org/arch/msg/hp-wan/aAK3Mwq7YW4viFNUT5GNjqo9Ywg/
Points raised in the discussion so far:
a) Since a WAN can be used for inter-DC communication, training models across multiple DCs (distributed learning) can be a use case for HP-WAN to explore. https://mailarchive.ietf.org/arch/msg/hp-wan/g81y_QxY7Oh0Uz2lT_PpncpIlps/
b) There are two types of inter-DC AI traffic. One is ingesting raw input (offline data transmission), which covers dataset transmission and model deployment; it is a volume transfer problem with relatively limited latency concerns. The other is traffic during the training process (online data transmission), which is more sensitive to latency. The two should be discussed separately.
https://mailarchive.ietf.org/arch/msg/hp-wan/A6SVfs4r6webr2a-yF0wAU_FidI/ https://mailarchive.ietf.org/arch/msg/hp-wan/uOq8Flmg0ekUSuvMJrm4wm8EH04/ https://mailarchive.ietf.org/arch/msg/hp-wan/J7mSfqSjEQudfbF331X3Tn7Pa4s/
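To illustrate why point b) separates the two traffic types, here is a rough back-of-the-envelope sketch. It is not taken from the thread: the model (serialization time plus a number of round trips), the function names, and all figures (10 TB dataset, 100 MB per-iteration exchange, 100 Gbit/s link, 30 ms RTT, two round trips per iteration) are assumptions chosen only for illustration.

```python
def transfer_time(size_bytes, bandwidth_bps, rtt_s, round_trips=1):
    """Simplified completion time: serialization delay plus round-trip waits.

    Assumes the link is fully utilized and ignores loss, congestion control
    and protocol overhead (all simplifications for illustration).
    """
    serialization_s = size_bytes * 8 / bandwidth_bps
    return serialization_s + round_trips * rtt_s


def latency_share(size_bytes, bandwidth_bps, rtt_s, round_trips=1):
    """Fraction of the completion time that is spent waiting on round trips."""
    total_s = transfer_time(size_bytes, bandwidth_bps, rtt_s, round_trips)
    return round_trips * rtt_s / total_s


if __name__ == "__main__":
    LINK_BPS = 100e9  # hypothetical 100 Gbit/s inter-DC link
    RTT_S = 0.030     # hypothetical 30 ms round-trip time

    # Offline: one bulk dataset transfer (hypothetical 10 TB).
    offline = latency_share(10e12, LINK_BPS, RTT_S, round_trips=1)

    # Online: one training iteration exchanging 100 MB of state,
    # assumed to need two round trips of synchronization.
    online = latency_share(100e6, LINK_BPS, RTT_S, round_trips=2)

    print(f"offline latency share: {offline:.6f}")
    print(f"online latency share:  {online:.3f}")
```

Under these assumed numbers the round-trip time is a negligible part of the bulk transfer but dominates each training iteration, which matches the intuition that the offline case is mainly a volume problem and the online case is latency sensitive.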
References/background information on inter-DC AI training mentioned during the discussion:
[1] https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/?utm_source=perplexity
[2] https://dl.acm.org/doi/10.1145/3651890.3672233