epfml / disco

DISCO is a code-free and installation-free browser platform that allows any non-technical user to collaboratively train machine learning models without sharing any private data.
https://discolab.ai
Apache License 2.0

Allow decentralized users to join late and catch up #775

Closed JulienVig closed 1 month ago

JulienVig commented 2 months ago

Closes #718

Decentralized issues

Currently, decentralized training only works with exactly the number of peers specified in minNbOfParticipants.

Solution Implemented

  1. During onRoundBeginCommunication, peers signal their interest in joining the current round by sending a PeerJoinsRound to the server. The server keeps a list of the peers that want to join (but no longer replies with a peer list, as it currently does).
  2. The peers start training locally without the round's peer list.
  3. When peers are done training locally, they notify the server with a PeerIsReady, i.e. they are ready to exchange weight updates. The server waits until every peer that sent a PeerJoinsRound has also sent its PeerIsReady, and then sends the round's peer list.
     i. This leaves some time for peers to join the round. To prevent the round from stalling because new peers keep joining and the others must wait for them to be ready, the server can stop including peers in this round as soon as one peer is ready (and include them in the next round instead).
     ii. Peers can also leave, notifying the server before they do. As long as the peer list hasn't been sent, peers can join and leave without it being a problem.
  4. Upon receiving the peer list, peers establish p2p connections and start exchanging weight updates.
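
The server side of steps 1 to 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `RoundCoordinator`, its methods, the `PeerLeaves` message name, and the `sendPeerList` callback are all hypothetical, and the handling of minNbOfParticipants is left out. Only PeerJoinsRound and PeerIsReady come from the description above.

```typescript
// Illustrative sketch only: names other than PeerJoinsRound and PeerIsReady
// are assumptions and do not reflect DISCO's actual server code.
type PeerId = string;

type ClientMessage =
  | { type: "PeerJoinsRound"; peer: PeerId }
  | { type: "PeerIsReady"; peer: PeerId }
  | { type: "PeerLeaves"; peer: PeerId }; // hypothetical leave notification

class RoundCoordinator {
  private joined = new Set<PeerId>();    // peers taking part in the current round
  private ready = new Set<PeerId>();     // peers done with local training
  private nextRound = new Set<PeerId>(); // latecomers deferred to the next round
  private readonly sendPeerList: (peers: PeerId[]) => void;

  constructor(sendPeerList: (peers: PeerId[]) => void) {
    this.sendPeerList = sendPeerList;
  }

  handle(msg: ClientMessage): void {
    switch (msg.type) {
      case "PeerJoinsRound":
        // Step 3.i: once any peer is ready, new peers wait for the next round.
        if (this.ready.size > 0) this.nextRound.add(msg.peer);
        else this.joined.add(msg.peer);
        break;
      case "PeerIsReady":
        if (!this.joined.has(msg.peer)) return; // ignore peers not in this round
        this.ready.add(msg.peer);
        this.maybeSendPeerList();
        break;
      case "PeerLeaves":
        // Step 3.ii: leaving before the peer list is sent is harmless.
        this.joined.delete(msg.peer);
        this.ready.delete(msg.peer);
        this.nextRound.delete(msg.peer);
        this.maybeSendPeerList();
        break;
    }
  }

  // Send the peer list once every joined peer has reported ready.
  private maybeSendPeerList(): void {
    if (this.joined.size === 0 || this.ready.size < this.joined.size) return;
    this.sendPeerList(Array.from(this.joined));
    // Deferred peers form the starting set of the next round.
    this.joined = this.nextRound;
    this.nextRound = new Set<PeerId>();
    this.ready = new Set<PeerId>();
  }
}

// Usage: a and b join, c arrives after a is already ready.
const sent: PeerId[][] = [];
const server = new RoundCoordinator((peers) => { sent.push(peers); });
server.handle({ type: "PeerJoinsRound", peer: "a" });
server.handle({ type: "PeerJoinsRound", peer: "b" });
server.handle({ type: "PeerIsReady", peer: "a" });
server.handle({ type: "PeerJoinsRound", peer: "c" }); // deferred to the next round
server.handle({ type: "PeerIsReady", peer: "b" });
// sent is now [["a", "b"]]
```

In this sketch the peer list is sent exactly once per round, and a late peer such as `c` is carried over rather than rejected, which is one possible reading of step 3.i.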

Refactoring