eth-easl / modyn

Modyn is a research platform for training ML models on growing datasets.

Supervisor-Selector Interaction #64

Closed MaxiBoether closed 1 year ago

MaxiBoether commented 1 year ago

Currently, the Supervisor and Selector are separate components. We just boot some form of Selector which continuously maintains some form of current dataset. However, this has some problems:

  1. The Selector depends on the pipeline: it needs to be started when we start a pipeline, because the pipeline defines the Selector type.
  2. If the Supervisor and the Selector both independently query the storage for new data, we need to synchronize them, which is difficult. Otherwise, due to timing issues, the Supervisor might trigger training at timestamp t while having seen a different state of the data than the Selector.
  3. Implementing experiment mode/replay of data is difficult if the Selector only reflects the "latest" state of what we want to train on. Replay mechanisms would have to be replicated in the Selector instead of living only in the Supervisor.
  4. Currently, we only support online algorithms (e.g., continual learning/finetuning), but not offline algorithms (static coresets, Shapley values) that are synchronized to a trigger.

The solution to these problems is that the Supervisor drives the loop over the data and informs the Selector about new data points and about triggers, instead of the Selector independently querying storage and maintaining its own view of the data.

In pseudocode, it could look like this.

Simulation from timestamp START to timestamp END
optional input: timestamps X, Y on which we pretrain (X and Y < START)

1. run_pretraining for data from timestamp X to timestamp Y (optional, no Selector involved)
2. last_training = current_timestamp()
3. data_points = storage.get_datapoints(START, END)
4. for data_point in data_points:  # actually batched in some form
       # The Selector gets informed about this data point; depending on the implementation,
       # it will just remember it (finetuning Selector) or do some other calculation (GDumb).
       selector.inform(data_point)
       if training triggered:
           trigger_time = timestamp of the data point that caused the trigger
           selector.inform_about_trigger(trigger_time)
           gpu_node.train(last_training, trigger_time)
           last_training = trigger_time
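
The pseudocode above assumes a small Selector interface along these lines. This is only a sketch: the method names `inform`, `inform_about_trigger`, and `get_training_set` are illustrative and not necessarily the actual API we end up with.

```python
from abc import ABC, abstractmethod


class Selector(ABC):
    """Illustrative interface the Supervisor programs against (names are hypothetical)."""

    @abstractmethod
    def inform(self, data_point) -> None:
        """Called for every new data point (or batch of points) the Supervisor sees."""

    @abstractmethod
    def inform_about_trigger(self, trigger_time: int) -> None:
        """Called once when the Supervisor decides to trigger a training at trigger_time."""

    @abstractmethod
    def get_training_set(self, trigger_time: int) -> list:
        """Return the data points to train on for the trigger at trigger_time."""
```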

At the moment we trigger, some Selectors will do nothing (e.g., GDumb, finetuning, ...) because their internal buffer is updated per data point, while other Selectors (e.g., offline coresets) will only then be able to compute the actual training set for the training job that starts. For the non-experiment mode, this looks similar; we just don't have a tight loop over all data points from START to END, but instead always fetch the latest data from storage and iterate over the newest data points there (which is why I wrote "batched" in the pseudocode above: these points also arrive in batches).
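
To make the distinction concrete, here are two toy Selectors following the interface sketched above. Both are hypothetical stand-ins, not real Modyn components; data points are assumed to be `(key, timestamp)` tuples, and the "coreset" is just a random subset to keep the example short.

```python
import random


class FinetuningSelector:
    """Online-style Selector: remembers everything since the last trigger; nothing to do at trigger time."""

    def __init__(self):
        self._since_last_trigger = []

    def inform(self, data_point) -> None:
        self._since_last_trigger.append(data_point)

    def inform_about_trigger(self, trigger_time: int) -> None:
        pass  # the buffer already reflects the training set

    def get_training_set(self, trigger_time: int) -> list:
        training_set = self._since_last_trigger
        self._since_last_trigger = []
        return training_set


class OfflineCoresetSelector:
    """Offline-style Selector: computes the training set only when informed about a trigger."""

    def __init__(self, coreset_size: int):
        self._seen = []
        self._coreset_size = coreset_size
        self._training_set = []

    def inform(self, data_point) -> None:
        self._seen.append(data_point)

    def inform_about_trigger(self, trigger_time: int) -> None:
        # A real implementation would run a coreset/Shapley-style computation here;
        # we just pick a random subset of the points seen up to trigger_time.
        eligible = [(key, ts) for key, ts in self._seen if ts <= trigger_time]
        self._training_set = random.sample(eligible, min(self._coreset_size, len(eligible)))

    def get_training_set(self, trigger_time: int) -> list:
        return self._training_set
```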

After #46 is merged, this should be tackled.

MaxiBoether commented 1 year ago

After #111 is merged, we will need to call the actual gRPC endpoints in the Supervisor. Then this issue can be closed.
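
For reference, a Supervisor-side call could look roughly like the following once those endpoints exist. Everything here is hypothetical: the generated modules, service, message, and field names (`selector_pb2`, `SelectorStub`, `InformRequest`, `TriggerRequest`, the address/port) are placeholders that depend on what #111 actually introduces; only the generic `grpc` channel/stub usage is standard.

```python
import grpc

# Hypothetical generated modules; actual names depend on the protos from #111.
from selector_pb2 import InformRequest, TriggerRequest
from selector_pb2_grpc import SelectorStub

channel = grpc.insecure_channel("selector:50056")  # address and port are placeholders
selector = SelectorStub(channel)

# Inform the Selector about a batch of new data points (keys/timestamps are illustrative).
selector.inform(InformRequest(pipeline_id=1, keys=[10, 11], timestamps=[1000, 1001]))

# Tell the Selector that a training was triggered at trigger_time.
selector.inform_about_trigger(TriggerRequest(pipeline_id=1, trigger_time=1001))
```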