Currently, the Supervisor and Selector are separate components. We just boot some form of Selector which continuously maintains some form of current dataset. However, this has some problems:
- The Selector depends on the pipeline: it needs to be started when we start a pipeline, because the pipeline defines the Selector type.
- If the Supervisor and Selector both independently query the storage for new data, they need to be synchronized, which is difficult. Otherwise, due to timing issues, the Supervisor might trigger training at timestamp t but have seen a different state of the data than the Selector.
- Implementing experiment mode/replay of data is difficult if the Selector only reflects the "latest" state of what we want to train on. Replay mechanisms would have to be replicated in the Selector instead of living only in the Supervisor.
- Currently, we only support online algorithms (e.g., continual learning/finetuning), but no offline algorithms (e.g., static coresets, Shapley values) that are synchronized to a trigger.
The solution to these problems is the following:
- [ ] The Supervisor boots the Selector for the pipeline (because pipelines are started together with the Supervisor). This means that the Selector component idles, waiting for requests, and spawns Selector instances on demand for trainings.
- [ ] The Supervisor informs the Selector about new data and about triggers (this allows for replay/experiments, avoids the synchronization problems, and easily enables offline algorithm support); see the interface sketch after this list.
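To make the division of responsibilities concrete, here is a minimal sketch of what such an on-demand Selector component could look like. All names (`SelectorManager`, `register_pipeline`, the `dict`-based data points) are illustrative assumptions, not the actual API; only `inform` and `inform_about_trigger` correspond to the calls used in the pseudocode below.

```python
# Hypothetical sketch only: a Selector component that idles and spawns
# per-pipeline Selector instances on demand when the Supervisor registers
# a pipeline. All names are illustrative, not the actual API.
from typing import Callable


class Selector:
    """Base class for a per-pipeline selection strategy (e.g., finetuning, GDumb)."""

    def inform(self, data_point: dict) -> None:
        """Called by the Supervisor for every new data point."""
        raise NotImplementedError

    def inform_about_trigger(self, trigger_time: int) -> None:
        """Called by the Supervisor when a training is triggered at trigger_time."""
        raise NotImplementedError


class SelectorManager:
    """Idles and waits for Supervisor requests; holds one Selector per pipeline."""

    def __init__(self) -> None:
        self._selectors: dict[int, Selector] = {}

    def register_pipeline(self, pipeline_id: int, selector_factory: Callable[[], Selector]) -> None:
        # The Supervisor boots the Selector for the pipeline it just started,
        # since the pipeline definition determines the Selector type.
        self._selectors[pipeline_id] = selector_factory()

    def inform(self, pipeline_id: int, data_point: dict) -> None:
        self._selectors[pipeline_id].inform(data_point)

    def inform_about_trigger(self, pipeline_id: int, trigger_time: int) -> None:
        self._selectors[pipeline_id].inform_about_trigger(trigger_time)
```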
In pseudocode, it could look like this:

```
Simulation from timestamp START to timestamp END
Optional input: timestamps X, Y on which we pretrain (X, Y < START)

1. run_pretraining for data from timestamp X to timestamp Y  # optional, no Selector involved
2. last_training = current_timestamp()
3. data_points = storage.get_datapoints(START, END)
4. for data_point in data_points:  # actually batched in some form
       selector.inform(data_point)  # depending on the implementation, the Selector just remembers
                                    # the data point (finetuning Selector) or performs other
                                    # calculations (GDumb)
       if training triggered:
           trigger_time = timestamp of the data point that caused the trigger
           selector.inform_about_trigger(trigger_time)
           gpu_node.train(last_training, trigger_time)
           last_training = trigger_time
```
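As a sanity check of the loop above, here is a runnable Python sketch of the experiment-mode flow. The trigger policy (triggering after every `trigger_interval` data points), the `storage`/`gpu_node` parameters, and initializing `last_training` to `start` are simplifying assumptions for illustration.

```python
# Hypothetical experiment-mode loop; the trigger policy, the storage/gpu_node
# objects, and the initialization of last_training are placeholder assumptions.


def run_experiment(storage, selector, gpu_node, start: int, end: int, trigger_interval: int) -> None:
    last_training = start  # simplification; the pseudocode uses current_timestamp()
    points_since_trigger = 0

    for data_point in storage.get_datapoints(start, end):  # actually batched in some form
        selector.inform(data_point)
        points_since_trigger += 1

        # Placeholder trigger policy: trigger after every trigger_interval data points.
        if points_since_trigger >= trigger_interval:
            trigger_time = data_point["timestamp"]  # the data point that caused the trigger
            selector.inform_about_trigger(trigger_time)
            gpu_node.train(last_training, trigger_time)
            last_training = trigger_time
            points_since_trigger = 0
```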
At the moment we trigger, some Selectors will do nothing (e.g., GDumb, finetuning, ...) because their internal buffer is updated per data point, while other Selectors (e.g., offline coresets) will only then be able to calculate the actual training set for the training job that then starts. For the non-experiment mode, this looks similar; we just don't have a tight loop over all data points from START to END, but instead always fetch the latest data from storage and iterate over the newest data points there (which is why I wrote "batched" in the pseudocode above; these points also come in batches).
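To illustrate the two behaviors, here are sketches of a per-data-point Selector versus an offline Selector, building on the hypothetical `Selector` base class above. The `get_training_set` accessor and the timestamp-based placeholder selection are assumptions for illustration, not how an actual coreset or Shapley-value method would select data.

```python
# Illustrative Selector variants, not actual implementations.


class FinetuningSelector(Selector):
    """Keeps its buffer up to date per data point; nothing left to do on trigger."""

    def __init__(self) -> None:
        self._buffer: list[dict] = []

    def inform(self, data_point: dict) -> None:
        self._buffer.append(data_point)  # per-data-point bookkeeping

    def inform_about_trigger(self, trigger_time: int) -> None:
        pass  # the buffer already is the training set

    def get_training_set(self) -> list[dict]:  # hypothetical accessor
        training_set, self._buffer = self._buffer, []
        return training_set


class OfflineCoresetSelector(Selector):
    """Defers all selection work until the trigger arrives."""

    def __init__(self, coreset_size: int) -> None:
        self._seen: list[dict] = []
        self._training_set: list[dict] = []
        self._coreset_size = coreset_size

    def inform(self, data_point: dict) -> None:
        self._seen.append(data_point)  # only record; no selection yet

    def inform_about_trigger(self, trigger_time: int) -> None:
        # Placeholder: keep the newest points up to the trigger; a real coreset
        # or Shapley-value method would score and select points here instead.
        eligible = [p for p in self._seen if p.get("timestamp", 0) <= trigger_time]
        self._training_set = eligible[-self._coreset_size:]

    def get_training_set(self) -> list[dict]:  # hypothetical accessor
        return self._training_set
```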
After #46 is merged, this should be tackled.