Realized that adoption of `luigi` presented an opportunity for, and in some cases sort of required, a broader restructure of the repo. For that reason, I'm going to make these as PRs against a dev branch for intermediate review and incremental understanding of the nature of the changes. I can go over the ideas here in more detail during our meeting today, but at the highest level this involves:
- Consolidating all our `libs` into a single `utils` library
- Hosting individual environments, each of which may contain one or several pipeline steps as entrypoint scripts, in dedicated directories next to this `utils` directory. Each of these will have its own associated container.
- Building these scripts into `luigi` `Task`s and composing these tasks into pipelines in the lightweight `pipelines` library (whose dependencies will probably be pared down to just luigi and pycondor). This is where tasks will get launched from.
- Controlling configs for tasks as much as possible via `jsonargparse` and storing them as YAMLs. The reason for this is that it's what `lightning` (see below) uses for its built-in CLI, but it looks like it could largely replace the functionality of `typeo`, which would be great to deprecate as well. Tasks that need to override the default configs can always subclass the relevant `Task` and specify the desired value for an arg in the `command` method (command-line override of config values is something I've wanted to do in `typeo` for a long time).
- For the `train` project, I've gotten rid of the separate `train` library, which had become superfluous, and have opted to wrap things up in `lightning`, which has not only massively simplified the code, but comes with its own automatic CLI, runs ~10% faster, and is automatically compatible with distributed training (already supported in this implementation) and W&B experiment tracking. This part of the overhaul was maybe not strictly necessary, but as I was consolidating things between the train library and project, I realized it would ultimately make things faster and better.
Remaining tasks for recreation of the original pipeline:
[x] Background event pooling during validation
[ ] Reintroduce SNR scheduling, though we tend to only do it for about half of the first epoch, so I suspect it's not contributing too much
[x] ~Ensure that validation isn't happening on all GPUs when using distributed training~ Implemented distributed validation instead by distributing across shifts
[ ] Unit testing of new modules, leveraging existing unit tests as much as possible (e.g. for augmentation)
Not strictly necessary, but additional useful functionality:
[x] Transition to `WandbLogger` for experiment tracking
[x] Consider creating a dedicated data module to fully separate out training and data code. The difficulty here is just the reuse of parameters/modules between the `Model` and `DataModule`, but it looks like this could be solved via argument linking.
[x] Rather than using a fixed optimizer and scheduler, give these their own section of the config. The problem here is that the scheduler needs to know how many total steps you're going to perform, and we don't know this ahead of time because the number of steps per epoch is a function of the number of waveforms, the waveform probability, and the batch size. However, it looks like the underlying implementation of argument linking in `jsonargparse` supports dynamic computation of arguments, which could solve this.