determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

Unable to use `records_per_epoch` for scheduling? 🤔 [question] #6869

Open · charles-viss opened this issue 1 year ago

charles-viss commented 1 year ago

Describe your question

According to the docs, `records_per_epoch` can be used to schedule validation and checkpoint frequencies in conjunction with the epoch scheduling unit in the config file: https://docs.determined.ai/latest/reference/reference-training/experiment-config-reference.html?highlight=gc%20policy#config-records-per-epoch. However, after upgrading to a newer version of Determined, experiments seem to ignore the `records_per_epoch` field and instead define an epoch by the length of the dataloader. Is there still a way to use `records_per_epoch` to define the epoch length?
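
For context, a sketch of the scheduling-related config fields in question (shown as a Python dict that mirrors the YAML experiment config; the values are illustrative, not from the actual experiments):

```python
# Only the scheduling-related fields are shown; values are made up.
config = {
    "records_per_epoch": 50_000,                 # user-declared epoch length
    "min_validation_period": {"epochs": 1},      # validate every "epoch"
    "min_checkpoint_period": {"epochs": 2},      # checkpoint every 2 "epochs"
    "searcher": {
        "name": "single",
        "metric": "val_loss",
        "max_length": {"epochs": 20},            # train for 20 "epochs"
    },
}
```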

azhou-determined commented 1 year ago

This was deprecated for PyTorch trials in 0.21.0. The doc you linked needs to be updated to reflect this, and I will make a ticket to do that.

Previously, this value was only necessary because our code made it impossible to determine the length of the dataset at the time the trial was initialized. After some refactoring, we no longer have this lifecycle issue and opted not to require users to provide this value. You are correct that we now use the chief worker's dataset length to determine the epoch length.

`records_per_epoch` is no longer supported for PyTorch trials. If you need the epoch length to differ from the length of the dataloader, you can configure lengths in batches, or modify the length of the actual DataLoader.
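
For example, one way to modify the length of the DataLoader is to give it a sampler with a fixed `num_samples`. This is just a sketch with made-up sizes using plain PyTorch, not an official replacement for `records_per_epoch`:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

RECORDS_PER_EPOCH = 50_000  # illustrative value

# Any dataset works here; a toy TensorDataset keeps the sketch self-contained.
dataset = TensorDataset(torch.randn(12_345, 8), torch.randint(0, 10, (12_345,)))

# Drawing RECORDS_PER_EPOCH indices (with replacement) makes len(loader)
# independent of len(dataset), so epoch-based scheduling sees a fixed length.
sampler = RandomSampler(dataset, replacement=True, num_samples=RECORDS_PER_EPOCH)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

assert len(loader) == -(-RECORDS_PER_EPOCH // 64)  # ceil(num_samples / batch_size)
```

In a `PyTorchTrial`, the same kind of sampler could presumably be attached to the DataLoader returned from `build_training_data_loader`.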

charles-viss commented 1 year ago

This still seems undesirable, though, because as you mention in the docs, the length of a dataset/dataloader can vary depending on your augmentation or sampling strategy.

garrett361 commented 1 year ago

Hi @charles-viss, can you expand upon your use case a bit more? Is it that you have a fixed dataset, but then run some augmentation so that the size of the dataset is effectively expanded? If so, is it safe to assume that the size of the expanded dataset is the same for every epoch, or will that also vary?

We're thinking about how we might best address your scenario, but we'd like to make sure we understand the situation precisely first.

garrett361 commented 1 year ago

Also, which version of Determined are you currently running?

charles-viss commented 1 year ago

For example, one use case is training over a fixed dataset with or without category-weighted sampling. Because custom data samplers change the size of the dataloader, I've used the `records_per_epoch` parameter to set a universal epoch length and ensure consistent training times across experiments. The alternative, as you've said, is to do all scheduling in batches, which is doable, but it would require rewrites in several experiments where I've used the epoch unit to control things like LR scheduling or augmentation strategies.

charles-viss commented 1 year ago

One helpful feature would be to have the length of an epoch defined by the length of the dataloader by default, but allow it to be overridden when a `records_per_epoch` parameter is provided. This would at least enable backward compatibility. We are currently unable to fork experiments or reuse config files from before the version update because of this scheduling change.

charles-viss commented 1 year ago

We are currently using Determined version 0.21.2.

garrett361 commented 1 year ago

In 0.21.2, I think converting to batch-based scheduling in all cases is the most robust solution, though I understand this will necessitate some unfortunate refactoring of your current code. Our apologies!
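
As a rough sketch, the epoch-based scheduling fields could be expressed in batches along these lines (again a Python dict mirroring the YAML experiment config, with illustrative values):

```python
GLOBAL_BATCH_SIZE = 64                               # illustrative
BATCHES_PER_EPOCH = 50_000 // GLOBAL_BATCH_SIZE      # former records_per_epoch / batch size

config = {
    "min_validation_period": {"batches": BATCHES_PER_EPOCH},
    "min_checkpoint_period": {"batches": 2 * BATCHES_PER_EPOCH},
    "searcher": {
        "name": "single",
        "metric": "val_loss",
        "max_length": {"batches": 20 * BATCHES_PER_EPOCH},
    },
}
```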

> One helpful feature would be to have the length of an epoch defined by the length of the dataloader by default, but allow it to be overridden when a `records_per_epoch` parameter is provided.

Thank you for the suggestion; we will consider adding such a feature.

One more question about your use case: are you using `WeightedRandomSampler` in the scenario described above? If so, could you use its `num_samples` argument to keep a fixed epoch size rather than Determined's `records_per_epoch`? I understand that this does not address the forking issue, in any case.
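
For reference, a small runnable sketch of that suggestion; the class counts, feature sizes, and epoch length are made up:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

EPOCH_SIZE = 10_000  # fixed epoch length, playing the role of records_per_epoch

# Toy imbalanced dataset: 900 samples of class 0, 100 samples of class 1.
labels = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
features = torch.randn(len(labels), 16)
dataset = TensorDataset(features, labels)

# Per-sample weights proportional to inverse class frequency.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

# num_samples pins the epoch length regardless of the underlying dataset size.
sampler = WeightedRandomSampler(sample_weights, num_samples=EPOCH_SIZE, replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

print(len(loader))  # ceil(EPOCH_SIZE / 32) batches per "epoch"
```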

charles-viss commented 1 year ago

In one situation we use a `WeightedRandomSampler`, where that could work. In other cases, though, such as when using a custom data sampler for triplet loss, the data samplers have essentially infinite length, so a switch to batch-based scheduling will be necessary.
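
To illustrate the kind of sampler meant here (a toy stand-in, not the actual triplet sampler from these experiments): it draws indices forever, so it has no meaningful `__len__` and an "epoch" can only be expressed as a number of batches.

```python
import random
from torch.utils.data import Sampler

class EndlessBalancedSampler(Sampler):
    """Toy sketch of a sampler that draws class-balanced indices forever.

    Because iteration never terminates, the sampler has no meaningful
    __len__, so the dataloader length cannot define an epoch.
    """

    def __init__(self, labels):
        self.by_class = {}
        for idx, label in enumerate(labels):
            self.by_class.setdefault(label, []).append(idx)

    def __iter__(self):
        classes = list(self.by_class)
        while True:  # never raises StopIteration
            yield random.choice(self.by_class[random.choice(classes)])
```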