Open charles-viss opened 1 year ago
This was deprecated for PyTorch trials in 0.21.0. The doc you linked needs to be updated to reflect this, and I will make a ticket to do that.
Previously, this value was only necessary because our code made it impossible to determine the length of the dataset at the time we initialized the trial. Due to some refactoring, we no longer have this lifecycle issue, so we opted not to require users to provide this value. You are correct that we now use the chief worker's dataset length to determine the epoch length.
`records_per_epoch` is no longer supported for PyTorch trials. If you need the length of the epoch to be different from the length of the dataloader, you can configure the length in batches, or modify the length of the actual DataLoader.
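For the second option, one way to modify the dataloader length is to wrap the dataset so it reports whatever length you want the epoch to be. A minimal sketch in plain PyTorch (the `FixedLengthDataset` wrapper and the numbers are illustrative, not part of Determined's API):

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset

class FixedLengthDataset(Dataset):
    """Wrap a map-style dataset so len() returns a chosen value,
    cycling through the underlying samples as needed."""

    def __init__(self, dataset: Dataset, length: int):
        self.dataset = dataset
        self.length = length

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int):
        # Any index in [0, length) maps back onto a real sample.
        return self.dataset[index % len(self.dataset)]

# Toy dataset standing in for the real training data.
base = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# The dataloader (and thus the epoch) now spans 4096 records regardless of
# the underlying dataset size.
train_loader = DataLoader(FixedLengthDataset(base, 4096), batch_size=32)
assert len(train_loader) == 128  # 4096 records / 32 per batch
```

Inside a `PyTorchTrial`, the same wrapping can be applied to the dataset you build in `build_training_data_loader()`.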
This still seems undesirable, though, because as the docs mention, the length of a dataset/dataloader can vary depending on your augmentation or sampling strategy.
Hi @charles-viss, can you expand upon your use case a bit more? Is it that you have a fixed dataset, but then run some augmentation so that the size of the dataset is effectively expanded? If so, is it safe to assume that the size of the expanded dataset is the same for every epoch, or will that also vary?
We're thinking about how best to address your scenario, but we'd like to make sure we understand the situation precisely first.
Also, which version of Determined are you currently running?
For example, one use case is training over a fixed dataset with or without category-weighted sampling. Because custom data samplers change the size of the dataloader, I've used the `records_per_epoch` parameter to set a universal epoch length and ensure consistent training times across experiments. The alternative, like you've said, is to do all scheduling in batches, which is doable but would require rewrites in several experiments where I've used the `epoch` unit to control things like lr scheduling or augmentation strategies.
One helpful feature could be to have the length of an epoch defined by the length of the dataloader by default, but overridden if a `records_per_epoch` parameter is provided. This would at least enable backward compatibility: we are currently unable to fork experiments or reuse config files from before the version update due to this scheduling change.
We are currently using Determined version 0.21.2.
In 0.21.2, I think converting to scheduling in `batches` in all cases is the most robust solution, though I understand this will necessitate some unfortunate refactoring of your current code. Our apologies!
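Concretely, "converting to batches" means expressing the searcher length and the validation/checkpoint periods in `batches` rather than `epochs`. A rough sketch of the relevant fields, written as a Python dict for concreteness and assuming the 0.21-era config schema (values are placeholders; in practice this lives in the experiment's YAML config file):

```python
# Relevant pieces of an experiment config with scheduling expressed in
# batches instead of epochs + records_per_epoch (placeholder values).
config = {
    "searcher": {
        "name": "single",
        "metric": "validation_loss",
        "smaller_is_better": True,
        "max_length": {"batches": 10_000},  # instead of {"epochs": N}
    },
    "min_validation_period": {"batches": 500},
    "min_checkpoint_period": {"batches": 500},
}
```

Any epoch-based logic in the trial code itself (lr schedules, augmentation switches) would likewise need to be keyed off batch counts.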
> One helpful feature could be to have the length of an epoch defined by the length of the dataloader by default, but then could be overridden if a `records_per_epoch` parameter is provided.
Thank you for the suggestion, we will consider adding such a feature.
One more question about your use case: are you using `WeightedRandomSampler` in the scenario described above? If so, can you use its `num_samples` argument to keep a fixed epoch size, rather than Determined's `records_per_epoch`? I understand that this does not address the issue regarding forking, in any case.
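For reference, roughly what we have in mind (a minimal sketch in plain PyTorch with toy data; the weights and sizes are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy dataset standing in for the real training data.
train_dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# Per-sample weights, e.g. inverse category frequency (uniform here as a placeholder).
sample_weights = torch.ones(len(train_dataset))

# num_samples fixes how many samples one pass over the loader draws, so the
# dataloader length (and therefore the epoch length) stays constant
# regardless of the underlying dataset size.
sampler = WeightedRandomSampler(weights=sample_weights, num_samples=4096, replacement=True)

train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
assert len(train_loader) == 128  # 4096 samples / 32 per batch
```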
In one situation we use a `WeightedRandomSampler`, where that could work. In other cases, though, such as when using a custom data sampler for triplet loss, the data samplers have essentially infinite length and a switch to batch-based scheduling will be necessary.
Describe your question
According to the docs, `records_per_epoch` can be used to schedule validation and checkpoint frequencies in conjunction with the `epoch` scheduling unit in the config file: https://docs.determined.ai/latest/reference/reference-training/experiment-config-reference.html?highlight=gc%20policy#config-records-per-epoch. However, upon upgrading to a newer version of Determined, experiments seem to ignore the `records_per_epoch` field and instead define an epoch by the length of the dataloader. Is there a way to still use `records_per_epoch` to define epoch length instead?