We would like a batching scheme that balances batches AND ensures that a whole epoch covers the entire dataset. This is much more complicated than weighted/pseudorandom batching, since you have to keep track of which labeled segments you've already handled. One strategy would be to generate all possible batch items up front and shuffle them, but this could be a big footgun if there are MANY possible batch items. One way to cut down on their number is to use partially overlapping windows of a fixed duration and keep an index from label classes to the windows that contain them.
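For illustration, here is a minimal sketch of that windowing idea in Python. Everything here is hypothetical rather than part of our actual code: the names (`build_window_index`, `epoch_batches`), the `(start, end, label)` segment format, and the assumption that times are in seconds and that a hop smaller than the window duration produces the overlap.

```python
import math
import random
from collections import defaultdict

def build_window_index(segments, window_dur, hop):
    """Map each label class to the overlapping windows that contain it.

    segments: iterable of (start, end, label) tuples, times in seconds
    (an assumed format, not our real one).
    hop < window_dur yields partially overlapping windows.
    """
    index = defaultdict(list)
    for start, end, label in segments:
        # Earliest window start w = k * hop whose span [w, w + window_dur)
        # still overlaps the segment [start, end).
        k = max(0, math.floor((start - window_dur) / hop) + 1)
        w = k * hop
        while w < end:
            index[label].append((w, w + window_dur))
            w += hop
    return index

def epoch_batches(index, batch_size, seed=0):
    """Yield roughly class-balanced batches that cover every indexed
    window exactly once per epoch."""
    rng = random.Random(seed)
    # Shuffle each class's windows independently, then round-robin over
    # classes so batches stay balanced for as long as classes remain.
    pools = {lbl: rng.sample(ws, len(ws)) for lbl, ws in index.items()}
    batch = []
    while any(pools.values()):
        for lbl, pool in pools.items():
            if pool:
                batch.append((lbl, pool.pop()))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch  # final short batch, so the epoch covers everything
```

Popping from pre-shuffled per-class pools is what gives full-epoch coverage, and the round-robin over classes is what keeps batches balanced. The fiddly parts this sketch glosses over are exactly the ones that would need validation: deduplicating windows that multiple segments of the same class map to, and checking that nothing is silently dropped at epoch boundaries.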
This remains a nice-to-have, since it's very fiddly and would require a lot of validation to make sure it's correct and performant. For the time being, the pseudorandom/online batching strategy works well enough for the kinds of models we're training.