epfml / disco

DISCO is a code-free and installation-free browser platform that allows any non-technical user to collaboratively train machine learning models without sharing any private data.
https://discolab.ai
Apache License 2.0

remove epochs and only use batches #689

Open tharvik opened 2 months ago

tharvik commented 2 months ago

after discussion, it looks like epochs are not really needed; we can directly use batches. so going from "round -> epoch -> batch" to "round -> batch". that would give more direct control on
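a minimal sketch of the loop-shape change (every name here is hypothetical, nothing is actual discojs code):

```ts
// purely illustrative: the nesting goes from "round -> epoch -> batch"
// to "round -> batch", with no epoch loop in between
async function train<T>(
  rounds: number,
  batchesForRound: (round: number) => AsyncIterable<T[]>,
  trainOnBatch: (batch: T[]) => Promise<void>,
  endOfRound: () => Promise<void>, // e.g. weight exchange between clients
): Promise<void> {
  for (let round = 0; round < rounds; round++) {
    for await (const batch of batchesForRound(round))
      await trainOnBatch(batch);
    await endOfRound();
  }
}
```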

JulienVig commented 2 months ago

Would it be possible to still support the concept of epochs somehow? If I'm going to train a model on a dataset, I will think in terms of epochs rather than batches (or rounds) for sure, so I would find it confusing and limiting not to be able to know how the number of batches I choose translates into epochs. What about allowing the user to specify either batches or epochs? (Annoying from an implementation standpoint but could be nice UX?)

tharvik commented 2 months ago

Would it be possible to still support the concept of epochs somehow? If I'm going to train a model on a dataset, I will think in terms of epochs rather than batches (or rounds) for sure, so I would find it confusing and limiting

even though "epoch" is used throughout libraries, I don't think it is really important for training a model. from a network perspective, we only need the clients to train for a certain amount of time on their data, not a specific number of epochs (nor batches, but that's for another time). I have the feeling that I'm missing some deeper ML knowledge here: why do you find it limiting? does the model need to know that it has now seen "all the dataset" (which is the meaning of epoch for me)?

not to be able to know how the number of batches I choose translates into epochs. What about allowing the user to specify either batches or epochs? (Annoying from an implementation standpoint but could be nice UX?)

this changes the concept of batch quite fundamentally: it would now be a fixed-size random extract of the dataset. I'll use "sample" from now on as I find it clearer. there is no real translation of samples to epochs, as it is random now. to get a probable (>=50%) epoch of the dataset, one could use

const sampleCount = epochCount * dataset.size / sampleSize

this way, we can also avoid having both implementations in discojs and only have the conversion computation outside of discojs.
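with made-up numbers (none of these values come from discojs), the formula comes out as:

```ts
// hypothetical values, just to illustrate the formula above
const epochCount = 2;       // about two passes' worth of data
const datasetSize = 10_000; // dataset.size
const sampleSize = 32;      // rows per sample, i.e. per batch

// number of random fixed-size samples to draw so that roughly
// epochCount * datasetSize rows are seen in total
const sampleCount = Math.ceil((epochCount * datasetSize) / sampleSize); // 625
```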

martinjaggi commented 2 months ago

if you want, it's possible to offer the best of both worlds. it would only affect the UI, not the functionality: the user can specify their round duration either in epochs (which should allow fractional values such as 0.2) or in batches (= steps). in the code we'd always use batches afterwards.

this relies on the assumption that the overall dataset size is known (or also specified in the UI)
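a minimal sketch of that normalization (the RoundDuration type and toBatches helper are hypothetical, not discojs API):

```ts
// the round duration is given either in (possibly fractional) epochs
// or directly in batches, and is always normalized to batches in code
type RoundDuration =
  | { epochs: number } // fractional values such as 0.2 are allowed
  | { batches: number };

function toBatches(
  duration: RoundDuration,
  datasetSize: number, // assumed known, or asked for in the UI
  batchSize: number,
): number {
  if ("batches" in duration) return duration.batches;
  return Math.ceil((duration.epochs * datasetSize) / batchSize);
}

toBatches({ epochs: 0.2 }, 10_000, 32); // => 63 batches
toBatches({ batches: 100 }, 10_000, 32); // => 100 batches
```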


JulienVig commented 2 months ago

the user can specify their round duration either in epochs (which should allow fractional values such as 0.2) or in batches (= steps).

Yes! That's exactly what I meant

we only need the clients to train for a certain amount of time on their data

As a user, how I choose what a "certain amount of time" is would depend on the concept of epochs. Ideally, I have a sizeable and manageable amount of data and I want to train for exactly one epoch: I take advantage of all the data available while the model sees each data point only once, so less overfitting.

If I can only choose a number of batches (= samples), I will not know if the number of batches I choose represents more or less than one pass over the dataset.

In practice there's usually not enough data and I will want to do multiple passes, or I may have too much data and then I would like to do a fraction of an epoch (in which case specifying a number of samples would be useful).

Essentially, when I think about how much data I want the model to see, I reason in terms of the number of passes over the dataset (= epochs) and not in terms of samples (= batches). That may be very personal, and that's why I think being able to choose would be nice.

tharvik commented 2 months ago

okay, so we need support for both partial datasets (sample-based) and full datasets (one epoch). so when someone asks to train for

that does require changing discojs itself, as we will in fact have two types, PartialDataset and FullDataset, both implementing Dataset (a batch generator). in the end, the training will only be on batches, so we will drop the explicit epoch layer and chain the various Dataset implementations.
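as a minimal sketch (only the names Dataset, PartialDataset, and FullDataset come from this thread; everything else is assumed):

```ts
interface Dataset<T> {
  batches(): AsyncGenerator<T[]>; // a Dataset is just a batch generator
}

// full dataset: every element exactly once, shuffled (one epoch)
class FullDataset<T> implements Dataset<T> {
  constructor(
    private readonly rows: T[],
    private readonly batchSize: number,
  ) {}

  async *batches(): AsyncGenerator<T[]> {
    const order = [...this.rows];
    for (let i = order.length - 1; i > 0; i--) {
      // Fisher-Yates shuffle
      const j = Math.floor(Math.random() * (i + 1));
      [order[i], order[j]] = [order[j], order[i]];
    }
    for (let i = 0; i < order.length; i += this.batchSize)
      yield order.slice(i, i + this.batchSize);
  }
}

// partial dataset: a fixed number of batches, sampled with replacement
class PartialDataset<T> implements Dataset<T> {
  constructor(
    private readonly rows: T[],
    private readonly batchSize: number,
    private readonly batchCount: number,
  ) {}

  async *batches(): AsyncGenerator<T[]> {
    for (let b = 0; b < this.batchCount; b++)
      yield Array.from(
        { length: this.batchSize },
        () => this.rows[Math.floor(Math.random() * this.rows.length)],
      );
  }
}

// chaining implementations gives e.g. "2.5 epochs" without an epoch layer
async function* chain<T>(...datasets: Dataset<T>[]): AsyncGenerator<T[]> {
  for (const dataset of datasets) yield* dataset.batches();
}
```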

is that what you had in mind?

JulienVig commented 2 months ago

Yes! I expect that most cases would either be a fraction less than one or an integer number of epochs though

martinjaggi commented 2 months ago

just a comment on random sampling: either it should be done in both cases (full epochs and fractional ones), or not at all. in the latter case this means that we'd assume the dataset is already shuffled (if that's an assumption, it would be good to state it in the readmes and code). btw, if it's shuffled, you don't need sampling but can just go with the first 20% of that ordered dataset.

so maybe it's easiest to do dataset shuffling in the preprocessing, or else not do any sampling/shuffling ever?
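a tiny sketch of that shortcut (epochFraction is a made-up helper, not discojs code):

```ts
// with a pre-shuffled dataset, a fractional epoch is just a prefix of it
function epochFraction<T>(shuffledRows: T[], fraction: number): T[] {
  return shuffledRows.slice(0, Math.floor(fraction * shuffledRows.length));
}

// epochFraction(rows, 0.2) is the "first 20%" mentioned above
```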

in terms of terminology, i'd say batch size is clearer than sample/sample size (more robust in meaning across all scenarios)

tharvik commented 2 months ago

just a comment on random sampling: either it should be done in both cases (full epochs and fractional ones), or not at all.

in my understanding, sampling can return previously seen elements within the same iteration (it might even return the same element twice in a single batch, with very low probability), so it's incompatible with a full epoch (all lines once). I contrast that with shuffling, which can be applied to a full epoch and returns every element of the dataset once, but in a random order. now with these definitions out of the way, I agree that every full dataset should be shuffled; and as partial datasets sample their elements, they're already random. does that make sense?
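to illustrate the two definitions as index generators (nothing here is discojs code):

```ts
// sampling: each draw is independent, so repeats are possible; O(1) memory
function* sampledIndices(datasetSize: number): Generator<number> {
  while (true) yield Math.floor(Math.random() * datasetSize);
}

// shuffling: every index exactly once, in random order; the remaining
// indices have to be tracked, hence the memory cost mentioned further down
function* shuffledIndices(datasetSize: number): Generator<number> {
  const remaining = Array.from({ length: datasetSize }, (_, i) => i);
  while (remaining.length > 0) {
    const pick = Math.floor(Math.random() * remaining.length);
    yield remaining[pick];
    remaining[pick] = remaining[remaining.length - 1]; // swap-remove
    remaining.pop();
  }
}
```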

btw, if it's shuffled, you don't need sampling but can just go with the first 20% of that ordered dataset. so maybe it's easiest to do dataset shuffling in the preprocessing, or else not do any sampling/shuffling ever?

that means that the model will always train on the same part of the dataset, is that an issue?

FWIW, whole-dataset shuffling is a bit costly memory-wise, as we have to keep track of the remaining elements.

in terms of terminology, i'd say batch size is clearer than sample/sample size (more robust in meaning across all scenarios)

yep, I agree, batch makes more sense now.