danbider / lightning-pose

Accelerated pose estimation and tracking using semi-supervised convolutional networks.

how to set up min_epochs and max_epochs #140

Closed Wulin-Tan closed 5 months ago

Wulin-Tan commented 5 months ago

Hi, Lightning Pose team. I tried Lightning Pose with the temporal model. It seemed that the model converged at around 100 epochs, and the default setting in the example is around 100-300 epochs. Would a few hundred epochs of training be enough for the temporal model?

And how about the baseline model? Still a few hundred epochs?

Usually when I use DeepLabCut, I have to go to 200k-500k iterations or even more. I am not sure what the relationship is between an epoch in LP and an iteration in DLC, but LP seems to converge or finish training much faster on my dataset.

themattinthehatt commented 5 months ago

Hi @Wulin-Tan I would recommend starting with the parameters in the default configuration files - min/max epochs set at 300 (so that you train for exactly 300 epochs). This is the setting we use for many of our experiments and have found it to work well across a range of dataset types (pupil, head-fixed mouse, freely moving fish/mouse, etc.) and model types (baseline, context, semi-supervised, etc.).
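If you want to set this programmatically rather than editing the YAML by hand, something like the following sketch works with OmegaConf (the config path and the exact key locations are assumptions based on the default config layout, so adjust to your own file):

```python
# Minimal sketch: pin min/max epochs at 300 in a copy of the default config
from omegaconf import OmegaConf

cfg = OmegaConf.load("scripts/configs/config_default.yaml")  # assumed path; point to your own config
cfg.training.min_epochs = 300  # assumed key location under "training"
cfg.training.max_epochs = 300
OmegaConf.save(cfg, "my_config.yaml")
```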

In our case an "epoch" is one full pass through the dataset. As an example, if you have 400 labeled frames, and train_batch_size=10, then your dataset is divided into 400/10 = 40 batches. One "batch" in this case is equivalent to one "iteration" for DLC. Therefore, 300 epochs, at 40 batches per epoch, is equal to 300*40=12k total batches (or iterations).
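To make the conversion concrete, here is the same arithmetic as a tiny Python snippet (the numbers are just the example values above, not recommendations):

```python
# Epoch -> iteration conversion for the example above
num_labeled_frames = 400   # size of the labeled training set
train_batch_size = 10      # training.train_batch_size
num_epochs = 300           # min/max epochs in the default config

batches_per_epoch = num_labeled_frames // train_batch_size  # 40
total_iterations = num_epochs * batches_per_epoch            # 12,000 DLC-style iterations
print(f"{batches_per_epoch} batches/epoch -> {total_iterations} total iterations")
```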

Both the iteration- and epoch-based approaches are valid ways to think about training a network, just a bit different in practice. Hopefully the example above clearly illustrates how to convert between the two for your particular dataset (which, again, will depend on the total number of labeled frames and the batch sizes you use for both Lightning Pose and DLC). Please let me know if you have additional questions, this is a great question and I will make sure to update the documentation to discuss the conversion between epochs and iterations.

Wulin-Tan commented 5 months ago

Hi @themattinthehatt, when I set the batch size to 4, 8, or 16 for my dataset, it always takes around 2 minutes to finish an epoch. Why doesn't it get faster with a larger batch size? Does batch size here mean the number of images processed in the same round?

themattinthehatt commented 5 months ago

I'd say this isn't too surprising - with larger batch sizes you have fewer batches per epoch, but it also takes longer to assemble one batch. One thing you could try, depending on your workstation specs, is to increase the value for the parameter training.num_workers in the config file; this specifies how many threads are used to construct batches in the data loader. The default is 4, but if you have more CPU cores you could bump this up to 8, say, and see if that speeds up training.

I will also note that changing the batch size in training.train_batch_size only affects the processing of labeled data. If you are training a temporal model that also uses unsupervised losses, there will be additional overhead for loading and processing the video data. This entire process is controlled by the parameters under the dali heading in the config file, so you might want to play around with those as well. For example, changing dali.base.train.sequence_length controls the batch size of the unlabeled data, and making this number smaller will result in smaller unlabeled batches, and potentially faster training (though bigger batches are usually better here!).
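As a concrete sketch, here is how you might adjust both of those knobs with OmegaConf (the config path and the specific values are illustrative assumptions; the parameter names are the ones discussed above):

```python
# Sketch only: bump the data-loading parameters discussed above and save a new config
from omegaconf import OmegaConf

cfg = OmegaConf.load("scripts/configs/config_default.yaml")  # assumed path; point to your own config
cfg.training.num_workers = 8                 # CPU threads assembling labeled batches (default 4)
cfg.dali.base.train.sequence_length = 16     # unlabeled (video) batch size; illustrative value
OmegaConf.save(cfg, "my_config.yaml")
```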

We have recently updated our manuscript on biorxiv, and there is a new supplementary figure that discusses training time with different model types and batch sizes; please see Supplementary Figure 9 here if you're interested.

Wulin-Tan commented 5 months ago

Hi @themattinthehatt, I tried training.num_workers. It is exactly what you suggested and the speedup is amazing. But isn't the model trained on the GPU? Why would more CPU num_workers help so much? And what num_workers do you suggest? As for dali.base.train.sequence_length, which number would you suggest starting from? For now I have just kept the default setting.

themattinthehatt commented 5 months ago

Great, I'm glad you were able to see some speedup! You are correct that the model is trained on the GPU. But the frames, which are saved on disk, need to be gathered and placed into batches, which are then fed to the model on the GPU. If you have just a single CPU worker constructing the batches, the model processes one batch before the worker can construct the next one, so the model sits idle for periods of time waiting for the next batch. Multiple CPU workers allow for a constant stream of batches to be fed to the GPU so that it never sits idle, and that's where the speedup comes from.
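As a generic illustration of what num_workers controls (this is plain PyTorch, not Lightning Pose's actual data pipeline, which uses DALI on the video side):

```python
# Toy illustration of CPU workers feeding the GPU; NOT the Lightning Pose pipeline
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(400, 3, 256, 256))  # stand-in for 400 labeled frames
    loader = DataLoader(dataset, batch_size=10, num_workers=8, pin_memory=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for (batch,) in loader:
        # workers assemble the next batch on the CPU while this one is processed on the GPU
        batch = batch.to(device, non_blocking=True)
        # ... forward/backward pass would happen here ...

if __name__ == "__main__":  # guard needed when num_workers > 0 with the spawn start method
    main()
```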

The best value for num_workers will depend on many factors - your dataset, model type, hardware (CPU/GPU), etc. You can try a few different values and see what works best for you. If you saw a performance increase going from 4 to 8 you might try 12 next; but you might also find that performance maxes out around 8.

For sequence_length, the most important factor to keep in mind is GPU memory. You can see how much memory your current training is taking by running the following command from a new terminal:

watch -n 0.5 nvidia-smi

This will show you GPU information, updated every 0.5 seconds (that's the -n 0.5 part), and you can see how much memory is being used on the GPU. The larger your unlabeled batches (governed by dali.base.train.sequence_length) the better, since your network will see more unlabeled data. You can try to keep increasing this until you get an "Out of Memory" error. Again, the point at which you hit this error depends a lot on your data and hardware. In general the default is good, and I would never go below 6 or 8.