epfml / ML_course

EPFL Machine Learning Course, Fall 2024
https://www.epfl.ch/labs/mlo/machine-learning-cs-433/

Modified batch_iter function #79

Closed JuanSapriza closed 1 year ago

JuanSapriza commented 1 year ago

For a while I had the same problem as Jakhongir's question in the forum: the implementation of batch_iter forces you to reshuffle if your choice of batch_size and num_batches does not fit the dataset (e.g. batches that are too large, too few batches, or more batches than fit inside the dataset).

In order to comply with the test from Project 1 (which has a single data point but requires 2 iterations) while still using the batch_iter function, you would need an inefficient nested for loop that shuffles the (single) data point twice.
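To make the problem concrete, here is a minimal sketch of a shuffle-then-slice iterator of the kind described above. It is an assumption about the general shape of such a function, not the course's exact code, and the name `shuffle_then_slice` is hypothetical.

```python
import numpy as np


def shuffle_then_slice(y, tx, batch_size, num_batches=1, shuffle=True):
    """Hypothetical baseline: shuffle the data once, then cut consecutive slices."""
    data_size = len(y)
    if shuffle:
        # Fancy indexing copies y and tx in shuffled order.
        perm = np.random.permutation(data_size)
        y, tx = y[perm], tx[perm]
    for batch_num in range(num_batches):
        start = batch_num * batch_size
        end = min(start + batch_size, data_size)
        if start < end:  # nothing is yielded once the data is exhausted
            yield y[start:end], tx[start:end]


# One data point, two requested batches: only one batch comes out.
y = np.array([1.0])
tx = np.array([[1.0, 2.0]])
batches = list(shuffle_then_slice(y, tx, batch_size=1, num_batches=2))
print(len(batches))  # 1
```

With an iterator like this, getting a second iteration means calling it again from an outer loop, which reshuffles the data on every call.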

I propose a different approach: shuffling indexes instead of the data points. It is compliant with the project tests and considerably more efficient than the previous one (under reasonable circumstances). Furthermore, it allows as many iterations as desired, regardless of the dataset size.

image
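The screenshot is not reproduced in this text. As a rough illustration of index-based batching, the sketch below draws one start index per batch and slices the original arrays, so no shuffled copy of the data is made. This is an assumption about how such a function might look, not the code from the image; `batch_iter_sketch` and its `rng` parameter are hypothetical names.

```python
import numpy as np


def batch_iter_sketch(y, tx, batch_size, num_batches=1, shuffle=True, rng=None):
    """Yield num_batches mini-batches by drawing start indexes, not by shuffling data."""
    rng = np.random.default_rng() if rng is None else rng
    data_size = len(y)
    batch_size = min(data_size, batch_size)  # a batch cannot exceed the dataset
    if shuffle:
        # One random start index per batch; batches may overlap or repeat.
        max_start = data_size - batch_size  # last valid start index
        starts = rng.integers(0, max_start + 1, size=num_batches)
    else:
        # Deterministic: walk through non-overlapping batches, wrapping around.
        max_batches = data_size // batch_size
        starts = (np.arange(num_batches) % max_batches) * batch_size
    for start in starts:
        yield y[start:start + batch_size], tx[start:start + batch_size]


# A single data point with two requested iterations now yields two batches.
y = np.array([1.0])
tx = np.array([[1.0, 2.0]])
batches = list(batch_iter_sketch(y, tx, batch_size=1, num_batches=2))

# Without shuffling, the batches cycle through the data in order.
ordered = list(
    batch_iter_sketch(np.arange(10), np.arange(20).reshape(10, 2),
                      batch_size=4, num_batches=3, shuffle=False)
)
```

Because the number of batches is decoupled from the dataset size, the caller can request any `num_batches` in a single loop.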

The implementation has certain caveats regarding the randomness inside each batch, but the obvious workaround is using batch_size = 1, which should still be (slightly) more efficient than the previous implementation.
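One way to read this caveat, assuming batches are taken as contiguous slices starting at a random index (an assumption, since the screenshot is not available here): the position of a batch is random, but the rows inside it are always neighbours. The helper `contiguous_batch` below is hypothetical and only illustrates that effect.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)  # row i holds the value i, so neighbours are easy to spot


def contiguous_batch(data, batch_size, rng):
    """Hypothetical helper: one batch as a contiguous slice at a random start."""
    start = rng.integers(0, len(data) - batch_size + 1)
    return data[start:start + batch_size]


# batch_size > 1: the start is random, but the rows inside the batch are neighbours.
batch = contiguous_batch(data, 5, rng)
print(np.all(np.diff(batch) == 1))  # True

# batch_size = 1: every draw is an independent, uniformly chosen row.
singles = np.concatenate([contiguous_batch(data, 1, rng) for _ in range(5)])
```

If the dataset is stored in a meaningful order (for example sorted by label), contiguous batches inherit that order, which is why batch_size = 1 is the safe choice under this reading.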

Regards, thank you and keep up the awesome work!

martinjaggi commented 1 year ago

thanks a lot, we'll check it very soon

laraorlandic commented 1 year ago

I checked that the updated function still produces the desired behavior on lab 2 and the project 1 grading tests, and it all looks good! Great work, Juan!