The data generation script has some quirks, and the main notebook works around them by using minibatch-based evaluation. With a validation set of 99 examples this breaks unless the training batch size is at least 99: the loss should be computed over 3 minibatches of size 32 plus one of size 3, but the last minibatch actually still has size 32, with its final 29 examples all carrying label 0 as padding. After some training the model assigns low probability to those spurious label-0 examples, so this part of the validation set accumulates a huge loss. Everything still runs, but the validation loss (NLL) is heavily overestimated.
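To make the effect concrete, here is a minimal sketch of the overestimation (the 0.9/0.01 probabilities are made-up illustrations; only the 99/32/29 counts and the label-0 padding come from the description above):

```python
import numpy as np

n_val, batch_size = 99, 32            # 99 validation examples, batches of 32
n_pad = 4 * batch_size - n_val        # 29 padding slots, all label 0

# A somewhat-trained model: assigns probability ~0.9 to each true label,
# but only ~0.01 to the spurious label-0 padding examples (illustrative).
p_real = np.full(n_val, 0.9)
p_pad = np.full(n_pad, 0.01)

nll_correct = -np.log(p_real).mean()                          # ~0.11
nll_padded = -np.log(np.concatenate([p_real, p_pad])).mean()  # ~1.13

print(f"NLL over the 99 real examples: {nll_correct:.3f}")
print(f"NLL with the 29 padded zeros:  {nll_padded:.3f}")
```

With these assumed probabilities the padded average comes out roughly ten times larger than the true one, which is the overestimation being described: masking out the padded examples, or weighting each minibatch by its real example count, would fix it.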
Andrea: