Open axnedergaard opened 6 years ago
So, I've been fighting quite a bit with the debugging. Despite that, at the time of writing the model does not perform better than random guessing. The main cause? Probably the discriminator completely overwhelming the generator. I'm doing some tweaks with all the pre-training stuff and so on, but I'm pretty much outplayed by it... so I'm not sure where to go from here. For this reason, if anyone has the chance to try stuff (we can all use the cluster at the same time) or maybe do some last-minute magic or whatever, that'd be awesome. For now, I'm gonna keep fighting it.
As a side note, we haven't at any point used the possibility of a fully supervised step using the two validation sets we have. Maybe that could be somewhat useful. Also, maybe there are alternative tweaks that could really help make the generator stronger; a rough sketch of one such tweak is below.
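Just a thought, not something I've tried here: one standard way to keep the discriminator from running away is to take several generator steps per discriminator step. A minimal sketch, where `disc_step`/`gen_step` are hypothetical stand-ins for whatever update functions the code already has:

```python
# Hypothetical sketch (not what's in the repo): run several generator updates
# per discriminator update so the discriminator can't overwhelm the generator.
# `disc_step` / `gen_step` stand in for the existing training-step functions.
def balanced_epoch(batches, disc_step, gen_step, gen_steps_per_disc_step=5):
    for batch in batches:
        d_loss = disc_step(batch)               # one discriminator update
        for _ in range(gen_steps_per_disc_step):
            g_loss = gen_step(batch)            # several generator updates
    return d_loss, g_loss
```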
In any case, peace dudes, we'll make it somehow :vulcan_salute:
So I'm mainly busy with writing the report, and especially if no one else works on it, I don't think I'll have time for anything else. It's 30% of the grade, so it seems like an effective strategy to at least max out our grade there.
However, my 2 cents are:
- The cosine similarity loss term in the generator loss is probably important (I will implement that properly now so it doesn't cause negative loss; rough sketch after this list).
- I have not set the hyperparameters to mirror the paper, so if someone else has not done this it is worth doing (I will do that now).
- Perhaps pretraining by minimizing the cosine distance is the wrong approach and we need a different distance/loss metric (I know Lama worked on this yesterday).
- It is important that we pretrain for long enough (looking at the generator pretrain loss graph in Tensorboard is helpful here).
- Perhaps the document and sentence RNN weights should be updated during generator training (I know Lama worked on this yesterday).
- There are other hyperparameters that might need tweaking, in particular the "stop word reached" cutoff and the "add discriminator noise standard deviation", maybe others.
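For the first bullet, the kind of fix I mean for the negative-loss issue: use `1 - cosine_similarity` (which lives in [0, 2]) instead of `-cosine_similarity` (which can go negative). A minimal NumPy sketch of the idea; the real term would of course be written with whatever tensor ops the model already uses, and the names here are made up:

```python
import numpy as np

def cosine_distance_loss(generated, target, eps=1e-8):
    """Non-negative cosine loss term: 1 - cos_sim stays in [0, 2], so adding it
    to the generator loss can never make the total negative."""
    num = np.sum(generated * target, axis=-1)
    denom = np.linalg.norm(generated, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    cos_sim = num / denom
    return np.mean(1.0 - cos_sim)
```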
There might be a good counter-argument, but on the face of it the idea of doing supervised training on the validation set seems like it will cause overfitting and lower our score on the test set. Another way to use our data more that comes to mind is to also use the 4 input sentences during pretraining, although I have a feeling it's not a good idea.
I completed the first two points about cosine similarity and hyperparameters.
I imagine most of our debugging time will be spent on this.