benearnthof / TimeGAN

A PyTorch implementation of Time-series Generative Adversarial Networks (https://github.com/jsyoon0823/TimeGAN)

Dear author, I am a little curious about the correspondence between the NIPS paper and the actual code #2

Closed BaeHann closed 1 year ago

BaeHann commented 1 year ago

Hi Mr. benearnthof: really sorry to bother you again! I have run into two new questions.

  1. I am a little confused by the presence of the "supervisor", since in the original paper the supervised loss seems to be applied to the generator. Furthermore, the input of the "Supervisor" does not include the noise Z at all, which seems to conflict with the formula in the paper. (I am not a native speaker of English, and it is absolutely not my intention to challenge or offend you. My purpose is simply to ask you about my questions.)
  2. I can hardly tell which part of the model corresponds to computing the stepwise conditional distributions and to matching them. I have heard that the output of a neural network can be viewed as the conditional mean of the label given the input, so that the MSE loss is effectively maximum-likelihood estimation under a Gaussian assumption. However, I still cannot tell which part of the model and which MSE (there are so many MSEs in the code) carry out these tasks. Could you be so kind as to explain the statistical foundation and where it appears in the code?

Thank you a lot, and I wish you every success in the future! [Screenshot attached: 2023-03-15 211512]

benearnthof commented 1 year ago

I am not sure if I can clear up all your questions, as this was a winter project for a university class and I now specialize mostly in image models and a bit of NLP, haha. If I remember correctly, the authors wanted to let the generator train on more information than simply the discriminator's binary decision of "this data is real" or "this data is fake", so they use an auxiliary embedding network that computes "embeddings" => a latent vector representation of the original data. To address the second point of your question: the code calculates four losses and optimizes each of them separately, to train the generator, embedding network, discriminator, and recovery network jointly. These are all calculated on their own and optimized with distinct optimizers, each initialised with its own learning rate. I would have to read the paper again for more detail, but I do agree that the approach in this paper, the original TensorFlow implementation, and our PyTorch implementation are all pretty convoluted 😅
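
To illustrate what "distinct optimizers per component" means in practice, here is a minimal sketch. The module definitions, shapes, and learning rates are stand-ins for illustration only, not the exact ones used in modules_and_training.py:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the TimeGAN components; the real modules in
# modules_and_training.py are configurable recurrent networks.
embedder = nn.GRU(input_size=5, hidden_size=24, batch_first=True)
recovery = nn.GRU(input_size=24, hidden_size=5, batch_first=True)
generator = nn.GRU(input_size=5, hidden_size=24, batch_first=True)
discriminator = nn.GRU(input_size=24, hidden_size=1, batch_first=True)

# One optimizer per component, each with its own learning rate.
opt_e = torch.optim.Adam(embedder.parameters(), lr=1e-3)
opt_r = torch.optim.Adam(recovery.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

# Each loss is computed on its own and stepped with its own optimizer,
# e.g. a reconstruction loss for embedder + recovery:
x = torch.randn(8, 32, 5)            # (batch, seq_len, features)
h, _ = embedder(x)
x_tilde, _ = recovery(h)
rec_loss = nn.functional.mse_loss(x_tilde, x)

opt_e.zero_grad(); opt_r.zero_grad()
rec_loss.backward()
opt_e.step(); opt_r.step()
```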

The loss calculations can be found here: https://github.com/benearnthof/TimeGAN/blob/0c8ab7133eb41369ce2b2815e07915fd8651e27f/modules_and_training.py#L216 They all boil down to binary cross entropy and mean squared error loss, because we assume that the latent representations we obtain from the embedding & recovery auxiliary networks are informative enough to make these losses sufficiently smooth for training. It should be noted that during our experiments the training was not stable, so I'd recommend a WGAN to be honest!
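
As a rough sketch of how those two loss types typically show up in a TimeGAN-style setup (the tensors below are random placeholders and the variable names are assumptions, not the exact terms in modules_and_training.py):

```python
import torch
import torch.nn.functional as F

# Hypothetical discriminator outputs (logits) on real and generated latent sequences.
y_real = torch.randn(8, 32, 1)
y_fake = torch.randn(8, 32, 1)

# Binary cross entropy: the discriminator should label real latents 1 and fake latents 0.
d_loss = (
    F.binary_cross_entropy_with_logits(y_real, torch.ones_like(y_real))
    + F.binary_cross_entropy_with_logits(y_fake, torch.zeros_like(y_fake))
)

# Mean squared error: e.g. reconstructing the original series from its embedding.
x = torch.randn(8, 32, 5)
x_tilde = torch.randn(8, 32, 5)      # stand-in for recovery(embedder(x))
reconstruction_loss = F.mse_loss(x_tilde, x)
```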

This, of course, also depends on the data you want to model. Hope I could help!

BaeHann commented 1 year ago

Thank you very much for your detailed reply. I think I should clarify what I am confused about.

  1. As you've mentioned, "the code calculates four losses and optimizes each of them separately, to train the generator, embedding network, discriminator and recovery network jointly." However, in your code there are five modules, including the "supervisor" network. I am not quite sure about the functionality of the "supervisor".
  2. TimeGAN features the goal to "capture the stepwise conditional distributions in the data" and to match them. Are these conditional distributions computed by means of the MSE losses and the RNN modules? (If so, how? If not, by which part?) Thank you a lot! [Screenshot attached: 2023-03-16 101524]

benearnthof commented 1 year ago

Yes, upon further inspection I forgot to mention the supervisor network, which is another auxiliary component that helps train both the embedding network & the generator. I'd have to step through the code again to see exactly where it is invoked, but we start using the supervisor when we initialize the embedding network.
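
As a minimal sketch of the idea behind the supervisor (the shapes and names are illustrative, not copied from modules_and_training.py): it operates in latent space and is trained with an MSE loss to predict the next latent step from the previous ones, which is where the supervised loss from the paper shows up.

```python
import torch
import torch.nn as nn

# Illustrative supervisor: a small recurrent network operating in latent space.
supervisor = nn.GRU(input_size=24, hidden_size=24, batch_first=True)

# h: latent sequence produced by the embedding network, shape (batch, seq_len, hidden).
h = torch.randn(8, 32, 24)

# The supervisor predicts the next latent step from the current one, so the
# supervised loss compares its output at step t with the embedding at step t+1.
h_hat, _ = supervisor(h)
supervised_loss = nn.functional.mse_loss(h_hat[:, :-1, :], h[:, 1:, :])
```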

The data is modeled with either RNNs, GRUs, or LSTMs, depending on what you specify in the TimeGAN module. You specify the rnn_type of the TimeGAN as one of these three options: https://github.com/benearnthof/TimeGAN/blob/0c8ab7133eb41369ce2b2815e07915fd8651e27f/modules_and_training.py#L29 The default is set to use GRUs.
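
A hypothetical sketch of how such an rnn_type switch can be resolved to a module class (the actual constructor arguments in modules_and_training.py may differ):

```python
import torch.nn as nn

# Map a string choice to the corresponding recurrent module class.
rnn_classes = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}

rnn_type = "gru"  # the repository's default
rnn_layer = rnn_classes[rnn_type](input_size=5, hidden_size=24, batch_first=True)
```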

BaeHann commented 1 year ago

Thank you very much!! I have one last question about TimeGAN. Please forgive my persistence. Could you tell me how the RNN, GRU, or LSTM captures the stepwise conditional distributions in the data? Thanks a lot!

benearnthof commented 1 year ago

I think the architecture that's easiest to understand is the RNN; the other two are conceptually similar but have different implementations. The basic idea is that you use the parameters of the RNN to compute a projection of the inputs at time step 1 and then use the output you get from this as the input for the next step. If you unroll the entire process, this gives you the stepwise distribution, conditioned on the time steps before. Check out the cheatsheet from Stanford here:

https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks

That should help clarify the correspondence between using outputs from one timestep as the input to the next timestep and modeling the stepwise conditional distribution over time.
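
As a hedged illustration of that unrolling (a toy RNN cell, not the repository's code): the hidden state at step t is a function of the previous hidden state and the current input, so each step's output is conditioned on everything that came before.

```python
import torch
import torch.nn as nn

# Toy single-layer RNN cell; purely illustrative, not the repository's implementation.
cell = nn.RNNCell(input_size=5, hidden_size=24)

x = torch.randn(32, 8, 5)            # (seq_len, batch, features)
h = torch.zeros(8, 24)               # initial hidden state

outputs = []
for t in range(x.size(0)):
    # h_t = f(x_t, h_{t-1}): each step's state is conditioned on the previous state,
    # which is how the network models the stepwise conditional structure over time.
    h = cell(x[t], h)
    outputs.append(h)
```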

BaeHann commented 1 year ago

Thank you very much! (^▽^)