JeffersonLab / jlab_datascience_exp_hall

MIT License

Bug with train_step inheritance #12

Open sgoldenCS opened 1 week ago

sgoldenCS commented 1 week ago

Trevor noticed that calling train() on the outer GAN model was invoking the inner GAN's train_step().

We need to override the train_step() of the models inside TF_CGAN, but my class directly inherited from TF_CGAN. Since TF_CGAN isn't a Keras model, it has no train_step() of its own, so my method wasn't actually "overriding" anything. For this to work, I think we need a way to change which Keras model type TF_CGAN uses. This issue will attempt to fix the bug. To make merging the changes easier for Trevor, this branch is based off of #10 instead of the main branch.
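A minimal sketch of the idea, assuming hypothetical class names (TF_CGAN_Keras and the constructor parameter keras_model_cls are illustrative, not the repo's actual API): make the Keras model class injectable so a subclass with a real train_step() override can be substituted.

```python
import tensorflow as tf

class TF_CGAN_Keras(tf.keras.Model):
    """Stand-in for the actual Keras model inside TF_CGAN (inner GAN)."""
    def train_step(self, data):
        # ... inner GAN training logic would go here ...
        return {"loss": 0.0}

class OuterGAN_Keras(TF_CGAN_Keras):
    """Subclassing the real Keras model means this train_step() genuinely
    overrides the inner one, unlike subclassing the workflow wrapper."""
    def train_step(self, data):
        # ... outer GAN training logic would go here ...
        return {"loss": 1.0}

class TF_CGAN:
    """Workflow 'model' (not a Keras model itself). The Keras class is now
    a constructor argument instead of being hard-coded."""
    def __init__(self, keras_model_cls=TF_CGAN_Keras):
        self.model = keras_model_cls()
```

With this shape, `TF_CGAN(keras_model_cls=OuterGAN_Keras).model.fit(...)` would dispatch to the outer GAN's train_step() rather than the inner one's.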

sgoldenCS commented 1 week ago

Quick Progress Update

I have made changes to the inheritance so that the outerGAN train_step() is called correctly; however, this has introduced some new bugs. Specifically, calling the predict() method on the saved unfolding TF_CGAN innerGAN model fails with the following error:

  File "/Users/sgolden/Documents/GitHub/jlab_datascience_exp_hall/Hall_B/AIDAPT/aidapt_toolkit/models/tf_cgan_v0.py", line 607, in predict
    return self.generator.predict([data, noise], batch_size=1024)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

To make the outerGAN train_step() override the standard one, the class now inherits from TF_CGAN_Keras (the actual Keras model) instead of TF_CGAN (the workflow "model"). As a result, it is no longer a workflow model, which required a few changes:

To Dos:

tgreed86 commented 1 day ago

I think I've found what's causing the error with the predict function. This built-in Keras method is only meant to be used in eager mode; inside train_step, TensorFlow runs in graph mode, which produces the error. I resolved this by creating a new predict function that simply calls self.generator directly.

However, this leads to other problems with the shapes of the input tensors to the generator. The datasets the models are trained on have five variables: sppim, spipm, tpip, alpha, and s. All five are passed to the inner GAN generator, but the s variable is dropped, so only the other four variables are returned from the generator. The inner GAN generator, though, still expects the 5-variable dataset. Since the sequence of procedures is:

Outer GAN Generator $\rightarrow$ Inner GAN Generator

the inner GAN generator receives input of shape (None, 4) while it expects shape (None, 5). To resolve this, I'm going to add another build_generator method in tf_cgan_v0.py to handle the truncated dataset. This is likely just a temporary fix; we can determine the best way to approach this going forward.
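The two fixes described above can be sketched as follows. This is a hypothetical illustration, not the actual tf_cgan_v0.py code: the layer sizes, the helper name graph_safe_predict, and the builder signatures are all assumptions; only the 5-variable input, the dropped s, and the 4-variable output follow the description above.

```python
import tensorflow as tf

def build_generator(n_vars=5, latent_dim=100):
    """Builder for the full 5-variable dataset (sppim, spipm, tpip, alpha, s).
    Layer sizes here are placeholders, not the real architecture."""
    data_in = tf.keras.Input(shape=(n_vars,))
    noise_in = tf.keras.Input(shape=(latent_dim,))
    x = tf.keras.layers.Concatenate()([data_in, noise_in])
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(4)(x)  # s is dropped: only 4 variables returned
    return tf.keras.Model([data_in, noise_in], out)

def build_generator_truncated(latent_dim=100):
    """Temporary fix: accept the 4-variable output of the outer GAN generator,
    so the inner generator's expected input matches what it actually receives."""
    return build_generator(n_vars=4, latent_dim=latent_dim)

def graph_safe_predict(generator, data, noise):
    """Replacement for generator.predict(...): calling the model directly
    works under graph-mode tracing inside train_step, where Model.predict()
    (an eager-only convenience method) fails."""
    return generator([data, noise], training=False)
```

The direct call `generator([data, noise], training=False)` is the standard way to run a Keras model inside a tf.function-traced train_step; predict() manages its own iteration loop and is documented for eager, batch-level inference only.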