dnguyen1196 opened 4 years ago
hmmm, this looks properly weird; some of these are worse than tossing a coin. I'll dig into the code later tonight or tomorrow. But first, let's brainstorm some sanity-check experiments. For example, if you have a small training set, you should be able to overfit it.
also what does the output graph look like?
from the 0.75-and-plateau curve I was suspecting maybe there is some sort of leak. like you said, if you always predict carbon you can hit 0.75, so did the model do this?
another experiment is to turn off the KL term and keep just the reconstruction loss. in that case it should be really easy to reproduce the input.
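To make that concrete: a minimal sketch of an ELBO with a weight on the KL term, so it can be switched off for the sanity check. This is illustrative only, not the actual pinot loss; all names here are assumptions:

```python
import torch
import torch.nn.functional as F

def elbo_loss(adj_pred, adj_true, node_logits, node_true, mu, logvar, beta=1.0):
    """Negative ELBO with a weight `beta` on the KL term.
    Set beta=0.0 to train on the reconstruction loss alone."""
    # Edge reconstruction: BCE between predicted edge probabilities
    # and the true (float, 0/1) adjacency matrix.
    edge_loss = F.binary_cross_entropy(adj_pred, adj_true)
    # Node reconstruction: cross entropy against integer atom-type labels
    # (the report below describes binary_cross_entropy against one-hot labels).
    node_loss = F.cross_entropy(node_logits, node_true)
    # KL(q(z|g) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return edge_loss + node_loss + beta * kl
```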
Thanks, will try what you suggested.
> also what does the output graph look like?
I didn't directly draw/plot the predicted graphs, but it does seem like the model is predicting a mix of both positive (present) and negative (absent) edges.
> from the 0.75 and plateau curve I was suspecting maybe there is some sort of leak. like you said if you always do carbon you can hit 0.75 so did the model do this?
Yeah, I ran a few experiments where I printed out the actual predicted labels, and they're all 6 (Carbon).
Generative model
Generative model implementation: https://github.com/choderalab/pinot/blob/master/pinot/generative/torch_gvae/model.py
Encoder: One input graph `g` with node features. For each node's feature vector:

1. A first layer maps the node features to a `hidden_dim1`-dimensional vector.
2. A second layer maps that to a `hidden_dim2`-dimensional vector.
3. Two parallel layers map the output of 2 to `hidden_dim3`-dimensional vectors. One outputs `mu`, the mean of the approximate posterior distribution over latent node representations; the other outputs `logvar`, the log variance of the approximate posterior distribution over latent node representations.

Decoder: Two separate decoders are used:

4. One predicts a "soft" adjacency matrix `A'`. "Soft" here means that `A'_ij` is the predicted probability that there is an edge between nodes i and j. It uses the latent representations sampled from the output of 3 and computes `A'_ij = z_i^T z_j`.
5. The other maps each latent node representation to a `num_atom_types`-dimensional vector and applies softmax. This output is used for node type classification.

The loss function is the negative ELBO. It is composed of the expected log likelihood term and the KL divergence. The KL term is as usual. The log likelihood term is composed of two terms: one is `binary_cross_entropy` between the true adjacency matrix and the output of 4; the other is `binary_cross_entropy` between the true node class (as a one-hot vector) and the output of 5.

Loss function: https://github.com/choderalab/pinot/blob/master/pinot/generative/torch_gvae/loss.py
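For concreteness, here is a minimal sketch of the architecture as described above, to pair with the loss sketch earlier in the thread. Plain linear layers stand in for whatever graph layers the actual model uses, and all names are illustrative, not the pinot code:

```python
import torch
import torch.nn as nn

class GVAESketch(nn.Module):
    """Encoder steps 1-3 and decoder steps 4-5 from the description above."""

    def __init__(self, in_dim, hidden_dim1, hidden_dim2, hidden_dim3, num_atom_types):
        super().__init__()
        # Steps 1-2: node features -> hidden_dim1 -> hidden_dim2.
        self.layer1 = nn.Linear(in_dim, hidden_dim1)
        self.layer2 = nn.Linear(hidden_dim1, hidden_dim2)
        # Step 3: two parallel heads on the output of step 2,
        # giving mu and logvar of the approximate posterior.
        self.mu_head = nn.Linear(hidden_dim2, hidden_dim3)
        self.logvar_head = nn.Linear(hidden_dim2, hidden_dim3)
        # Step 5: map each latent node representation to atom-type logits.
        self.node_classifier = nn.Linear(hidden_dim3, num_atom_types)

    def forward(self, x):  # x: (n_nodes, in_dim) node features
        h = torch.relu(self.layer2(torch.relu(self.layer1(x))))
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        # Reparameterization: z ~ N(mu, exp(logvar)).
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Step 4: soft adjacency matrix, A'_ij = sigmoid(z_i^T z_j).
        adj_pred = torch.sigmoid(z @ z.t())
        # Step 5: atom-type logits (softmax / cross entropy applied in the loss).
        node_logits = self.node_classifier(z)
        return adj_pred, node_logits, mu, logvar
```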
Data
For these experiments, I used `esol`, which has about 1100 molecules. I did a 0.9 training / 0.1 testing split.

Metrics
Experiment: https://github.com/choderalab/pinot/blob/master/scripts/generative/gvae_exp.py
Right now, the metrics I have implemented and used are: true positive rate for edge prediction, true negative rate for edge prediction and accuracy for node classification.
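A sketch of how these metrics could be computed for a single graph (my reconstruction, not the experiment script's actual code):

```python
import torch

def edge_and_node_metrics(adj_pred, adj_true, node_logits, node_true):
    # Threshold the soft adjacency matrix at 0.5 to get hard edge predictions.
    edge_pred = adj_pred > 0.5
    pos, neg = adj_true == 1, adj_true == 0
    tpr = edge_pred[pos].float().mean()      # fraction of real edges recovered
    tnr = (~edge_pred[neg]).float().mean()   # fraction of absent edges recovered
    # Node classification accuracy from the argmax over atom-type logits.
    node_acc = (node_logits.argmax(dim=-1) == node_true).float().mean()
    return tpr.item(), tnr.item(), node_acc.item()
```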
Hyper-parameters
The hyper-parameters I focused on in these experiments are `hidden_dim1`, `hidden_dim2`, `hidden_dim3`, the number of epochs, and the batch size. Not really knowing a good place to start, I tried a large combination of hidden dimensions where each hidden dimension is one of `[256, 128, 64]`. Batch size is one of `10, 25, 50, 100`. The number of training epochs is 100 or 200.

Step size: 0.001
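The sweep is just the Cartesian product of these choices; schematically (an illustration, not the actual experiment script, and `run_experiment` is a hypothetical driver):

```python
from itertools import product

hidden_dims = [256, 128, 64]
batch_sizes = [10, 25, 50, 100]
epoch_choices = [100, 200]

# Every combination of the three hidden dimensions, batch size, and epochs.
for h1, h2, h3 in product(hidden_dims, repeat=3):
    for batch_size, n_epochs in product(batch_sizes, epoch_choices):
        config = dict(hidden_dims=[h1, h2, h3], batch_size=batch_size,
                      n_epochs=n_epochs, lr=0.001)
        # run_experiment(config)  # hypothetical
```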
Some observations
A representative run: hidden dimensions `[256, 256, 256]`, batch size `100`, and `200` epochs. The true positive and true negative rates for edge prediction are about 0.5. So the model does predict some edges as present and some as absent, but it is just really bad at it. The accuracy for node classification is about 0.75 (because the model always predicts that the atom type is Carbon).
For example, this is for `hidden_dimensions = [64, 64, 64]` with `batch_size=25` and `n_epochs=100`:
And this result is for `hidden_dimensions = [64, 256, 64]` with `batch_size=10` and `n_epochs=200`:
Any suggestions? I think the most concerning thing is that the model only predicts/outputs one type of atom. However, I'm not sure how to approach investigating this further. I can start by looking at the types of samples that are drawn in step 3. Let me know if some of the steps don't make sense or are wrong, too.
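One concrete starting point: tally the argmax-predicted atom types directly from the decoder output (a diagnostic sketch using the illustrative names from the model sketch above):

```python
import torch

def predicted_type_counts(node_logits):
    """Histogram of argmax-predicted atom types; a healthy model
    should put mass on more types than just Carbon (index 6)."""
    preds = node_logits.argmax(dim=-1)
    return torch.bincount(preds, minlength=node_logits.shape[-1])
```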
Update 5/29/2020
After talking with Yuanqing and doing some further experiments, we came across some surprising things.
Firstly, as a follow-up to the experiments described above for the generative model, we did further experiments with the loss function. We wanted to see why the edge prediction accuracy is so low and why the sampled node types are all Carbon. We experimented with a loss function that does not have the KL term in the ELBO; this loss function only has the two cross-entropy terms associated with node and edge prediction. We observed that the results can vary substantially. Sometimes we would get high accuracy (~90%) for edge prediction (both test and train) and low accuracy (~10%) for node prediction. At other times, we would get roughly 50% accuracy for both edge prediction and node prediction. When we reintroduce the KL term, we of course get the sort of results outlined previously: the edge prediction accuracy is around 50% and the node prediction accuracy is around 75%.
We suspected that there was a bug in our generative model implementation. Therefore, we experimented with a very simple model that is not even an auto-encoder but simply a 2-layer neural network. Both layers are linear layers. The first layer's input dimension is 117 (the feature dimension) and its output dimension is 64. The second layer's output dimension is 100 (used for node type classification). We ran this for 200 epochs, using the Adam optimizer with step size 0.001. We used `cross_entropy` for the loss. We observed that the training node prediction accuracy reaches a maximum of around 75% before decreasing to around 60% as the learning algorithm converges. When we print the actual node types being generated, we at least see that the model produces some diversity in node types (not all are 6 / Carbon).

These surprising results imply that the node accuracy results for the generative model might be expected given the choice of the loss function.
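For reference, the simple two-layer model from this experiment looks roughly like this (a sketch of the setup described above; the variable names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two linear layers: 117-dim node features -> 64 -> 100 atom-type logits.
model = nn.Sequential(nn.Linear(117, 64), nn.Linear(64, 100))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train_step(x, node_true):
    """One optimization step with cross_entropy on the node-type logits."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), node_true)
    loss.backward()
    optimizer.step()
    return loss.item()
```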