NUS-HPC-AI-Lab / Neural-Network-Parameter-Diffusion

We introduce a novel approach for parameter generation, named neural network parameter diffusion (p-diff), which employs a standard latent diffusion model to synthesize a new set of parameters.

Parameter Autoencoder Cannot Converge #27

Open · zhanglijun95 opened this issue 1 week ago

zhanglijun95 commented 1 week ago

Hi Authors,

Thanks for your great work. Recently I've been trying to train the parameter autoencoder (the ODEncoder2Decoder class in your module/modules/encoder.py) on my own parameter training set, which has 700 data points, each a parameter vector of length (in_dim) about 900,000. The parameters are normalized before training. I ran into the following issues.
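For reference, my data preparation looks roughly like this (my own code, not from the repo; the toy models below just stand in for my 700 checkpoints):

```python
import torch
import torch.nn as nn

# Sketch of my data preparation. Each training point is one flattened
# parameter vector; my real set has 700 vectors of ~900,000 entries each.
def flatten_params(model: nn.Module) -> torch.Tensor:
    return torch.cat([p.detach().flatten() for p in model.parameters()])

trained_models = [nn.Linear(8, 4) for _ in range(700)]   # placeholders for my checkpoints
data = torch.stack([flatten_params(m) for m in trained_models])  # (700, in_dim)

# Global normalization before autoencoder training.
mean, std = data.mean(), data.std()
data = (data - mean) / (std + 1e-8)
```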

  1. I first found that the autoencoder does not really converge on my dataset. The train loss (MSE) settles around 0.92, which is not bad. But when I pick one parameter vector from my training set and reconstruct it with the trained autoencoder, the reconstruction is poor: the original parameters give an accuracy of 0.95, while the reconstructed ones give only 0.85, which is a lower bound in my task.

  2. Then I ran a sanity check: I picked a single parameter vector and tried to overfit the autoencoder on it, with the input noise and latent noise disabled. It doesn't overfit. I also reduced in_dim to 2048 and tried MSE(reduction='sum'), but it still cannot overfit.

  3. I also tried a simple autoencoder with only one linear layer in the encoder and one in the decoder. This autoencoder can successfully overfit a single 2048-dim parameter vector, and it also overfits at a larger dimension (about 50,000). A minimal sketch of this check follows this list.
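Here is roughly what the overfitting check in items 2-3 looks like in my code (a minimal sketch; the dimensions, learning rate, and step count are my own choices, and the random vector stands in for one normalized parameter vector):

```python
import torch
import torch.nn as nn

# Minimal sanity check: can an autoencoder memorize ONE parameter vector?
torch.manual_seed(0)
in_dim, latent_dim = 2048, 256
x = torch.randn(1, in_dim)            # stand-in for one normalized parameter vector

# The one-linear-layer encoder/decoder from item 3.
encoder = nn.Linear(in_dim, latent_dim)
decoder = nn.Linear(latent_dim, in_dim)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(5000):
    opt.zero_grad()
    recon = decoder(encoder(x))       # no input noise, no latent noise
    loss = nn.functional.mse_loss(recon, x)
    loss.backward()
    opt.step()

print(loss.item())  # drives to ~0 here, but not when I swap in ODEncoder2Decoder
```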

I'm wondering whether you have any clue why it cannot even overfit a single sample. How can I train a good autoencoder for such a large parameter dimension? I also looked into the Cond P-diff GitHub repo, since the Cond P-diff paper mentions training on 1,179,648-dim LoRA parameters, which is quite large, but it seems that Cond P-diff uses the same autoencoder as P-diff.

Any suggestion is really appreciated! Thank you!

Best, Lijun

MTDoven commented 1 week ago

We're sorry that you've encountered this problem.

First, your parameter size is a bit large for p-diff. You need to make sure that your latent_dim is large enough to hold all of this information; this is very important. In our experience, for 900,000 parameters, latent_dim needs to be 2048 or more.

Second, you can try removing the last Tanh layer of the encoder and replacing it with a LayerNorm: `nn.Tanh()` --> `nn.LayerNorm(latent_dim, elementwise_affine=False)`. In subsequent exploration, we found that the Tanh hinders the convergence of the AE.
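Concretely, the swap at the tail of the encoder looks like this (a simplified sketch, not the exact layer layout of ODEncoder2Decoder; hidden_dim is a placeholder for whatever feeds the final projection in your encoder):

```python
import torch.nn as nn

latent_dim = 2048   # for ~900,000 parameters, use 2048 or more
hidden_dim = 4096   # placeholder for the width feeding the final projection

# before: the encoder ends with a Tanh that bounds the latent
# encoder_tail = nn.Sequential(nn.Linear(hidden_dim, latent_dim), nn.Tanh())

# after: a non-affine LayerNorm over the latent instead of Tanh
encoder_tail = nn.Sequential(
    nn.Linear(hidden_dim, latent_dim),
    nn.LayerNorm(latent_dim, elementwise_affine=False),
)
```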

We have recently been adjusting some of the details and will release a more stable and easier-to-use version in the near future. You can try the new version then.

zhanglijun95 commented 1 week ago

Thank you for your kind response! I have a follow-up question about the first point: I also tried to overfit on a single 2048-dim parameter vector, but it still cannot converge. Do you have any clue why? And what do you think would be good enc_channel_list and dec_channel_list settings for me?

I'm really looking forward to the new version! If possible, would you mind sharing a few details of the new autoencoder structure? I can't wait to try it in my codebase. I really appreciate your help!