EperLuo / scDiffusion

A model developed for the generation of scRNA-seq data
MIT License

Gene mapping between user-provided scRNA and genes from SCimilarity during vae training #10

Open humengying0907 opened 1 week ago

humengying0907 commented 1 week ago

When using the pre-trained weights from SCimilarity, how does VAE_train.py account for the different genes between the user-provided adata and SCimilarity? What if a gene is present in the user's data but not in SCimilarity? Or what if a gene is present in SCimilarity but not in the user's data?

There is indeed a num_genes parameter in VAE_train.py that controls the dimension of the VAE so that it fits the user-provided scRNA-seq data, but I don't see it exerting any control over the gene order. When I try to reproduce the VAE training step with this command:

CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/tabula_muris/all.h5ad' --num_genes 18996 --state_dict "/workspace/projects/001_scDiffusion/scripts/scDiffusion/annotation_model_v1" --save_dir '../checkpoint/AE/my_VAE' --max_steps 200000 --max_minutes 600

I get loss reports that stay around 0.04:

step 0 loss 0.21746787428855896
step 1000 loss 0.04769279062747955
step 2000 loss 0.048065099865198135
step 3000 loss 0.04667588323354721
step 4000 loss 0.045960813760757446

Could you please provide some clarification or a possible solution? Thank you so much!

EperLuo commented 1 week ago

Hi! The num_genes parameter controls the shape of the first layer of the encoder and the last layer of the decoder, which you can see in VAE_model.py, line 52. This means the first layer of the encoder does not use the weights from SCimilarity (nor does the last layer of the decoder); its size matches the user's input gene set.
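
For illustration, the usual pattern for this kind of partial transfer is to copy only the pretrained parameters whose shapes still match and leave the resized layers at their random initialization. A minimal sketch (not the exact code in VAE_train.py; `model` and `ckpt_path` are placeholders):

```python
import torch

def load_matching_weights(model, ckpt_path):
    """Copy pretrained weights only for parameters whose shapes match.

    Layers whose size depends on --num_genes (the first encoder layer and
    the last decoder layer) keep their random initialization and are
    trained from scratch on the user's gene set, so no explicit gene
    mapping to the SCimilarity gene list is required.
    """
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own_state = model.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in own_state and v.shape == own_state[k].shape}
    own_state.update(matched)
    model.load_state_dict(own_state)
    return sorted(matched)  # names of the layers that reused SCimilarity weights
```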

The loss seems to drop normally. Have you tried finishing the training process and checking the reconstruction result?

humengying0907 commented 1 week ago

Thanks. Yes, I have tried both the pre-trained model from SCimilarity and a VAE without pre-trained weights; my VAE loss is always around 0.04, even near the end of training:

step 194000 loss 0.042236313223838806
step 195000 loss 0.03814302757382393
step 196000 loss 0.043102916330099106
step 197000 loss 0.04345700144767761
step 198000 loss 0.042990997433662415
step 199000 loss 0.039488837122917175

Is this actually expected for the WOT data?

Additionally, what classifier accuracy is considered "good enough"? For the WOT data, the training accuracy at the end of training is 0.164, which is far from perfect. Is this expected as well? What train_acc should we expect in general?

| grad_norm      | 0.544    |
| param_norm     | 101      |
| samples        | 1.28e+07 |
| step           | 9.99e+04 |
| train_acc@1    | 0.164    |
| train_acc@1_q0 | 0        |
| train_acc@1_q1 | 0        |
| train_loss     | 2.42     |
| train_loss_q0  | 3.6      |
| train_loss_q1  | 2.59     |

Thank you so much!

EperLuo commented 1 week ago

The training loss looks normal; this likely indicates that the model has converged. If you still have concerns, you can use the trained model to reconstruct the data and check whether the reconstruction matches the training data.
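
For example, a quick check could look like the sketch below (it assumes the trained autoencoder's forward pass returns the reconstruction and that the data are preprocessed the same way as during training; this is not the repo's evaluation code):

```python
import numpy as np
import scanpy as sc
import torch

def reconstruction_check(autoencoder, h5ad_path):
    """Encode/decode the training data and report a simple similarity metric."""
    adata = sc.read_h5ad(h5ad_path)
    X = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
    x = torch.tensor(np.asarray(X), dtype=torch.float32)

    autoencoder.eval()
    with torch.no_grad():
        recon = autoencoder(x).numpy()  # assumes forward() returns the reconstruction

    # correlation of per-gene mean expression between input and reconstruction
    corr = np.corrcoef(x.numpy().mean(axis=0), recon.mean(axis=0))[0, 1]
    print(f"Pearson correlation of per-gene means: {corr:.3f}")
    return corr
```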

As for the classifier, since the training samples are noised, the training accuracy can be very low, which is in line with expectations. An accuracy of 0.164 should be normal, but I'm sorry I can't give you a specific target. I think it's good enough as long as it is clearly higher than random choice.
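
For reference, "random choice" here is roughly one over the number of classes, so the comparison is just:

```python
num_classes = 12                     # hypothetical: number of condition/cell-type labels
random_baseline = 1.0 / num_classes  # ~0.083 for 12 classes
train_acc = 0.164                    # value from the log table above
print(train_acc > random_baseline)   # True -> classifier is clearly better than chance
```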

humengying0907 commented 1 week ago

Thank you! I have another general question about the VAE model. Technically, it is just an autoencoder, not a VAE, since we only optimize the reconstruction loss and not the KL divergence. Is there any advantage to using an autoencoder over a VAE, given that we don't rely heavily on the pre-trained weights from SCimilarity?

Additionally, given that we are using the entire dataset for training, how can we avoid overfitting?

Thank you!

EperLuo commented 6 days ago

For the first question: since the diffusion model has no strict requirements on the distribution of its input training data, there is no need to constrain the features in the hidden space with a KL loss or other variational loss; the original distribution of the encoder output is fine as is. Although we didn't run an experiment to prove it, I do think an extra loss in the latent space would make the reconstruction less effective.
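
To make the distinction concrete, here is a minimal sketch of the two objectives in plain PyTorch (illustrative, not the repo's training loop): the autoencoder used here optimizes reconstruction only, while a VAE would add a KL term pulling the latent code toward a standard normal.

```python
import torch
import torch.nn.functional as F

def ae_loss(x, recon):
    # plain autoencoder: reconstruction only, latent distribution left unconstrained
    return F.mse_loss(recon, x)

def vae_loss(x, recon, mu, logvar, beta=1.0):
    # VAE: reconstruction + KL(q(z|x) || N(0, I)); the KL term competes with
    # reconstruction quality, which is the extra cost mentioned above
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return F.mse_loss(recon, x) + beta * kl
```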

As for the overfitting problem: if the goal is to augment existing data, overfitting might not be a serious issue. But for out-of-distribution data generation, you do need a validation set to verify that (which we didn't do, because we didn't observe overfitting). You can split off a part of the dataset (kept completely apart from the rest) to validate the model, as sketched below.
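
A minimal sketch of such a held-out split with AnnData (the file names and the 10% ratio are placeholders):

```python
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("all.h5ad")
rng = np.random.default_rng(0)

# hold out 10% of cells, kept completely apart from VAE/diffusion training
val_mask = np.zeros(adata.n_obs, dtype=bool)
val_mask[rng.choice(adata.n_obs, size=int(0.1 * adata.n_obs), replace=False)] = True

adata[~val_mask].copy().write_h5ad("train.h5ad")  # pass to VAE_train.py --data_dir
adata[val_mask].copy().write_h5ad("val.h5ad")     # monitor reconstruction loss here
```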

humengying0907 commented 4 days ago

Thank you so much!