Open JeanRadig opened 3 months ago
cd VAE
echo "Training Autoencoder, this might take a long time"
CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir 'path/to/anndata.h5ad' --num_genes 18996 --save_dir 'path/to/saved_VAE_model' --max_steps 200000
echo "Training Autoencoder done"
cd ..
echo "Training diffusion backbone"
CUDA_VISIBLE_DEVICES=0 python cell_train.py --data_dir 'path/to/anndata.h5ad' --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt' \
--save_dir 'path/to/saved_diffusion_model' --model_name 'my_diffusion' --save_interval 20000
echo "Training diffusion backbone done"
echo "Training classifier"
CUDA_VISIBLE_DEVICES=0 python classifier_train.py --data_dir 'path/to/anndata.h5ad' --model_path "path/to/saved_classifier_model" \
--iterations 40000 --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt'
echo "Training classifier, done"
Before running the following code, ensure that you have trained the VAE model, the diffusion model, and the classifier model.
Given an AnnData file you wish to analyze, named anndata.h5ad
, please follow these steps in your command line to create the necessary input data:
Train VAE:
path/to/saved_VAE_model
.VAE
directory:
cd VAE
echo "Training Autoencoder, this might take a long time"
CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir 'path/to/anndata.h5ad' --num_genes 18996 --save_dir 'path/to/saved_VAE_model' --max_steps 200000
echo "Training Autoencoder done"
Train the Diffusion Model:
path/to/saved_diffusion_model
.cd ..
echo "Training diffusion backbone"
CUDA_VISIBLE_DEVICES=0 python cell_train.py --data_dir 'path/to/anndata.h5ad' --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt' \
--save_dir 'path/to/saved_diffusion_model' --model_name 'my_diffusion' --save_interval 20000
echo "Training diffusion backbone done"
Train the Classifier:
path/to/saved_classifier_model
.echo "Training classifier"
CUDA_VISIBLE_DEVICES=0 python classifier_train.py --data_dir 'path/to/anndata.h5ad' --model_path "path/to/saved_classifier_model" \
--iterations 40000 --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt'
echo "Training classifier, done"
Once the models are trained, you need to create the .npz
files which will serve as input for the following steps. To do this:
Unconditional Sampling (Generate Data from Diffusion Model):
path/to/saved_unconditional_sampling
.# Unconditional sampling
python cell_sample.py --model_path "path/to/saved_diffusion_model/checkpoint.pt" --sample_dir "path/to/saved_unconditional_sampling"
Conditional Sampling (Generate Data from Classifier Model):
path/to/saved_conditional_sampling
.main()
function in classifier_sample.py
to create the samples based on your specified condition. NEEDS MORE DOCUMENTATION: PROVIDE CLEAR EXAMPLE# Conditional sampling
python classifier_sample.py --model_path "path/to/saved_diffusion_model/checkpoint.pt" --classifier_path "path/to/saved_classifier_model/checkpoint.pt" --sample_dir "path/to/saved_conditional_sampling"
Ensure that you replace 'path/to/anndata.h5ad'
, 'path/to/saved_VAE_model'
, 'path/to/saved_diffusion_model'
, and 'path/to/saved_classifier_model'
with the actual paths in your system. Additionally, make sure to adjust any other parameters according to your specific setup and requirements.
Hello, can your unconditional generation effect achieve the effect described in the article?
Hi @JeanRadig, sorry for the late reply. This work is currently under revision. We will update the code repository to give a more comprehensive guidance for the user.
As for the conditional generation problem, you need to change the code in the main() to : `if name == "main":
# main(cell_type=[2])
for type in range(12):#[0,1,2,3,4,5]:
main(cell_type=[type])
to generate all 12 types of cells. And also change the
sample_dir=f"output/simulated_samples/muris",in line 310 to
sample_dir=f"output/simulatedsamples/muris{celltype[0]}",` to save them.
We would update these files these days. Thank you very much for those comments!
@Sbs12, I do not manage to reproduce the results as of now. For what I have been trying lately, neither the conditional nor the unconditional data generation worked. In particular, the gradient interpolation yields batches of data that are uncorrelated to the true data, and undistinguishable from one another on a UMAP. But the authors are making updates so we need to stay tuned!
@EperLuo, thank you very much for your answer! I indeed managed to find out what to change. I also wanted to post here the results of my trial to reproduce the gradient interpolation, which I think is a very cool idea. I did not manage to obtain the desired results, could you tell me why it is so?
Using the following data as original:
And the following as training data.
We train the models.
cd VAE
echo "Training Autoencoder, this might take a long time"
CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad' --num_genes 19423 --state_dict "/workspace/projects/001_scDiffusion/scripts/scDiffusion/annotation_model_v1" --save_dir '../checkpoint/AE/WOT_VAE' --max_steps 200000 --max_minutes 60
echo "Training Autoencoder done"
First, we need to do the following change on top of VAE_train.py to specify we are loading WOT data:
# from guided_diffusion.cell_datasets import load_data
# from guided_diffusion.cell_datasets_sapiens import load_data
from guided_diffusion.cell_datasets_WOT import load_data
# from guided_diffusion.cell_datasets_muris import load_data
(note that when we were dealing with muris dataset it was muris).
cd ..
echo "Training diffusion backbone"
CUDA_VISIBLE_DEVICES=0 python cell_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad' --vae_path '/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt' \
--save_dir '/workspace/projects/001_scDiffusion/scripts/output/WOT_diffusion' --model_name 'my_diffusion' --save_interval 20000
echo "Training diffusion backbone done"
Where on top of cell_train.py we need to uncomment/comment as follows:
# In cell_train.py
from guided_diffusion import dist_util, logger
# from guided_diffusion.cell_datasets import load_data
from guided_diffusion.cell_datasets_WOT import load_data
# from guided_diffusion.cell_datasets_sapiens import load_data
# from guided_diffusion.cell_datasets_muris import load_data
from guided_diffusion.resample import create_named_schedule_sampler
echo "Training classifier"
CUDA_VISIBLE_DEVICES=0 python classifier_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad' --model_path "/workspace/projects/001_scDiffusion/scripts/output/WOT_classifier" \
--iterations 100000 --vae_path '/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt'
echo "Training classifier, done"
Where on top of classifier_train.py we need to change following command:
# In classifier_train.py
from guided_diffusion import dist_util, logger
from guided_diffusion.fp16_util import MixedPrecisionTrainer
# from guided_diffusion.cell_datasets_pbmc import load_data
from guided_diffusion.cell_datasets_WOT import load_data
# from guided_diffusion.cell_datasets_muris import load_data
# from guided_diffusion.cell_datasets_sapiens import load_data
And change the number of classes in:
# In classifier_train.py, function create_argparser()
defaults.update(classifier_and_diffusion_defaults())
defaults['num_class']= 15 # <----- Change the number of classes to 15
parser = argparse.ArgumentParser()
add_dict_to_argparser(parser, defaults)
return parser
Note that for scDiffusion, a few more arguments are needed, as can be found in classifier_sample.py:
# In classifier_sample.py
# if gradient interpolation
ae_dir='output/Autoencoder_checkpoint/WOT/model_seed=0_step=150000.pt',
num_gene=19423, # WOT 19423
init_time = 600, # initial noised state if interpolation
init_cell_path = 'data/WOT/filted_data.h5ad', #input initial noised cell state
sample_dir=f"output/simulated_samples/muris",
start_guide_steps = 500, # the time to use classifier guidance
filter = False, # filter the simulated cells that are classified into other condition, might take long time
# Conditional sampling
python classifier_sample.py --model_path "/workspace/projects/001_scDiffusion/scripts/output/WOT_diffusion/my_diffusion/model200000.pt" --classifier_path "/workspace/projects/001_scDiffusion/scripts/output/WOT_classifier/model099999.pt" --sample_dir "/workspace/projects/001_scDiffusion/scripts/output/WOT_simulated" --ae_dir '/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt' --init_cell_path '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/filted_data.h5ad' --num_gene 19423
Where we need to change the number of classes to here 15 (17 - 2, e.g. 3.5 and 4):
# In classifier_sample.py, funciton create_argparser()
defaults.update(model_and_diffusion_defaults())
defaults.update(classifier_and_diffusion_defaults())
defaults['num_class']=15 # <--------- Change the number of classes to 15
parser = argparse.ArgumentParser()
add_dict_to_argparser(parser, defaults)
return parser
We are going to simulate:
if __name__ == 'main':
parser = create_argparser()
args = parser.parse_args()
save_dir = args.sample_dir
for i in range(0,11):
path_to_save = save_dir + f"{i}"
to_save = main(cell_type=[6,7], inter=True, weight=[10-i,i])
save_data(to_save, i, path_to_save)
We are simulating 11 samples between the cell types 6 (3) and 7 (4.5) of the filtered anndata.
Decoding the data necessitates to also change the num_genes to 19423 in:
def load_VAE():
autoencoder = VAE(
num_genes=19423,
device='cuda',
seed=0,
loss_ae='mse',
hidden_dim=128,
decoder_activation='ReLU',
)
autoencoder.load_state_dict(torch.load('/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt'))
return autoencoder
I obtain:
But they all overlap:
And they do not correlate with the training data:
Hi, very glad you're so interested in our work! The training process seems correct, but to further determine the training and inference process, could you try generating the known cells (like all cell types from day 0 to day 8) and check if these cells can overlap with the real cells in the UMAP? This could help to confirm the proper usage of the model.
Besides, I do recommend you use the exp_script/script_diffusion_interpolation.ipynb to reproduce the results in the paper. The UMAP visual results might be different under different cell numbers. If the generated cells are way more than the real cells, the UMAP might be looked at as incorrelated. We set the number of generated cells in this script so that it is of the same order of magnitude as the number of real cells, and added real known cells to the decoder to balance out the effect of the layernorm in SCimilarity. I think this might help you to reproduce the results.
Hi!
I am trying to reproduce Fig. 3, in particular the umap of muris with the different cell types.
I have trained the models on the muris dataset. I try to get the conditional samples. I do not understand what parameters I should pass in the function main of classifier_sample.py to get the 12 samples, as needed in Jupyter notebook "script_diffusion_map", lines:
index2 = [0,1,2,3,4,5,6,7,8,9,10,11] for i in index2: npzfile=np.load(f'/data1/lep/Workspace/guided-diffusion/output/muriscondi/muris{i}_scimilarity.npz',allow_pickle=True)
Could you please let me know how you have done it?