EperLuo / scDiffusion

A model developed for the generation of scRNA-seq data
MIT License

Reproducing Fig. 3 #5

Open JeanRadig opened 3 months ago

JeanRadig commented 3 months ago

Hi!

I am trying to reproduce Fig. 3, in particular the UMAP of the muris dataset with the different cell types.

[screenshot: Fig. 3 UMAP]

I have trained the models on the muris dataset and am now trying to get the conditional samples. I do not understand what parameters I should pass to the main function of classifier_sample.py to get the 12 samples needed in the Jupyter notebook script_diffusion_umap.ipynb, at the lines:


```python
index2 = [0,1,2,3,4,5,6,7,8,9,10,11]
for i in index2:
    npzfile = np.load(f'/data1/lep/Workspace/guided-diffusion/output/muriscondi/muris{i}_scimilarity.npz', allow_pickle=True)
```


Could you please let me know how you have done it?

JeanRadig commented 3 months ago
  1. Update train.sh. You can modify train.sh to take the right arguments:

```bash
cd VAE
echo "Training Autoencoder, this might take a long time"
CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir 'path/to/anndata.h5ad' --num_genes 18996 --save_dir 'path/to/saved_VAE_model' --max_steps 200000
echo "Training Autoencoder done"

cd ..
echo "Training diffusion backbone"
CUDA_VISIBLE_DEVICES=0 python cell_train.py --data_dir 'path/to/anndata.h5ad' --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt' \
    --save_dir 'path/to/saved_diffusion_model' --model_name 'my_diffusion' --save_interval 20000
echo "Training diffusion backbone done"

echo "Training classifier"
CUDA_VISIBLE_DEVICES=0 python classifier_train.py --data_dir 'path/to/anndata.h5ad' --model_path "path/to/saved_classifier_model" \
    --iterations 40000 --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt'
echo "Training classifier, done"
```
  2. Guide the user. You can insert some more explanation at the beginning of your notebooks to guide the user. For example, at the top of script_diffusion_umap.ipynb:

Before running the following code, ensure that you have trained the VAE model, the diffusion model, and the classifier model.

Given an AnnData file you wish to analyze, named anndata.h5ad, please follow these steps in your command line to create the necessary input data:

  1. Train VAE:

    • Prior to training, create the folder path/to/saved_VAE_model.
    • In the terminal, navigate to the VAE directory:
      cd VAE
    • Run the following command to train the Autoencoder:
      echo "Training Autoencoder, this might take a long time" 
      CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir 'path/to/anndata.h5ad' --num_genes 18996 --save_dir 'path/to/saved_VAE_model' --max_steps 200000
      echo "Training Autoencoder done"
  2. Train the Diffusion Model:

    • Prior to training, create the folder path/to/saved_diffusion_model.
    • In the terminal, navigate back to the root directory:
      cd ..
    • Run the following command to train the diffusion backbone:
      echo "Training diffusion backbone"
      CUDA_VISIBLE_DEVICES=0 python cell_train.py --data_dir 'path/to/anndata.h5ad'  --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt' \
       --save_dir 'path/to/saved_diffusion_model' --model_name 'my_diffusion' --save_interval 20000
      echo "Training diffusion backbone done"
  3. Train the Classifier:

    • Prior to training, create the folder path/to/saved_classifier_model.
    • Run the following command to train the classifier:
      echo "Training classifier"
      CUDA_VISIBLE_DEVICES=0 python classifier_train.py --data_dir 'path/to/anndata.h5ad' --model_path "path/to/saved_classifier_model" \
       --iterations 40000 --vae_path 'path/to/saved_VAE_model/VAE_checkpoint.pt'
      echo "Training classifier, done"

Once the models are trained, you need to create the .npz files which will serve as input for the following steps. To do this:

  1. Unconditional Sampling (Generate Data from Diffusion Model):

    • Prior to sampling, create the folder path/to/saved_unconditional_sampling.
    • Run the following command to perform unconditional sampling:
      # Unconditional sampling
      python cell_sample.py --model_path "path/to/saved_diffusion_model/checkpoint.pt" --sample_dir "path/to/saved_unconditional_sampling"
  2. Conditional Sampling (Generate Data from Classifier Model):

    • Prior to sampling, create the folder path/to/saved_conditional_sampling.
    • You also need to modify the main() function in classifier_sample.py to create the samples based on your specified condition (see the sketch after this list). NEEDS MORE DOCUMENTATION: PROVIDE CLEAR EXAMPLE
    • Run the following command to perform conditional sampling:
      # Conditional sampling 
      python classifier_sample.py --model_path "path/to/saved_diffusion_model/checkpoint.pt" --classifier_path "path/to/saved_classifier_model/checkpoint.pt" --sample_dir "path/to/saved_conditional_sampling"
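For instance, a minimal sketch of such a modification (not the shipped code; the type count 12 is for muris, so adapt it and the output paths to your dataset):

```python
# Sketch for the bottom of classifier_sample.py: sample each cell type in turn.
if __name__ == "__main__":
    for cell_type in range(12):  # muris has 12 cell types
        main(cell_type=[cell_type])
```

Inside main(), the sample_dir default can embed the condition (e.g. sample_dir=f"output/simulated_samples/muris{cell_type[0]}") so that each cell type is written to its own .npz file.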

Ensure that you replace 'path/to/anndata.h5ad', 'path/to/saved_VAE_model', 'path/to/saved_diffusion_model', and 'path/to/saved_classifier_model' with the actual paths in your system. Additionally, make sure to adjust any other parameters according to your specific setup and requirements.

JeanRadig commented 3 months ago

0001-Modified-train.sh-in-function-of-the-arguments-that-.patch

Sbs12 commented 3 months ago

Hello, does your unconditional generation achieve the results described in the article?

EperLuo commented 3 months ago

Hi @JeanRadig, sorry for the late reply. This work is currently under revision. We will update the code repository to give more comprehensive guidance for the user.

As for the conditional generation problem, you need to change the code in the `__main__` block to:

```python
if __name__ == "__main__":
    # for conditional generation
    # main(cell_type=[2])
    for type in range(12):  # [0,1,2,3,4,5]:
        main(cell_type=[type])
```

to generate all 12 types of cells. And also change `sample_dir=f"output/simulated_samples/muris"` in line 310 to `sample_dir=f"output/simulated_samples/muris{cell_type[0]}"` to save them.

We will update these files in the coming days. Thank you very much for these comments!

JeanRadig commented 3 months ago

@Sbs12, I have not managed to reproduce the results so far. In my attempts, neither the conditional nor the unconditional data generation worked. In particular, the gradient interpolation yields batches of data that are uncorrelated with the true data and indistinguishable from one another on a UMAP. But the authors are making updates, so we need to stay tuned!

@EperLuo, thank you very much for your answer! I indeed managed to find out what to change. I also want to post here the results of my attempt to reproduce the gradient interpolation, which I think is a very cool idea. I did not manage to obtain the desired results; could you tell me why that is?

Using the following data as original:

Original data:

image

And the following as training data.

Training data (original minus the timepoints between D3 and D4.5, i.e. D3.5 and D4):

image
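The training file holds out those timepoints. A minimal sketch of such a filtering step (the source file, the obs column name 'period', and the 'D3.5'/'D4' label strings are assumptions; check adata.obs for the actual ones):

```python
import anndata as ad

# Load the full dataset (path and label details are placeholders)
adata = ad.read_h5ad('/workspace/projects/001_scDiffusion/data/data_in/data/WOT/filted_data.h5ad')
# Drop the held-out timepoints D3.5 and D4
train = adata[~adata.obs['period'].isin(['D3.5', 'D4'])].copy()
train.write_h5ad('/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad')
```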

We train the models.

Training steps

Training the autoencoder (fine-tuned from the pretrained SCimilarity weights)

```bash
cd VAE
echo "Training Autoencoder, this might take a long time"
CUDA_VISIBLE_DEVICES=0 python VAE_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad' --num_genes 19423 --state_dict "/workspace/projects/001_scDiffusion/scripts/scDiffusion/annotation_model_v1" --save_dir '../checkpoint/AE/WOT_VAE' --max_steps 200000 --max_minutes 60
echo "Training Autoencoder done"
```

First, we need to make the following change at the top of VAE_train.py to specify that we are loading the WOT data:

```python
# from guided_diffusion.cell_datasets import load_data
# from guided_diffusion.cell_datasets_sapiens import load_data
from guided_diffusion.cell_datasets_WOT import load_data
# from guided_diffusion.cell_datasets_muris import load_data
```

(Note that when we were dealing with the muris dataset, we imported from cell_datasets_muris instead.)
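As a quick sanity check that --num_genes matches the dataset (a minimal sketch using the anndata API):

```python
import anndata as ad

adata = ad.read_h5ad('/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad')
print(adata.n_vars)  # should print 19423, the value passed as --num_genes
```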

Training the diffusion backbone

```bash
cd ..
echo "Training diffusion backbone"
CUDA_VISIBLE_DEVICES=0 python cell_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad' --vae_path '/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt' \
    --save_dir '/workspace/projects/001_scDiffusion/scripts/output/WOT_diffusion' --model_name 'my_diffusion' --save_interval 20000
echo "Training diffusion backbone done"
```

where at the top of cell_train.py we need to comment/uncomment the imports as follows:

```python
# In cell_train.py

from guided_diffusion import dist_util, logger
# from guided_diffusion.cell_datasets import load_data
from guided_diffusion.cell_datasets_WOT import load_data
# from guided_diffusion.cell_datasets_sapiens import load_data
# from guided_diffusion.cell_datasets_muris import load_data
from guided_diffusion.resample import create_named_schedule_sampler
```

Training the classifier

echo "Training classifier"
CUDA_VISIBLE_DEVICES=0 python classifier_train.py --data_dir '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad' --model_path "/workspace/projects/001_scDiffusion/scripts/output/WOT_classifier" \
    --iterations 100000 --vae_path '/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt'
echo "Training classifier, done"

where at the top of classifier_train.py we need to change the following imports:

```python
# In classifier_train.py

from guided_diffusion import dist_util, logger
from guided_diffusion.fp16_util import MixedPrecisionTrainer
# from guided_diffusion.cell_datasets_pbmc import load_data
from guided_diffusion.cell_datasets_WOT import load_data
# from guided_diffusion.cell_datasets_muris import load_data
# from guided_diffusion.cell_datasets_sapiens import load_data
```

And change the number of classes in:

```python
# In classifier_train.py, function create_argparser()

defaults.update(classifier_and_diffusion_defaults())
defaults['num_class'] = 15  # <----- change the number of classes to 15
parser = argparse.ArgumentParser()
add_dict_to_argparser(parser, defaults)
return parser
```
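To double-check that num_class matches the data, one can count the distinct labels in the AnnData (a sketch; 'period' is a hypothetical column name here, substitute whichever obs column cell_datasets_WOT.py actually uses):

```python
import anndata as ad

adata = ad.read_h5ad('/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad')
# 'period' is a placeholder; inspect adata.obs.columns for the label column
print(adata.obs['period'].nunique())  # should print 15 here
```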

Creating latent spaces

Note that for gradient interpolation, a few more arguments are needed, as can be seen in classifier_sample.py:

```python
# In classifier_sample.py
# if gradient interpolation
ae_dir='output/Autoencoder_checkpoint/WOT/model_seed=0_step=150000.pt',
num_gene=19423,  # WOT 19423
init_time = 600,    # initial noised state if interpolation
init_cell_path = 'data/WOT/filted_data.h5ad',   # input initial noised cell state

sample_dir=f"output/simulated_samples/muris",
start_guide_steps = 500,     # the time to use classifier guidance
filter = False,   # filter the simulated cells that are classified into other conditions; might take a long time
```

The sampling command then becomes:

```bash
# Conditional sampling
python classifier_sample.py --model_path "/workspace/projects/001_scDiffusion/scripts/output/WOT_diffusion/my_diffusion/model200000.pt" --classifier_path "/workspace/projects/001_scDiffusion/scripts/output/WOT_classifier/model099999.pt" --sample_dir "/workspace/projects/001_scDiffusion/scripts/output/WOT_simulated" --ae_dir '/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt' --init_cell_path '/workspace/projects/001_scDiffusion/data/data_in/data/WOT/filted_data.h5ad' --num_gene 19423
```

where we need to change the number of classes, here to 15 (the original 17 timepoints minus the 2 held out, D3.5 and D4):

```python
# In classifier_sample.py, function create_argparser()

defaults.update(model_and_diffusion_defaults())
defaults.update(classifier_and_diffusion_defaults())
defaults['num_class'] = 15  # <--------- change the number of classes to 15
parser = argparse.ArgumentParser()
add_dict_to_argparser(parser, defaults)
return parser
```

We are going to simulate:

```python
if __name__ == "__main__":
    parser = create_argparser()
    args = parser.parse_args()
    save_dir = args.sample_dir
    for i in range(0, 11):
        path_to_save = save_dir + f"{i}"
        to_save = main(cell_type=[6, 7], inter=True, weight=[10 - i, i])
        save_data(to_save, i, path_to_save)
```

We are simulating 11 samples interpolating between cell types 6 (day 3) and 7 (day 4.5) of the filtered AnnData.
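The weight argument sweeps the classifier guidance between the two types; a quick look at the schedule used above:

```python
# Interpolation schedule: i=0 is purely cell type 6 (day 3),
# i=10 is purely cell type 7 (day 4.5).
for i in range(0, 11):
    print(i, [10 - i, i])
```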

Visualising results

Decoding the data also requires changing num_genes to 19423 in load_VAE():

```python
def load_VAE():
    autoencoder = VAE(
        num_genes=19423,
        device='cuda',
        seed=0,
        loss_ae='mse',
        hidden_dim=128,
        decoder_activation='ReLU',
    )
    autoencoder.load_state_dict(torch.load('/workspace/projects/001_scDiffusion/scripts/scDiffusion/checkpoint/AE/WOT_VAE/model_seed=0_step=163856.pt'))
    return autoencoder
```
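To compare generated and real cells on a joint UMAP, I use something along these lines (a sketch with scanpy; gene_expr is assumed to hold the decoded generated expression, with genes in the same order as the training data):

```python
import anndata as ad
import scanpy as sc

# gene_expr: decoded generated cells (assumed variable), same gene order as the training data
gen = ad.AnnData(X=gene_expr)
real = ad.read_h5ad('/workspace/projects/001_scDiffusion/data/data_in/data/WOT/train_WOT.h5ad')
gen.var_names = real.var_names
gen.obs['source'] = 'generated'
real.obs['source'] = 'real'

both = ad.concat([gen, real])
sc.pp.neighbors(both)   # scanpy computes a PCA representation automatically if needed
sc.tl.umap(both)
sc.pl.umap(both, color='source')
```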

I obtain:

image

But they all overlap:

image

And they do not correlate with the training data:

image

What modifications should I make to obtain the desired results?

EperLuo commented 3 months ago

Hi, very glad you're so interested in our work! The training process looks correct, but to further check the training and inference pipeline, could you try generating the known cells (e.g. all cell types from day 0 to day 8) and verify that they overlap with the real cells in the UMAP? This would help confirm that the model is being used properly.

Besides, I do recommend using exp_script/script_diffusion_interpolation.ipynb to reproduce the results in the paper. The UMAP can look different under different cell numbers: if there are far more generated cells than real cells, the UMAP may appear uncorrelated. In this script we set the number of generated cells to the same order of magnitude as the number of real cells, and added real known cells to the decoder input to balance out the effect of the layer norm in SCimilarity. I think this might help you reproduce the results.
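In other words, if you generated many more cells than there are real ones, subsampling the generated set before computing the UMAP may already help. A minimal sketch (gen_cells and real_cells are placeholder arrays of expression values):

```python
import numpy as np

# gen_cells / real_cells: placeholder ndarrays of generated and real expression values
rng = np.random.default_rng(0)
n = min(real_cells.shape[0], gen_cells.shape[0])
idx = rng.choice(gen_cells.shape[0], size=n, replace=False)
gen_balanced = gen_cells[idx]  # now the same order of magnitude as the real cells
```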