How to repeat the experiments in your paper

hathawayxxh commented 3 years ago

Hi Kevin,

I am new to scRNA deconvolution. I notice that there are many interesting experiments in your paper. However, from the code and the datasets you provided, I can't work out how to reproduce your experimental results (the current code and web tool are more like tools for people to use on their own datasets). For example, the first experiment leaves one PBMC dataset out for validation and uses the others as training data. I guess all four PBMC datasets are mixed in the pbmc_data.h5ad you provided. How can I split the dataset to perform the experiments?

Besides, for the real tissue datasets PBMC1 and PBMC2, can you provide the processed datasets used in your experiments? I don't know which file to download. I would appreciate it if you could give detailed instructions on how to reproduce your experiments and provide all the datasets used in your paper.

Thanks a lot for your time.

Best, Xiaohan

KevinMenden commented 3 years ago

Dear Xiaohan,

Scaden has a --train_datasets option (for scaden train), to which you pass a comma-separated list of datasets. This way you can decide which datasets you want to use for training. This functionality was basically implemented for these experiments.

If you want to repeat these experiments, I would encourage you to start from the raw data. I did not do any special processing to the data (PBMC1 and PBMC2), other than merging some cell types together (described in the paper). If you want to generate some datasets yourself, you can also have a look at the processing scripts I provided on figshare: https://figshare.com/projects/Scaden/62834

Some of the datasets used in the study are under some sort of restricted access or I got them directly from the authors - so I can't just share them, sorry. I know some of the datasets are a bit tricky to find, so just let me know where exactly you are struggling and I might be able to help you out.

Cheers, Kevin

hathawayxxh commented 3 years ago

Hi Kevin,

Thanks a lot for your reply. I have tried the option "scaden --training_datasets", but it reports the following error: [error output not included]

I have downloaded the datasets from the web tool, but the train-test separation is not clear. For example, if I want to train the model using "data6k, data8k, donorA" and test it on "donorC", what command should I use?

Best, Xiaohan

KevinMenden commented 3 years ago

Hi Xiaohan,

It should be usable when calling scaden train. For instance, when you call scaden train --help, you get:

     ____                _            
    / ___|  ___ __ _  __| | ___ _ __  
    \___ \ / __/ _` |/ _` |/ _ \ '_ \ 
     ___) | (_| (_| | (_| |  __/ | | |
    |____/ \___\__,_|\__,_|\___|_| |_|

Usage: scaden train [OPTIONS] <training data>

  Train a Scaden model

Options:
  --train_datasets TEXT  Comma-separated list of datasets used for training.
                         Uses all by default.

  --model_dir TEXT       Path to store the model in
  --batch_size INTEGER   Batch size to use for training. Default: 128
  --learning_rate FLOAT  Learning rate used for training. Default: 0.0001
  --steps INTEGER        Number of training steps
  --seed INTEGER         Set random seed
  --help                 Show this message and exit.

You would indicate the datasets you want to use for training as you did, and store the model with --model_dir. Then you can use this model directory during prediction, using scaden predict --model_dir <your_model_dir>:

Usage: scaden predict [OPTIONS] <prediction data>

  Predict cell type composition using a trained Scaden model

Options:
  --model_dir TEXT  Path to trained model
  --outname TEXT    Name of predictions file.
  --seed INTEGER    Set random seed
  --help            Show this message and exit.

Let me know if that works!

hathawayxxh commented 3 years ago

Hi Kevin,

I used the command "scaden train "./pbmc_data.h5ad" --train_datasets 'data6k, data8k, donorA' --steps 5000 --model_dir model", and the option is finally accepted. However, it reports another error: [error output not included]

Do you know how to solve it?

KevinMenden commented 3 years ago

Did you run scaden process beforehand?

hathawayxxh commented 3 years ago

Is the following command correct?

scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/"

It runs into another error: [error output not included]

KevinMenden commented 3 years ago

Not quite, you need to point it to the file (expression matrix) you want to run prediction on:

Usage: scaden process [OPTIONS] <training dataset to be processed> <data for
                      prediction>

  Process a dataset for training

Options:
  --processed_path TEXT  Path of processed file. Must end with .h5ad
  --var_cutoff FLOAT     Filter out genes with a variance less than the
                         specified cutoff. A low cutoff is recommended,this
                         should only remove genes that are obviously
                         uninformative.

  --help                 Show this message and exit.

Have a look at the demo with example data simulation that I provide in the README.md for all the steps you need to do to perform training and prediction. It also generates example data which you can inspect to check if your data is formatted correctly:

https://github.com/KevinMenden/scaden/blob/master/README.md
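
The three steps (process, train, predict) can be sketched as a small Python helper that only assembles the command lines. This is a minimal sketch and not part of Scaden: the function name and the default file names are placeholders, and it uses only options that appear in the help output quoted in this thread (--processed_path, --train_datasets, --steps, --model_dir, --outname). The argument order follows the usage lines shown above.

```python
import shlex


def build_scaden_commands(training_h5ad, prediction_txt, datasets,
                          processed="processed.h5ad", model_dir="model",
                          steps=5000, outname="predictions.txt"):
    """Return the process/train/predict command lines as argv lists.

    Hypothetical helper for illustration; it does not run anything.
    """
    # The dataset names are passed as one comma-separated string.
    dataset_arg = ",".join(datasets)
    return [
        # 1) Process the training data together with the data to predict on.
        ["scaden", "process", "--processed_path", processed,
         training_h5ad, prediction_txt],
        # 2) Train on the processed file, restricted to the chosen datasets.
        ["scaden", "train", processed, "--train_datasets", dataset_arg,
         "--steps", str(steps), "--model_dir", model_dir],
        # 3) Predict with the stored model.
        ["scaden", "predict", "--model_dir", model_dir,
         "--outname", outname, prediction_txt],
    ]


if __name__ == "__main__":
    commands = build_scaden_commands(
        "pbmc_data.h5ad", "donorC_500_samples.txt",
        ["data6k", "data8k", "donorA"],
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Keeping each command as a list makes it easy to pass to subprocess.run later without worrying about shell quoting.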

hathawayxxh commented 3 years ago

Hi Kevin,

In my understanding, the "data simulation" part is only useful when I need to generate a new training dataset. Is that right? I want to train with 'data6k, data8k, donorA' and test with 'donorC', and I think all of these datasets are contained in the 'pbmc_data.h5ad' file. So should I use "scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad""? I am a bit confused.

KevinMenden commented 3 years ago

Ahh I see. No, you need a text file containing the gene expression. Have a look at the dataset I shared earlier and download this one: https://figshare.com/articles/software/Publication_Figures/8234030

For the simulated PBMC data there are some expression matrices inside. Specifically: paper_data_v3/figures/figure2/data

hathawayxxh commented 3 years ago

Hi Kevin,

It finally works. Thanks a lot for your help. I will reimplement the experiments and compare them with the results in your paper.

Best, Xiaohan

KevinMenden commented 3 years ago

Perfect, let me know if you encounter any other issues! :)

hathawayxxh commented 3 years ago

Hi Kevin,

Sorry for bothering you again. I have trained a model with "data6k, data8k, donorA" and tested it on "donorC_500_samples.txt". However, the result is quite different from the one in your paper. The following is my result on the donorC dataset: [image: my_donorC_metric]

My operations are:

preprocess data: scaden process "./pbmc_data.h5ad" "./donorC_500_samples.txt"
train model: scaden train "processed.h5ad" --train_datasets 'data6k, data8k, donorA' --steps 5000 --model_dir model
test model: scaden predict --model_dir model donorC_500_samples.txt

Is there anything wrong with my commands? Maybe the training data and testing data are not matched in their distributions?

Looking forward to your reply.

Best, Xiaohan

KevinMenden commented 3 years ago

Dear Xiaohan,

Nice that you could get it running. The results look pretty normal at first glance, but you're right that the RMSE should probably be a bit lower and the CCC higher. How did you calculate those values? It would be nice to also calculate them over all data points, and not only by cell type (which is the main metric we used). And maybe run it for another one of those datasets. I'll replicate those steps myself when I get to it, to make sure nothing is wrong with the training dataset. But I'm not sure I will get around to doing this in the next few days.
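
For reference, the two ways of computing the metrics can be sketched as follows. This is a minimal sketch and not the code from the paper's notebooks: it assumes the true and predicted fractions are given as lists of samples with one value per cell type, and it uses the standard definitions of RMSE and Lin's concordance correlation coefficient (CCC). The function names are made up for illustration.

```python
import math


def rmse(truth, pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def ccc(truth, pred):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(truth)
    mean_t, mean_p = sum(truth) / n, sum(pred) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def evaluate(truth_rows, pred_rows, cell_types):
    """Compute both metrics over all data points and per cell type.

    truth_rows and pred_rows are lists of samples; each sample is a list of
    fractions in the order given by cell_types.
    """
    # All data points: flatten every (sample, cell type) pair into one list.
    flat_t = [v for row in truth_rows for v in row]
    flat_p = [v for row in pred_rows for v in row]
    overall = {"rmse": rmse(flat_t, flat_p), "ccc": ccc(flat_t, flat_p)}

    # Per cell type: compare one column at a time.
    per_type = {}
    for j, name in enumerate(cell_types):
        col_t = [row[j] for row in truth_rows]
        col_p = [row[j] for row in pred_rows]
        per_type[name] = {"rmse": rmse(col_t, col_p), "ccc": ccc(col_t, col_p)}
    return overall, per_type
```

The two views can differ noticeably: a cell type with a narrow range of fractions can have a low per-type CCC even when the values over all data points agree well.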

Best, Kevin

KevinMenden commented 3 years ago

Could you show me the output of Scaden after you type scaden train ... ? It usually tells you which datasets were used for training. I have a feeling that it didn't use all the datasets but just one, since you didn't supply a plain comma-separated list but one that also contains white spaces. So if you could try again using 'data6k,data8k,donorA' instead of 'data6k, data8k, donorA', that would be great :) Let me know if that helps!
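
The effect of the spaces can be illustrated with plain Python string handling. This is only a sketch of the general pitfall, not Scaden's actual parsing code: splitting on commas alone keeps the leading spaces, so ' data8k' and 'data8k' are different names. The helper name below is hypothetical.

```python
def parse_dataset_list(raw):
    """Split a comma-separated string and strip whitespace around each name."""
    return [name.strip() for name in raw.split(",") if name.strip()]


if __name__ == "__main__":
    raw = "data6k, data8k, donorA"
    # A naive split keeps the space in front of the second and third name.
    print(raw.split(","))
    # Stripping each entry gives the intended dataset names.
    print(parse_dataset_list(raw))
    # A clean command-line argument without spaces after the commas.
    print(",".join(parse_dataset_list(raw)))
```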

hathawayxxh commented 3 years ago

Hi Kevin,

Thanks for your reply. The information looks correct. It seems the model is trained on the three datasets. But I will try to remove the white spaces as you suggested.

For the calculation of the metrics, I used the code you provided in scaden_paper_data_v3\figures\figure2\fig2_comparison_plots.ipynb. I also used this code to evaluate the results you provided at scaden_paper_data_v3\figures\figure2\scaden_predictions\scaden_predictions_donorC.txt and got metrics very close to those in your paper: [image: paper_donorC_metric]. Thus, I think the metric computation is all right. I will run experiments on other datasets.

Best, Xiaohan

hathawayxxh commented 3 years ago

Yesterday this error went away, although I don't know why, but today it has come back. According to solutions I found on the internet, it is related to GPU memory usage (my GPU has 32 GB of memory). I have tried several methods, but none of them solved the problem. Could you check the code to see whether something is occupying the GPU memory? Thanks a lot.

KevinMenden commented 3 years ago

Hi, if you have a look at the "training on" message, you can see that the white spaces were kept as part of the dataset names, so those datasets probably weren't used for training. So it would be good to check again with the proper dataset list.

Sorry, I have never encountered that error before, and I have been testing Scaden a lot with a 6 GB GPU ... there really isn't anything special in the code that could lead to this (not that I can think of, at least).

hathawayxxh commented 3 years ago

Hi Kevin,

Thanks for your reply. Now the problems are solved and I can get results very close to those reported in your paper.

Thanks a lot.

Best, Xiaohan

KevinMenden commented 3 years ago

Awesome!

I already made a new issue to remind me that I should add a warning if datasets are supplied which are not part of the training datasets. That's very easy to miss ....
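
Such a check could look roughly like the following. This is a hypothetical sketch and not Scaden's implementation: the function name, and the assumption that the available dataset names are known as a list, are made up for illustration.

```python
import warnings


def select_datasets(requested, available):
    """Return the requested names that exist; warn about the ones that do not."""
    known = [name for name in requested if name in available]
    unknown = [name for name in requested if name not in available]
    if unknown:
        # repr() makes stray whitespace such as ' data8k' visible in the message.
        warnings.warn(
            "Datasets not found in the training data and ignored: "
            + ", ".join(repr(name) for name in unknown)
        )
    return known


if __name__ == "__main__":
    available = ["data6k", "data8k", "donorA", "donorC"]
    print(select_datasets(["data6k", " data8k", " donorA"], available))
```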

Best, Kevin

KevinMenden / scaden

How to repeat the experiments in your paper #77