Experiments on real bulk data

hathawayxxh commented 3 years ago

Hi Kevin,

I conducted experiments on real bulk data (setting train_datasets to 'data6k,data8k,donorA,donorC' and test_dataset to 'sdy67'). I noticed that the 'sdy67' dataset is already included in the "pbmc_data.h5ad" file. Should I use these data directly, without preprocessing? I got unsatisfactory results (much lower than the metrics you reported in the paper): Can you suggest what is wrong with my experiments? Thanks a lot.

Best, Xiaohan

KevinMenden commented 3 years ago

Hi @hathawayxxh ,

yes that doesn't seem right. Could you expand a little on what you have been doing? I would need the exact command you used to run Scaden and also how you calculated those output metrics.

You can use the pbmc_data.h5ad file also for evaluating on sdy67, as long as you don't specify this dataset as training dataset.

Cheers, Kevin

hathawayxxh commented 3 years ago

Hi Kevin,

Thanks for your reply. When I run Scaden, it always reports the following error:

So I rewrote the algorithm in PyTorch. Experiments on the simulated data work correctly (e.g., train_datasets = [data8k,donorA,donorC], test_dataset = [data6k]) and give results similar to those in your paper. However, the result is incorrect when I use all simulated data for training and the sdy67 dataset (included in pbmc_data.h5ad) for testing. In this case, I set train_datasets = [data6k,data8k,donorA,donorC] and test_dataset = [sdy67]. Everything else is unchanged from the simulation experiments.

KevinMenden commented 3 years ago

Hi Xiaohan,

Okay, could you still send me the exact commands you used when trying to run the TF version of Scaden? Because there shouldn't be any error.

Nice that you implemented a pytorch version! Thought about doing this myself as I like it more than TF but stayed away from it for now. It's impossible for me to tell what the reason for that failing is if you used your own code, sorry. There could be a lot of things going wrong here. If you could point to your code I might have a look at it and see if I can find some issues.

Cheers, Kevin

hathawayxxh commented 3 years ago

Hi Kevin,

My command for simulation experiment is: scaden train "processed_for_donorC.h5ad" --train_datasets 'data6k,data8k,donorA' --steps 5000 --model_dir model

Then, it reports the error:

Best, Xiaohan

KevinMenden commented 3 years ago

Thanks - and the scaden process command? Also, did you use the pip version or the container? What system are you running on?

hathawayxxh commented 3 years ago

Hi Kevin,

The problem might be caused by the TF version. Which version did you use in your experiments?

KevinMenden commented 3 years ago

In the initial implementation, Tensorflow 1.10.0 was used, but I have updated to Tensorflow 2 since. So everything from > v2.0.0 should work actually.

hathawayxxh commented 3 years ago

I found the former error was caused by tf version == 2.3.0:

Then I downgraded the tf version to 2.0.0. Scaden then runs correctly, but reports another error when saving models: ValueError: Model <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f0ec82b6400> cannot be saved because the input shapes have not been set. Usually, input shapes are automatically determined from calling .fit() or .predict(). To manually set the shapes, call model._set_inputs(inputs).

hathawayxxh commented 3 years ago

Hi Kevin,

Did you perform any preprocessing before the experiments on real bulk data? I checked the genes of the PBMC2 (GSE107011) cohort and found that only 9547 genes overlap with the genes in the simulated data. Should I preprocess the data to match the genes before training the network?
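
The overlap check described here can be sketched in a few lines of Python. This is only an illustration: the function name `gene_overlap` and the example gene lists are hypothetical, and matching gene symbols case-insensitively is an assumption of this sketch, not something the thread confirms Scaden does.

```python
def gene_overlap(train_genes, bulk_genes):
    """Return the training genes that also occur in the bulk file.

    Assumption of this sketch: gene symbols are compared
    case-insensitively, and the training order is kept.
    """
    bulk_set = {g.upper() for g in bulk_genes}
    return [g for g in train_genes if g.upper() in bulk_set]


# Hypothetical gene lists, for illustration only.
train_genes = ["CD3E", "CD19", "NKG7", "MS4A1"]
bulk_genes = ["cd3e", "MS4A1", "GAPDH"]

shared = gene_overlap(train_genes, bulk_genes)
print(len(shared), "of", len(train_genes), "training genes found in the bulk file")
```

A small overlap relative to the training gene set is a hint that the bulk file uses different identifiers or has been filtered.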

KevinMenden commented 3 years ago

Hi @hathawayxxh ,

yes please :) that's why I was asking you about the process steps you have performed.

If you don't run scaden process, it won't work. This step ensures that the genes overlap with the prediction data (among other things).

hathawayxxh commented 3 years ago

Yes, I understand that the experiment with GSE107011 requires preprocessing, but the SDY67 dataset is already contained in the "pbmc_data.h5ad" file. I guess the genes in SDY67 are already matched with the simulated data, so it does not need the preprocessing procedure?

KevinMenden commented 3 years ago

That's correct, however you should still pre-process it. The preprocessing also normalizes the count data, which is essential. The data stored in this training dataset is not normalized.
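
As a rough illustration of the normalization meant here, the sketch below applies a log2(x + 1) transform followed by min-max scaling to a single sample. Later in the thread this step is referred to as "log_min_max". The function name `log_min_max`, the choice to scale each sample separately, and the example counts are assumptions of this sketch, not a copy of Scaden's code.

```python
import math


def log_min_max(sample):
    """Scale one sample: log2(x + 1), then min-max to the range [0, 1].

    Assumption of this sketch: scaling is done per sample.
    """
    logged = [math.log2(x + 1.0) for x in sample]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # A constant sample carries no information; map it to zeros.
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]


# Hypothetical raw counts for three genes in one sample.
counts = [0.0, 3.0, 15.0]
print(log_min_max(counts))  # → [0.0, 0.5, 1.0]
```

A model trained on values in [0, 1] and then fed raw counts sees inputs on a completely different scale, which is one plausible way to end up with very low agreement on test data.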

hathawayxxh commented 3 years ago

Thanks. I used GSE107011 to pre-process the file and got preprocessed.h5ad: scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/private_xiaohanxing/sc_deconv/bulk_data/GSE107011_exp.txt" In this process, all samples in the .h5ad file are normalized (including the SDY67 samples). However, using this file, the test performance on SDY67 is still unsatisfactory (CCC=0.15). Did I miss anything?

KevinMenden commented 3 years ago

GSE107011 is the Monaco et al. dataset. SDY67 is a different dataset. So I'm guessing your labels are wrong?

hathawayxxh commented 3 years ago

The labels are correct.

KevinMenden commented 3 years ago

I'm sorry but you are saying the test performance is bad on SDY67 and at the same time you say you use GSE107011 for processing, which is clearly the wrong file for processing if you want to evaluate performance on SDY67. If you say the labels are correct then what are you using for processing?

KevinMenden commented 3 years ago

I'm getting pretty confused here. It would really help if you gave me a proper description of what you have been doing/are doing, as I have now asked for a few times. That is:

what system are you running on
how did you install it
how did you prepare the data
what commands did you use
etc.

I'm assuming you are still running your pytorch version and I'm afraid I cannot help you with that. You should be able to reproduce the results with the Scaden version I have written here, but I need full details if it doesn't work in order to help you. Probably not today anymore though :-) But I'll have another look at this at the weekend.

hathawayxxh commented 3 years ago

Hi Kevin,

Thanks for your help. I still cannot run your codes correctly. The detailed information is:

Ubuntu system, CUDA Version: 10.1. tf_version: 2.3.0. python: 3.6.8.
I installed Scaden with pip3 install scaden.
For the simulated data experiments, I prepared the data with scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/test_data/donorC_500_samples.txt".
I used the command scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA' --steps 5000 --model_dir model
However, the model runs into an error: Then I downgraded the tf version to 2.0.0. Scaden then runs correctly, but reports another error when saving models: ValueError: Model <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f0ec82b6400> cannot be saved because the input shapes have not been set. Usually, input shapes are automatically determined from calling .fit() or .predict(). To manually set the shapes, call model._set_inputs(inputs).
For the real bulk data, I preprocessed the data with scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/private_xiaohanxing/sc_deconv/bulk_data/GSE107011_exp.txt"
Then I used the command scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA,donorC' --steps 5000 --model_dir model. It runs into the same error as the simulated experiments.
Using the preprocessed data, I ran experiments with my PyTorch code. On the simulated data, it got performance comparable to your code. However, on the real bulk data, the performance is bad (SDY67 CCC = 0.15, GSE107011 CCC = 0.40).

so my question is:

Do you know how to fix the errors I get when running your code? I have searched a lot on the internet. It seems to be a problem with the TF and CUDA versions, but I have tried many approaches and failed to solve it.
Maybe I should use the Docker image you provided, so that it is not affected by the local environment?
It is strange that my PyTorch code achieves performance similar to your code on simulated data but much worse results on the real bulk data. I guess there might be something wrong with my data preprocessing. Could you give me some guidance on the commands for the real data experiments?

Thanks a lot.

KevinMenden commented 3 years ago

Hi @hathawayxxh ,

thanks for the explanation, that clears up things quite a bit!

Those errors are weird and I will see if I can reproduce them. But I haven't encountered them before. Definitely try the docker image in the meantime. I will absolutely try and see if I can somehow repeat this and hopefully fix this for you. Might take a while though.

Regarding your pytorch implementation - yes it is weird that you get good performance on the simulated data but not on the bulk data. But it is impossible for me to tell you why this is the case without looking at the code and trying to reproduce it. There are countless points where it could have failed, from data pre-processing, training, or even some differences between pytorch and tensorflow (although that shouldn't make such a difference). The implementation might just be different. So I'm afraid I can't help you out on that one!

But I will look into the issues with my package.

hathawayxxh commented 3 years ago

Hi Kevin,

Thanks for your reply and help.

Best, Xiaohan

hathawayxxh commented 3 years ago

Hi Kevin,

We re-ran the bulk data experiments with the CPU docker that you provided. The procedure is:

Download the PBMC training dataset for testing from https://figshare.com/s/e59a03885ec4c4d8153f into the current workspace.
Preprocess the training data (pbmc_data.h5ad) and the test data (GSE107011_exp.txt): delete some genes based on their variance in the test data, then apply log_min_max normalization to the training data, and save the retained gene names in the model_dir/genes.txt file, so that only the expression levels of these genes are selected in the predict stage. This operation saves the preprocessed training data in processed.h5ad. scaden process "pbmc_data.h5ad" "GSE107011_exp.txt"
Train the network with all simulated cohorts, or with the simulated cohorts + PBMC1. scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA,donorC' --steps 5000 --model_dir model or scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA,donorC,sdy67' --steps 5000 --model_dir model
Test on the GSE dataset and generate the prediction file 'scaden_predictions.txt'. scaden predict --model_dir model GSE107011_exp.txt
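
The sequence of commands above can be assembled programmatically, which makes it harder to process against one bulk file and then predict on another. This is a sketch only: the helper name `scaden_commands` is hypothetical, and the subcommands and flags are limited to the ones quoted in this thread. It only builds and prints the command lines and does not run Scaden.

```python
import shlex


def scaden_commands(training_h5ad, bulk_file, train_datasets,
                    steps=5000, model_dir="model"):
    """Build the process/train/predict command lines as argument lists.

    The same bulk_file is used for 'process' and 'predict', so the
    training data are always prepared against the file being predicted.
    """
    process = ["scaden", "process", training_h5ad, bulk_file]
    train = ["scaden", "train", "processed.h5ad",
             "--train_datasets", ",".join(train_datasets),
             "--steps", str(steps),
             "--model_dir", model_dir]
    predict = ["scaden", "predict", "--model_dir", model_dir, bulk_file]
    return [process, train, predict]


commands = scaden_commands(
    "pbmc_data.h5ad",
    "GSE107011_exp.txt",
    ["data6k", "data8k", "donorA", "donorC"],
)
for cmd in commands:
    # Quote each argument so the printed line is safe to paste in a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to execute the step, assuming the `scaden` executable is on the PATH.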

Then, we computed the metrics for the predictions. For the model trained by simulated datasets, the CCC on PBMC2 dataset is 0.6038 (while the result in your paper is 0.68). For the model trained by simulated datasets and PBMC1, the CCC on PBMC2 dataset is 0.7273 (while the result in your paper is 0.86).
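
For reference, the concordance correlation coefficient (CCC) quoted here can be computed as in the sketch below, which uses Lin's standard definition with population variances. Whether the values are pooled over all samples and cell types or computed per cell type is not stated in this thread, and that choice changes the number, so the pooling in this sketch is an assumption. The function name `ccc` and the example values are hypothetical.

```python
def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient for two equal-length lists.

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2),
    using population (divide-by-n) variance and covariance.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        # Both inputs are constant and equal; CCC is undefined here.
        return float("nan")
    return 2 * cov / denom


# Hypothetical cell fractions: predictions are perfectly correlated
# with the truth but shifted, which CCC penalizes.
truth = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]
print(round(ccc(truth, shifted), 4))
```

Unlike Pearson correlation, CCC drops when predictions are systematically offset or rescaled, so a scaling mismatch in preprocessing can lower it even when the ranking of samples is right.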

Could you help check the preprocessing and training procedure? Was the number of training steps set to 5000 in your paper? Are there any steps that I missed?

Besides, the PBMC2 dataset I used in the experiments is attached. Could you help check whether this data file is the same as the data that you used in the paper? GSE107011_exp.txt

Thanks a lot, and I look forward to your reply.

Best, Xiaohan

KevinMenden commented 3 years ago

Hi Xiaohan,

that looks much better, although it still doesn't reach the performance we could get in the paper ... On first glance, this looks alright. Did you also include the monaco (PBMC2) dataset for prediction on PBMC1?

I'm pretty busy this week so I probably won't be able to look into this. Hopefully next week though!

Best, Kevin

KevinMenden commented 3 years ago

Ah sorry, I misread: you only tested on the PBMC2 (monaco) dataset and didn't get the matching results. Okay, I'll have a look at this when I find some time.

The results are not terribly off, so I'm currently assuming it might just be some slight pre-processing difference. Which doesn't mean that's ideal - it should be more robust. I'll have a look at the file you provided. Thanks!

KevinMenden commented 3 years ago

Hi, I had a quick look, it's not exactly the same, the file that I used had more genes in it: monaco_samples.txt

could you try with this? Make sure to modify nothing manually, just run scaden process pbmc_data.h5ad monaco_samples.txt

The results won't be exactly the same, I introduced code to make it fully reproducible only lately (slight variability during training and prediction is expected due to non-reproducible operations on the GPU and random keys), but it should be very close.

hathawayxxh commented 3 years ago

Hi Kevin,

Thanks a lot for your reply. I will try with this data.

Best, Xiaohan
