Closed hathawayxxh closed 2 years ago
Hi @hathawayxxh ,
yes that doesn't seem right. Could you expand a little on what you have been doing? I would need the exact command you used to run Scaden and also how you calculated those output metrics.
You can use the pbmc_data.h5ad file for evaluating on sdy67 as well, as long as you don't specify this dataset as a training dataset.
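To make the train/evaluation split concrete: selecting datasets from a combined file boils down to filtering samples by their dataset label. A minimal pure-Python sketch (the `dataset` field name is a hypothetical stand-in for however the h5ad file tags its samples):

```python
# Hypothetical sample records; in the real file each sample carries a dataset tag.
samples = [
    {"id": "s1", "dataset": "data6k"},
    {"id": "s2", "dataset": "data8k"},
    {"id": "s3", "dataset": "sdy67"},
]

train_datasets = {"data6k", "data8k", "donorA", "donorC"}

# Samples whose tag is in train_datasets are used for training;
# sdy67 is held out and only used for evaluation.
train = [s for s in samples if s["dataset"] in train_datasets]
held_out = [s for s in samples if s["dataset"] not in train_datasets]
```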
Cheers, Kevin
Hi Kevin,
Thanks for your reply. When I run Scaden, it always reports the following error:
So I rewrote this algorithm as a PyTorch version. Experiments on the simulated data are correct (e.g., train_datasets = [data8k,donorA,donorC], test_dataset = [data6k]), yielding results similar to those in your paper. However, the result is incorrect when I use all simulated data for training and the sdy67 dataset (included in pbmc_data.h5ad) for testing. In this case, I set train_datasets = [data6k,data8k,donorA,donorC] and test_dataset = [sdy67]. All other parts are kept unchanged from the simulation experiments.
Hi Xiaohan,
Okay, could you still send me the exact commands you used when trying to run the TF version of Scaden? Because there shouldn't be any error.
Nice that you implemented a pytorch version! Thought about doing this myself as I like it more than TF but stayed away from it for now. It's impossible for me to tell what the reason for that failure is if you used your own code, sorry. There could be a lot of things going wrong here. If you could point me to your code I might have a look at it and see if I can find some issues.
Cheers, Kevin
Hi Kevin,
My command for the simulation experiment is: scaden train "processed_for_donorC.h5ad" --train_datasets 'data6k,data8k,donorA' --steps 5000 --model_dir model
Then, it reports the error:
Best, Xiaohan
Thanks - and the scaden process command?
Also, did you use the pip version or the container? What system are you running on?
Hi Kevin,
The problem might be caused by the TF version. Which version did you use in your experiments?
In the initial implementation, Tensorflow 1.10.0 was used, but I have updated to Tensorflow 2 since. So everything from > v2.0.0 should work actually.
I found the former error was caused by tf version == 2.3.0:
Then I downgraded the TF version to 2.0.0. Scaden runs correctly, but reports another error when saving models: ValueError: Model <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f0ec82b6400> cannot be saved because the input shapes have not been set. Usually, input shapes are automatically determined from calling .fit() or .predict(). To manually set the shapes, call model._set_inputs(inputs).
Hi Kevin,
I want to know whether you performed any preprocessing before the experiments on real bulk data. I checked the genes of the PBMC2 (GSE107011) cohort and found that only 9547 genes overlap with the genes in the simulated data. Should I preprocess the data to match the genes before training the network?
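The gene-overlap check described above amounts to a set intersection followed by subsetting both matrices to the shared genes. A minimal sketch with placeholder gene lists (not the actual preprocessing code in scaden):

```python
def overlapping_genes(bulk_genes, sim_genes):
    """Return the sorted list of genes shared by bulk and simulated data."""
    return sorted(set(bulk_genes) & set(sim_genes))

# Placeholder gene symbols for illustration only.
bulk = ["CD3D", "CD8A", "MS4A1", "NKG7"]
sim = ["CD3D", "MS4A1", "GNLY", "NKG7"]

shared = overlapping_genes(bulk, sim)
# Both expression matrices would then be subset to `shared` (in the same
# order) before training and prediction.
```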
Hi @hathawayxxh ,
yes please :) that's why I was asking you about the process steps you have performed. If you don't run scaden process, it won't work. This step ensures that the genes overlap with the prediction data (among other things).
Yes, I understand the experiment with GSE107011 should use preprocessing, but the SDY67 dataset is already contained in the "pbmc_data.h5ad" file. I assume the genes in SDY67 are already matched with the simulated data, so does it still need the preprocessing procedure?
That's correct; however, you should still pre-process it. The preprocessing also normalizes the count data, which is essential. The data stored in this training dataset is not normalized.
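For intuition on that normalization step: per the Scaden paper, counts are log2-transformed and each sample is scaled to [0, 1]. A hedged numpy sketch of that idea (the actual scaden process implementation may differ in detail):

```python
import numpy as np

def normalize_sample(counts):
    """Log2-transform one sample's counts and min-max scale to [0, 1].

    Assumption: this mirrors the per-sample scaling described in the
    Scaden paper; it is an illustration, not the library's exact code.
    """
    x = np.log2(counts + 1.0)
    lo, hi = x.min(), x.max()
    if hi == lo:  # guard against constant samples
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

sample = np.array([0.0, 3.0, 15.0, 255.0])
norm = normalize_sample(sample)  # scaled to [0, 1]
```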
Thanks, I used GSE107011 to pre-process the file and got preprocessed.h5ad.
scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/private_xiaohanxing/sc_deconv/bulk_data/GSE107011_exp.txt"
In this process, all samples in the .h5ad file will be normalized (including the SDY67 samples). However, using this file, the test performance on SDY67 is still unsatisfactory (CCC = 0.15).
Did I miss anything?
GSE107011 is the Monaco et al. dataset. SDY67 is a different dataset. So I'm guessing your labels are wrong?
The labels are correct.
I'm sorry but you are saying the test performance is bad on SDY67 and at the same time you say you use GSE107011 for processing, which is clearly the wrong file for processing if you want to evaluate performance on SDY67. If you say the labels are correct then what are you using for processing?
I'm getting pretty confused here. It would really help if you give me a proper description of what you have been doing/are doing, as I have asked for now a few times. That is:
I'm assuming you are still running your pytorch version and I'm afraid I cannot help you with that. You should be able to reproduce the results with the Scaden version I have written here, but I need full details if it doesn't work in order to help you. Probably not today anymore though :-) But I'll have another look at this at the weekend.
Hi Kevin,
Thanks for your help. I still cannot run your code correctly. The detailed information is:
Ubuntu system, CUDA version: 10.1, TF version: 2.3.0, Python: 3.6.8.
I installed scaden via pip3 install scaden.
For the simulated data experiments, I prepared the data with scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/test_data/donorC_500_samples.txt", then used the command scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA' --steps 5000 --model_dir model.
However, with TF 2.3.0 the model runs into an error. After I downgraded the TF version to 2.0.0, scaden trains correctly but reports another error when saving models: ValueError: Model <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f0ec82b6400> cannot be saved because the input shapes have not been set. Usually, input shapes are automatically determined from calling .fit() or .predict(). To manually set the shapes, call model._set_inputs(inputs).
For the real bulk data, I preprocessed the data by scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/private_xiaohanxing/sc_deconv/bulk_data/GSE107011_exp.txt"
then used the command scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA,donorC' --steps 5000 --model_dir model. It runs into the same error as the simulated experiments.
Using the preprocessed data, I ran experiments with my PyTorch code. On the simulated data, it got comparable performance to your code. However, on the real bulk data, the performance is bad (SDY67 CCC = 0.15, GSE107011 CCC = 0.40).
So my question is:
Thanks a lot.
Hi @hathawayxxh ,
thanks for the explanation, that clears up things quite a bit!
Those errors are weird and I will see if I can reproduce them. But I haven't encountered them before. Definitely try the Docker image in the meantime. I will absolutely try to see if I can somehow reproduce this and hopefully fix it for you. It might take a while though.
Regarding your pytorch implementation - yes it is weird that you get good performance on the simulated data but not on the bulk data. But it is impossible for me to tell you why this is the case without looking at the code and trying to reproduce it. There are countless points where it could have failed, from data pre-processing, training, or even some differences between pytorch and tensorflow (although that shouldn't make such a difference). The implementation might just be different. So I'm afraid I can't help you out on that one!
But I will look into the issues with my package.
Hi Kevin,
Thanks for your reply and help.
Best, Xiaohan
Hi Kevin,
We re-ran the bulk data experiments with the CPU Docker image that you provided. The procedure was:
scaden process "pbmc_data.h5ad" "GSE107011_exp.txt"
scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA,donorC' --steps 5000 --model_dir model
or scaden train "processed.h5ad" --train_datasets 'data6k,data8k,donorA,donorC,sdy67' --steps 5000 --model_dir model
scaden predict --model_dir model GSE107011_exp.txt
Then, we computed the metrics for the predictions. For the model trained on the simulated datasets, the CCC on the PBMC2 dataset is 0.6038 (while the result in your paper is 0.68). For the model trained on the simulated datasets plus PBMC1, the CCC on the PBMC2 dataset is 0.7273 (while the result in your paper is 0.86).
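For transparency on the metric itself: CCC values like those above usually refer to Lin's concordance correlation coefficient. A sketch of the standard formula (not necessarily the exact evaluation script used in the paper):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient between two vectors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    # Population covariance between ground-truth and predicted fractions.
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Perfect agreement yields CCC == 1.0; disagreement pushes it toward 0 or below.
score = ccc([0.1, 0.2, 0.7], [0.1, 0.2, 0.7])
```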
Could you help check the preprocessing and training procedure? Were the training steps set to 5000 in your paper? Are there any procedures that I neglected?
Besides, the PBMC2 dataset I used in the experiments is attached. Could you help check whether this data file is the same as the one you used in the paper? GSE107011_exp.txt
Thanks a lot, and I look forward to your reply.
Best, Xiaohan
Hi Xiaohan,
that looks much better, although it still doesn't reach the performance we achieved in the paper ... At first glance, this looks alright. Did you also include the Monaco (PBMC2) dataset for prediction on PBMC1?
I'm pretty busy this week so I probably won't be able to look into this. Hopefully next week though!
Best, Kevin
Ah sorry, I misread - you only tested on the PBMC2 (Monaco) dataset and didn't get matching results. Okay, I'll have a look at this when I find some time.
The results are not terribly off, so I'm currently assuming it might just be some slight pre-processing difference. Which doesn't mean that's ideal - it should be more robust. I'll have a look at the file you provided. Thanks!
Hi, I had a quick look - it's not exactly the same; the file that I used had more genes in it: monaco_samples.txt
could you try with this?
Make sure to modify nothing manually, just run scaden process pbmc_data.h5ad monaco_samples.txt
The results won't be exactly the same (I only recently introduced code to make it fully reproducible; slight variability during training and prediction is expected due to non-deterministic operations on the GPU and random seeds), but they should be very close.
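On the reproducibility point: the usual source of run-to-run variability is unseeded random number generators (plus non-deterministic GPU kernels, which seeding alone cannot fix). A minimal numpy illustration of why pinning seeds matters; framework-level seeding in TensorFlow or PyTorch works analogously:

```python
import numpy as np

def sample_with_seed(seed):
    """Draw three uniform random numbers from a generator pinned to `seed`."""
    rng = np.random.default_rng(seed)
    return rng.random(3)

# Identical seeds give identical draws; different seeds generally do not.
a = sample_with_seed(42)
b = sample_with_seed(42)
c = sample_with_seed(7)
```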
Hi Kevin,
Thanks a lot for your reply. I will try with this data.
Best, Xiaohan
Hi Kevin,
I conducted experiments on the real bulk data (setting train_datasets to 'data6k,data8k,donorA,donorC' and the test dataset to 'sdy67'). I noticed that the 'sdy67' dataset is already included in the "pbmc_data.h5ad" file. Should I use these data directly, without preprocessing? I got unsatisfactory results (much lower than the metrics you reported in the paper): Can you suggest what is wrong with my experiments? Thanks a lot.
Best, Xiaohan