Closed vdet closed 3 years ago
Hi Vincent,
your pipeline is almost correct, but the process step is absolutely necessary. It ensures that the same genes are used, but more importantly it applies normalization and scaling to the data, which is crucial for successful machine learning.
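A minimal sketch of where `process` sits in the pipeline (file names here are placeholders, and the exact flags should be checked against `scaden --help`):

```shell
# Simulate training data from your scRNA-seq reference,
# then let `process` align genes with the bulk data and normalize/scale.
scaden simulate --data scRNA_dir/            # writes a training-data file (e.g. data.h5ad)
scaden process data.h5ad bulk_data.txt       # gene matching + normalization/scaling
```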
I would also advise opting for 5000 training steps, which is the value I found works best for most scenarios.
You should not log-transform the count data, as Scaden applies some processing to it internally as well, making sure it is on the same scale as the training data. That you get the same results is interesting, but might be caused by what I just said (the data will have different values but follow the same distribution, leading to the same predictions in the end).
The reason why it didn't work for your training run is that you (accidentally) loaded the trained models before. If you run Scaden in the same directory as before, it will look for pre-trained models and load them (maybe I should turn that behavior off, I'm now thinking?).
It tells you here:

```
Model parameters restored successfully
```
So best specify your model directory with `--model_dir my_model` during training and prediction. Then it should work!
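Concretely, that could look like the following sketch (`processed.h5ad` and `bulk_data.txt` are placeholder file names):

```shell
# Train into a fresh model directory so no stale model gets restored,
# then point prediction at that same directory.
scaden train processed.h5ad --steps 5000 --model_dir my_model
scaden predict --model_dir my_model bulk_data.txt
```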
(Your old model was trained on data of a different shape, by the way, causing the problem).
I have been working on a helper function for Scaden to make the training process clearer, but it's currently only available in the development version. So if you want an end-to-end pipeline, you can clone the repo, check out the development branch, cd into the directory and install with `pip install scaden`.
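As a sketch, the steps above would be (the repository URL is an assumption; `pip install .` installs the local checkout):

```shell
git clone https://github.com/KevinMenden/scaden.git  # assumed repo URL
cd scaden
git checkout development   # switch to the development branch
pip install .              # install the checked-out development code
```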
Then you can run `scaden example`, which will generate three small example files in your directory.
You can use them for a complete simulate → process → train → predict workflow! But I'm quite optimistic that it will work with your data as well :-)
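Putting the whole workflow together with the generated example files might look like this (the example file names and flags are assumptions; check what `scaden example` actually writes and `scaden --help` for the exact options):

```shell
scaden example                                       # generate small example files here
scaden simulate --data .                             # simulate training data from them
scaden process data.h5ad example_bulk_data.txt       # align genes, normalize and scale
scaden train processed.h5ad --steps 5000 --model_dir example_model
scaden predict --model_dir example_model example_bulk_data.txt
```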
Don't hesitate to come back with questions in case you still encounter issues!
Cheers, Kevin
Thanks so much Kevin. I now get results that start to look better. Vincent
Hi Kevin,
I am confused about 'simulate' and 'process'. Actually, a simple end-to-end example, starting from count matrices up to prediction, with examples of all input/intermediate/output files, would help a lot.
Practically, my inputs are
Is this the correct pipeline?
This runs without error messages, but the predictions are grossly wrong, so I wondered if I missed something. The single-cell and bulk data come from the same piece of tissue, so I am 100% sure the single cells match the cells within the bulks.
I was surprised that replacing 'bulk.txt' with its log2-transformed and [0,1]-scaled version yields the exact same prediction. Is that expected?
I tried to run 'process':
It completed, but then 'train' generated an error (see below). Is 'process' actually needed after 'simulate' if my single-cell and bulk matrices have the same genes in the same order?
Another question: my bulks are actually mini-bulks with 10-100 cells and very low coverage. What parameters would you advise for 'simulate'?
Again, thank you so much for your help.
Vincent