Cells in imputed.h5ad are CellOT's predictions of the perturbed states of cells in control.h5ad. As these are predictions, they are of course not equal to the cells in treated.h5ad, which contains the actually perturbed cells. CellOT learns to predict the perturbed states of cells based on the minimum-effort principle, i.e., during training we learn a function that maps cells in control.h5ad to their "corresponding" cells in treated.h5ad according to the optimal transport principle. At test time, we then map previously unseen control cells to their perturbed states (resulting in imputed.h5ad).
If you are interested in simply computing an alignment between control and perturbed cells, take a look at standard OT solvers (without neural networks) such as OTT or POT.
Thank you very much for your answer! When I run the vignette I only get the file imputed.h5ad, therefore my confusion. I will see how I can ensure that also the control and the corresponding treated sets are printed out. Thank you for the information and the clarification.
In cellot/cellot/utils/evaluate.py the control, treated, and imputed data are returned by the function load_conditions:

def load_conditions(expdir, where, setting, embedding=None):
    ...
    return control, treated, imputed
These are then used in cellot/scripts/evaluate.py:

def main():
    def iterate_feature_slices():
        _, treateddf, imputed = load_conditions(
            expdir, where, setting, embedding=embedding)
        imputed.write(cache)
        imputeddf = imputed.to_df()
But we see that only the imputed data is saved to disk. We therefore modify it as follows to also save the control and treated data:

import anndata as ad

control_cache = outdir / 'control.h5ad'
treated_cache = outdir / 'treated.h5ad'

controldf, treateddf, imputed = load_conditions(
    expdir, where, setting, embedding=embedding)
imputed.write(cache)

# Save control cells in .h5ad format
controlad = ad.AnnData(X=controldf.values)
controlad.obs_names = controldf.index
controlad.var_names = controldf.columns
controlad.write(control_cache)

# Save treated cells in .h5ad format
treatedad = ad.AnnData(X=treateddf.values)
treatedad.obs_names = treateddf.index
treatedad.var_names = treateddf.columns
treatedad.write(treated_cache)
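Once saved, the three files can be read back with anndata for downstream comparison, for example (a minimal sketch; the paths assume the cache names defined above):

import anndata as ad

control = ad.read_h5ad('control.h5ad')
treated = ad.read_h5ad('treated.h5ad')
imputed = ad.read_h5ad('imputed.h5ad')

# each object is (n_cells, n_vars); the imputed cells are predictions for the control cells
print(control.shape, treated.shape, imputed.shape)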
I have trained the model in iid mode, which sees part of the evaluation data, as defined in the paper (https://www.nature.com/articles/s41592-023-01969-x). Definitions found for iid:
(p. 1763) Independent and identically distributed setting: models see cells from all patients.
(p. 1765) i.i.d. trained with additional access to half of the cells in the holdout sample.
Hence, this mode should yield the best results.
I have used the mode data_space instead of latent_space. I am wondering whether this is the right option to choose when using the CellOT model. If both data space and latent space can be used, which is the best option in general?
The model was trained using:
python ./scripts/train.py --outdir ./results/statefate/model-cellot/iid --config ./configs/tasks/statefate-in_vitro-iid.yaml --config ./configs/models/cellot.yaml
And evaluated using:
python ./scripts/evaluate.py --outdir ./results/statefate/model-cellot/iid --setting iid --where data_space
And I received the following AnnData files, as described earlier:
control.h5ad
treated.h5ad
imputed.h5ad
Visualising the UMAPs of the different AnnData objects, we observe the following clusterings. The treated and imputed cells do not overlap significantly. Have I trained the model using the correct settings?
Is this the correct way to use cellot?
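For reference, overlaid UMAPs of the three outputs can be computed with scanpy along these lines (a minimal sketch, not the repository's plot.py; the parameters are illustrative):

import anndata as ad
import scanpy as sc

control = ad.read_h5ad('control.h5ad')
treated = ad.read_h5ad('treated.h5ad')
imputed = ad.read_h5ad('imputed.h5ad')

# label each source and concatenate into one object for a joint embedding
combined = ad.concat(
    {'control': control, 'treated': treated, 'imputed': imputed},
    label='source',
)

sc.pp.pca(combined)
sc.pp.neighbors(combined)
sc.tl.umap(combined, random_state=0)
sc.pl.umap(combined, color='source')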
Hi Jean -- I am not 100% sure what the question is, but I hope this can clear up some confusion.
At its core, CellOT is training a neural network to predict single-cell treatment responses. The "best" use case will depend on the nature of your application.
RE: data space vs latent space: Because the dimensionality of scRNA-seq is large (>> 1k) and we need to rely on a Euclidean transport cost, we perform the transportation on embeddings of cells in a manageable representation space, i.e. the latent space of an autoencoder. To be consistent with other models we compute all of our metrics in the data space, on the decoded predictions. If you have a data modality of ~100 features you can likely run CellOT directly on the data itself. This is how we approached the proteomics 4i datasets.
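Schematically, the latent-space workflow looks like this (a sketch with placeholder functions only, not the actual cellot API; the trivial stand-ins just show the data flow):

import numpy as np

# Placeholder stand-ins, NOT the cellot package: an "encoder"/"decoder" pair
# and a learned OT map, replaced here by trivial functions.
def encoder(x):        # data space -> latent space
    return x @ np.random.rand(x.shape[1], 8)

def transport_map(z):  # learned OT map acting in latent space
    return z + 0.1

def decoder(z):        # latent space -> data space
    return z @ np.random.rand(8, 50)

x_control = np.random.rand(100, 50)        # control cells in data space
z_control = encoder(x_control)             # 1) embed into the AE latent space
z_predicted = transport_map(z_control)     # 2) transport in the latent space
x_predicted = decoder(z_predicted)         # 3) decode; metrics are computed here, in data space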
RE: iid vs ood: This essentially affects the behavior of the dataloader at evaluation time. In an IID setting, the training and evaluation sets are drawn from the same distribution, e.g. from the same sample. The OOD setting is more challenging, as it asks the model to generalize to, e.g., unseen samples. In this setting there is a distributional shift in the test set.
Again, the choices here entirely depend on your research question or application. I hope this helps!
Hi Stefan, that's very clear.
To summarise: given scRNA-seq data, if I want to be able to predict the effect of a condition on my control cells (with an 80/20 train/val split), I should train in ood mode (deriving the effect of a condition on unseen cells) in the latent space (because scRNA-seq contains thousands of variables).
Sounds reasonable?
Thank you very much for your time and your help
Yes so definitely train in the latent space.
As for IID vs OOD this depends on how you want to use the model. It sounds like you have "one" dataset and you want to understand how the control cells responded to the treatment, for which you have already observed treated cells. If this is the case then I would consider this an "IID" setting. And here you can also consider the other tools Charlotte mentioned.
While it is totally valid to apply CellOT in this IID setting, what differentiates CellOT from these other tools is its ability to apply OT-powered predictions as a parameterized function. This allows CellOT to predict the responses of control cells without observing any of their treated states. For instance, say you have a cohort of samples for which you have observed the treatment responses. You can train CellOT to learn these responses in order to predict the responses of an incoming sample without measuring its treatments. This task is what we call "OOD", since the cells and responses from the incoming sample exhibit some distributional shift relative to the training set. Predicting the responses of incoming samples can also be achieved with the autoencoder approaches we discuss in the paper, but the typical OT method assumes that your target distribution is somehow observed, and so it cannot be applied to this prediction task.
I have another question, concerning the plotting this time. I did not manage to find the parameters that you have used to create the umaps. Could you point me to the place where I can find them, or paste them here, if you have them? PS: I know from https://github.com/bunnech/cellot/issues/24 that plotting is not supported anymore, but I just wanted to create umaps with the same initial seeds as yours.
As mentioned in previous issues, we no longer support plot.py. As your other questions seem to be answered, I will proceed and close this issue.
Question: are the cells in imputed.h5ad the simulated cells?
Sub-question 1: why are the control cells not included in the imputed.h5ad?
Sub-question 2: why is the number of variables between original and imputed not kept equal?
Background:
Training the model on the 4i data as given in the vignette, I obtain a file imputed.h5ad. I notice several things and would like to ensure that the results from the file are indeed the imputed results of the model. I trained the model with
And evaluated with
And obtained imputed.h5ad.
The content of the imputed.h5ad file is as follows:
Where the number of variables differs from that of the original dataset.
And no information is available concerning which variables were kept for the imputed data.
The questions are therefore the ones stated at the top of this post.