bunnech / cellot

Learning Single-Cell Perturbation Responses using Neural Optimal Transport

Are the cells in imputed.h5ad the simulated cells? #29

Closed JeanRadig closed 1 month ago

JeanRadig commented 1 month ago

Question: are the cells in imputed.h5ad the simulated cells?

Sub-question 1: why are the control cells not included in the imputed.h5ad?

Sub-question 2: why is the number of variables between original and imputed not kept equal?


Background:

Training the model on the 4i data as given in the vignette, I obtain a file imputed.h5ad. I notice several things and would like to make sure that the results in this file are indeed the imputed results of the model. I trained the model with

# Copy pasted vignette
python ./scripts/train.py --outdir ./results/4i/drug-cisplatin/model-cellot --config ./configs/tasks/4i.yaml --config ./configs/models/cellot.yaml --config.data.target cisplatin

And evaluated with

# Copy pasted vignette
python ./scripts/evaluate.py --outdir ./results/4i/drug-cisplatin/model-cellot --setting iid --where data_space

And obtained imputed.h5ad.
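For reference, a minimal sketch (assuming the anndata package; the exact path depends on where evaluate.py wrote the file) of how the contents below can be inspected:

# Hypothetical inspection snippet, not part of the vignette
import anndata as ad

imputed = ad.read_h5ad("imputed.h5ad")  # adjust the path to the evaluate.py output
print(imputed.shape)
for key in ["drug", "transport", "split"]:
    print(key, imputed.obs[key].value_counts().to_dict())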

The content of the imputed.h5ad file is as follows:

# Content of imputed.h5ad
Key: drug
control: 2199 # -> are those cells the simulated cells? 
cisplatin: 0  # -> values from imputed.h5ad do not correspond to
              #    any signals from the original dataset, suggesting
              #    they are simulated cells. If so, why are they not
              #    stored here?

Key: transport
source: 2199 # -> are those cells the simulated cells? If so, why are they
             #    stored under source?
             # -> where are the control cells then?

Key: split
test: 2199   # Indeed corresponds to 20% of the number of cisplatin cells
train: 0

The number of variables also differs from that of the original dataset.

# original data shape
(119479, 78)

# simulated data shape -> not the same number of variables 
(2199, 48)

    # AnnData object with n_obs × n_vars = 2199 × 48
    #    obs: 'drug', 'transport', 'split'
    # -> no information about the variable names chosen

And no information is available concerning which variables were kept for the imputed data.

The question is therefore the following:

  1. Are the cells in imputed.h5ad the simulated cells (as the name imputed.h5ad would indicate), or are they control/source cells (as the labels control and source would indicate)?
  2. If the answer is that the cells are the simulated ones, how can one compare the simulated cells to the original ones, if the original cells are not stored alongside the simulated data in the anndata?
bunnech commented 1 month ago

Cells in imputed.h5ad are CellOT's predictions of the perturbed states of cells in control.h5ad. As these are predictions, they are of course not equal to the cells in treated.h5ad, which contains the perturbed cells. CellOT learns to predict the perturbed states of cells based on the minimum-effort principle, i.e., during training we learn a function that maps cells in control.h5ad to their "corresponding" cells in treated.h5ad according to the optimal transport principle. At test time, we then map previously unseen control cells to their perturbed state (resulting in imputed.h5ad). If you are interested in simply computing an alignment between control and perturbed cells, take a look at standard OT solvers (without neural networks) as in OTT or POT.
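For completeness, a minimal sketch (assuming the POT library and placeholder data; this is not part of the CellOT codebase) of such a discrete OT alignment between control and perturbed cells:

# Illustrative only: discrete OT coupling between control and treated cells with POT
import numpy as np
import ot  # POT: Python Optimal Transport

X_control = np.random.rand(200, 48)  # placeholder control cells (cells x features)
X_treated = np.random.rand(250, 48)  # placeholder treated cells

a = np.full(X_control.shape[0], 1.0 / X_control.shape[0])  # uniform marginals
b = np.full(X_treated.shape[0], 1.0 / X_treated.shape[0])
M = ot.dist(X_control, X_treated)             # pairwise squared Euclidean costs
G = ot.sinkhorn(a, b, M / M.max(), reg=1e-2)  # entropic coupling matrix
# G[i, j] is the mass transported between control cell i and treated cell j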

JeanRadig commented 1 month ago

Thank you very much for your answer! When I run the vignette I only get the file imputed.h5ad, hence my confusion. I will see how I can also write out the control and the corresponding treated sets. Thank you for the information and the clarification.

JeanRadig commented 1 month ago

Question: what are the best settings when training cellot on single cell RNA sequencing data? (e.g. iid? ood? data_space? latent_space?)

Fetching control and treated .h5ad

In cellot/cellot/utils/evaluate.py, the control, treated and imputed data are returned by the function load_conditions:

def load_conditions(expdir, where, setting, embedding=None):
    ...
    return control, treated, imputed

These are then used in cellot/scripts/evaluate.py:

def main():
    def iterate_feature_slices():
        _, treateddf, imputed = load_conditions(
                expdir, where, setting, embedding=embedding)
    imputed.write(cache)
    imputeddf = imputed.to_df()     

But we see that only the imputed data is written to disk. We therefore modify the script in the following way to also save the control and treated data:

import anndata as ad

control_cache = outdir / 'control.h5ad'
treated_cache = outdir / 'treated.h5ad'

controldf, treateddf, imputed = load_conditions(
        expdir, where, setting, embedding=embedding)

imputed.write(cache)

# Save control cells in .h5ad format
controlad = ad.AnnData(X=controldf.values)
controlad.obs_names = controldf.index
controlad.var_names = controldf.columns
controlad.write(control_cache)

# Save treated cells in .h5ad format
treatedad = ad.AnnData(X=treateddf.values)
treatedad.obs_names = treateddf.index
treatedad.var_names = treateddf.columns
treatedad.write(treated_cache)
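Note: depending on your anndata version, ad.AnnData(controldf) with a pandas DataFrame should also work directly and will take obs_names and var_names from the DataFrame index and columns.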

Evaluation of control, treated and imputed on statefate dataset, iid, data space

  1. Model setting: independent and identically distributed (i.i.d.) mode.

I have trained the model in iid mode, in which the model sees part of the evaluation data, as defined in https://www.nature.com/articles/s41592-023-01969-x.

Definitions found for iid

(p.1763) Independent and identical distributed setting: models see cells from all patients

(p.1765) i.i.d. trained with additional access to half of the cells in the holdout sample

Hence, this mode should yield the best results.

  2. Reconstruction setting: data space (is this best practice?)

I have used the mode data_space instead of latent_space. I am asking myself whether this is the option to choose when using the cellot model. If both data space and latent space can be used, what is the best option in general?

The model was trained using:

python ./scripts/train.py --outdir ./results/statefate/model-cellot/iid --config ./configs/tasks/statefate-in_vitro-iid.yaml --config ./configs/models/cellot.yaml 

And evaluated using:

python ./scripts/evaluate.py --outdir ./results/statefate/model-cellot/iid --setting iid --where data_space

And I obtained the following AnnData files, as described earlier:

control.h5ad
treated.h5ad
imputed.h5ad

Visualising the UMAPs of the different AnnData objects, we observe the following clusterings. The treated and imputed cells do not overlap significantly. Have I trained the model using the correct settings?

[UMAP plots of the control, treated, and imputed cells]
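For reference, a minimal sketch (assuming scanpy; the parameters are illustrative, not the authors' plotting settings) of how comparable UMAPs can be produced from the three files:

import anndata as ad
import scanpy as sc

for name in ["control", "treated", "imputed"]:
    adata = ad.read_h5ad(f"{name}.h5ad")
    sc.pp.pca(adata, n_comps=30)             # reduce dimensionality before the kNN graph
    sc.pp.neighbors(adata, n_neighbors=15)
    sc.tl.umap(adata, random_state=0)
    sc.pl.umap(adata, title=name, save=f"_{name}.png")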

Is this the correct way to use cellot?

stefangstark commented 1 month ago

Hi Jean -- I am not 100% sure what the question is, but I hope this can clear up some confusion.

At its core, CellOT is training a neural network to predict single-cell treatment responses. The "best" use case will depend on the nature of your application.

RE: data space vs latent space: Because the dimensionality of scRNA-seq data is large (>> 1k features) and we need to rely on a Euclidean transport cost, we perform the transport on embeddings of cells in a manageable representation space, i.e. the latent space of an autoencoder. To be consistent with other models, we compute all of our metrics in the data space, on the decoded predictions. If you have a data modality of ~100 features, you can likely run CellOT directly on the data itself. This is how we approached the proteomics 4i datasets.
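To illustrate the order of operations, a conceptual sketch only; encode, decode, and transport_map below are hypothetical stand-ins, not functions from the CellOT codebase:

import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((1000, 50))       # data (e.g. 1000 genes) -> 50-dim latent
W_dec = rng.standard_normal((50, 1000))       # latent -> data

def encode(x): return x @ W_enc               # stand-in for the trained encoder
def decode(z): return z @ W_dec               # stand-in for the trained decoder
def transport_map(z): return z + 0.1          # stand-in for the learned OT map

x_control = rng.standard_normal((32, 1000))   # unseen control cells in data space
z_pushed = transport_map(encode(x_control))   # transport is applied in latent space
x_imputed = decode(z_pushed)                  # predictions are decoded back to data space,
                                              # where the evaluation metrics are computed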

RE: iid vs ood: This essentially affects the behavior of the dataloader at evaluation time. In an IID setting, the training and evaluation sets are drawn from the same distribution, e.g. from the same sample. The OOD setting is more challenging, as it asks the model to generalize to, e.g., unseen samples. In this setting there is a distributional shift in the test set.

Again, the choices here entirely depend on your research question or application. I hope this helps!

JeanRadig commented 1 month ago

Hi Stefan, that's very clear.

To summarise: given scRNA-seq data with an 80/20 train/validation split, if I want to predict the effect of a condition on my control cells, I should train in ood mode (deriving the effect of a condition on unseen cells) and in the latent space (because scRNA-seq contains thousands of variables).

Sounds reasonable?
Thank you very much for your time and your help

stefangstark commented 1 month ago

Yes so definitely train in the latent space.

As for IID vs OOD this depends on how you want to use the model. It sounds like you have "one" dataset and you want to understand how the control cells responded to the treatment, for which you have already observed treated cells. If this is the case then I would consider this an "IID" setting. And here you can also consider the other tools Charlotte mentioned.

While it is totally valid to apply CellOT in this IID setting, what differentiates CellOT from these other tools is its ability to apply OT-powered predictions as a parameterized function. This allows CellOT to predict the responses of control cells without observing any of their treated states. For instance, say you have a cohort of samples for which you have observed their treatment responses. You can train CellOT to learn these responses in order to predict the responses of an incoming sample without measuring its treatments. We call this task "OOD", since the cells and responses from the incoming sample exhibit some distributional shift relative to the training set. Predicting the responses of incoming samples can also be achieved with the auto-encoder approaches we discuss in the paper, but typical OT methods assume that your target distribution is somehow observed, and so they cannot be applied to this prediction task.

JeanRadig commented 1 month ago

I have another question, this time concerning plotting. I did not manage to find the parameters you used to create the UMAPs. Could you point me to where I can find them, or paste them here if you have them? PS: I know from https://github.com/bunnech/cellot/issues/24 that plotting is no longer supported, but I just wanted to create UMAPs with the same initial seeds as yours.

bunnech commented 1 month ago

As mentioned in previous issues, we no longer support plot.py. As your other questions seem to be answered, I will proceed and close this issue.