facebookresearch / CPA

The Compositional Perturbation Autoencoder (CPA) is a deep generative framework to learn effects of perturbations at the single-cell level. CPA performs OOD predictions of unseen combinations of drugs, learns interpretable embeddings, estimates dose-response curves, and provides uncertainty estimates.

Bug in sciplex3 preprocessing? #4

Open mughetto opened 2 years ago

mughetto commented 2 years ago

Hi there,

I've been trying to reproduce the training on sciplex but I get this error with a brand new clone, datasets and conda env:

$ python -m compert.train --dataset_path datasets/sciplex3_new.h5ad --save_dir /tmp --max_epochs 1 --doser_type sigm

Traceback (most recent call last):
  File "/home/kcvc236/miniconda3/envs/CPAvanilla/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kcvc236/miniconda3/envs/CPAvanilla/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kcvc236/CPAvanilla/CPA/compert/train.py", line 303, in <module>
    train_compert(parse_arguments())
  File "/home/kcvc236/CPAvanilla/CPA/compert/train.py", line 197, in train_compert
    autoencoder, datasets = prepare_compert(args)
  File "/home/kcvc236/CPAvanilla/CPA/compert/train.py", line 167, in prepare_compert
    datasets = load_dataset_splits(
  File "/home/kcvc236/CPAvanilla/CPA/compert/data.py", line 189, in load_dataset_splits
    "training": dataset.subset("train", "all"),
  File "/home/kcvc236/CPAvanilla/CPA/compert/data.py", line 129, in subset
    return SubDataset(self, idx)
  File "/home/kcvc236/CPAvanilla/CPA/compert/data.py", line 161, in __init__
    self.ctrl_name = dataset.ctrl_name[0]
IndexError: list index out of range

I strongly suspect the problem is in the sciplex preprocessing notebook: https://github.com/facebookresearch/CPA/blob/main/preprocessing/sciplex3.ipynb

Cell #6 is probably the culprit: it seems to make it impossible for adata.obs.control to be anything other than 0, hence the error above.
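
For readers following along, here is a minimal sketch of how an all-zero control column could lead to this IndexError. It assumes, without confirmation from the thread, that the list of control names is built from the cells whose control flag is 1. The helper `control_names` and the toy values are hypothetical stand-ins, not code from compert/data.py.

```python
def control_names(drug_names, control_flags):
    """Return the distinct drug names of cells flagged as control (flag == 1).

    Hypothetical stand-in for however the real dataset class derives ctrl_name.
    """
    return sorted({name for name, flag in zip(drug_names, control_flags) if flag == 1})


# Toy data (assumed): every control flag is 0, as suspected for sciplex3_new.h5ad.
drug_names = ["drugA", "drugB", "drugA"]
control_flags = [0, 0, 0]

names = control_names(drug_names, control_flags)  # empty list: no control cells

try:
    first_control = names[0]  # same kind of lookup as dataset.ctrl_name[0]
except IndexError as err:
    first_control = None
    error_message = str(err)  # the message Python reports for an empty list
```

If that assumption holds, any preprocessing step that leaves the control column at 0 for every cell would trigger the crash before training starts.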

Do you have a working version or a fix you could share, please?

Cheers

bhomass commented 1 year ago

I concur. sciplex3_new.h5ad, which is created by processing sciplex_rawchunk{i}.h5ad, does not contain the value "Vehicle_1.0" in adata.obs.drug_dose_name.values at all, so no cells have adata.obs['control'] == 1. Without control samples, the training crashes.
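
A quick way to check this on a given file is to count the labels and look for the expected control value. The sketch below works on a plain list of strings so that it runs on its own. With a loaded AnnData object, the list could presumably be obtained with `list(adata.obs["drug_dose_name"])`. The helper name `summarize_control_label` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def summarize_control_label(labels, expected_control):
    """Count cells carrying the expected control label and list likely alternatives."""
    counts = Counter(labels)
    # Labels that look like controls, in case the expected one is absent.
    candidates = sorted(
        label
        for label in counts
        if "control" in label.lower() or "vehicle" in label.lower()
    )
    return {
        "n_cells": len(labels),
        "n_control": counts.get(expected_control, 0),
        "candidates": candidates,
    }


# Toy labels (assumed), mimicking a file without any "Vehicle_1.0" entries.
labels = ["control_0.0", "drugA_0.1", "drugA_1.0", "control_0.0"]
report = summarize_control_label(labels, "Vehicle_1.0")
```

An `n_control` of 0 together with a non-empty `candidates` list would suggest that the control cells exist but carry a different label than the one the preprocessing looks for.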

What is the solution? Use some other drug_dose_name value as the control?

bhomass commented 1 year ago

I think it should be control_0.0.
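
A hedged sketch of what that change could look like: recompute the control flag by matching "control_0.0" instead of "Vehicle_1.0", and fail early if nothing matches. Whether "control_0.0" is really the right label for this dataset is the suggestion above, not something verified here. `flag_controls` is a hypothetical helper operating on a plain list of labels, not part of CPA.

```python
def flag_controls(labels, control_label="control_0.0"):
    """Return a 0/1 flag per cell, 1 where the label equals control_label.

    Raises ValueError when no cell matches, so the problem surfaces during
    preprocessing instead of as an IndexError at training time.
    """
    flags = [1 if label == control_label else 0 for label in labels]
    if sum(flags) == 0:
        raise ValueError(f"no cells labelled {control_label!r}; check the control name")
    return flags


# Toy labels (assumed); in the notebook these would come from drug_dose_name.
labels = ["control_0.0", "drugA_0.1", "control_0.0", "drugB_1.0"]
flags = flag_controls(labels)
```

The resulting flags could then be written back to the control column before the file is saved. The early ValueError is a design choice so that a wrong label name is caught at preprocessing time.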