MarioniLab / oor_design_reproducibility

Model non-reproducible #18

Closed Nusob888 closed 1 year ago

Nusob888 commented 1 year ago

Hi,

Thanks for the great manuscript, I read it with great interest. I wanted to flag up a few issues that I came across when trying to apply your model to my own data:

1) The model deposited on figshare will not load in scvi-tools (have tried version 0.17.14 and the latest 0.19)

1a) The error being thrown is "KeyError: 'registry_'"

1b) Additionally, before this it throws a warning stating that the var_names in the PBMC_merged.normal.subsample500cells.clean_celltypes.h5ad file do not match, or are not ordered the same as, those that were used to generate the model.

2) Having tracked back through your reproducibility code 20220601_PBMC_scVI.ipynb, I noted an oddity. When you generate the reference model you use categorical_covariate_keys=['assay_ontology_term_id'].

2a) My understanding of scArches was that the current iteration cannot take additional categorical covariates other than batch. Usually scvi-tools will throw this error "scArches currently does not support models with extra categorical covariates", and will not instantiate the model.

2b) Therefore, I am now unsure how you performed all tasks downstream if you could not instantiate a query model on the reference model provided in the reproducibility code.
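The var_names warning in 1b can be examined without scvi-tools at all. The following is a minimal sketch, assuming the gene names the model was trained on can be obtained as a plain list; `model_genes` and `compare_var_names` are hypothetical names used for illustration and are not part of scvi-tools or of the repository.

```python
def compare_var_names(adata_genes, model_genes):
    """Report how two gene-name lists differ in content and in order."""
    adata_genes = list(adata_genes)
    model_genes = list(model_genes)
    adata_set = set(adata_genes)
    model_set = set(model_genes)
    return {
        # genes the model expects but the data lacks
        "missing_from_adata": [g for g in model_genes if g not in adata_set],
        # genes present in the data but unknown to the model
        "extra_in_adata": [g for g in adata_genes if g not in model_set],
        # True only if both lists are identical, element by element
        "same_order": adata_genes == model_genes,
    }


# Toy gene lists, for illustration only
report = compare_var_names(
    adata_genes=["CD3E", "CD19", "MS4A1"],
    model_genes=["CD19", "CD3E", "NKG7"],
)
print(report)
```

With an AnnData object, `adata.var_names` can be passed as `adata_genes`. If nothing is missing, subsetting with `adata[:, model_genes].copy()` should put the genes in the model's order.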

Any help clarifying the above would be much appreciated

emdann commented 1 year ago

Hi @Nusob888, thank you for your interest in our study; feedback on reproducibility is always much appreciated!

> The model deposited on figshare will not load in scvi-tools (have tried version 0.17.14 and the latest 0.19)
> 1a) The error being thrown is "KeyError: 'registry_'"
> 1b) Additionally, it throws a warning before this to state that the var_names in the PBMC_merged.normal.subsample500cells.clean_celltypes.h5ad file are not matched or ordered the same as those were used to generate the model.

Thank you for pointing this out. This model was supposedly trained with scvi-tools==0.16.2, but I am also having trouble loading it with the correct input data. I will look into this issue and share a new model, with the version indicated and the adata.h5ad stored alongside it, as soon as possible. Please note that this model was used only for harmonizing annotations in the PBMC dataset. The annotations were used to define OOR populations in the simulations but, apart from that, none of the analyses presented in the manuscript use this model (it includes all data from the 13 studies considered, whereas throughout the study certain cells are held out as controls or as the OOR state).

While I solve this issue, you could try using the healthy PBMC model that serves as the atlas dataset in the COVID analysis (model_COVID19_reference_atlas_scvi0.16.2.zip) for your analyses. This model should be qualitatively very similar to the one above, as it is trained on data from 12 of the 13 studies used for the annotation model (missing only the healthy samples from the Stephenson dataset). It was trained with scvi-tools v0.16.2. You should be able to load the model (including the anndata object used for training, which is saved in the zip) by running:

import scvi
vae = scvi.model.SCVI.load('model_COVID19_reference_atlas_scvi0.16.2')
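Since the models here are tied to a specific scvi-tools release, it may help to compare the installed version against the one named in the model directory before loading. This is a minimal sketch using only the standard library; it assumes the directory name ends in a `scvi<version>` tag, as the one above does, and the helper names are hypothetical.

```python
import re
from importlib import metadata


def expected_version(model_dir):
    """Extract a trailing 'scvi<version>' tag from a model directory name, if present."""
    match = re.search(r"scvi(\d+(?:\.\d+)+)$", model_dir.rstrip("/"))
    return match.group(1) if match else None


def installed_version(package="scvi-tools"):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_match(model_dir, installed):
    """True only if the directory name carries a version tag equal to `installed`."""
    expected = expected_version(model_dir)
    return expected is not None and installed == expected


model_dir = "model_COVID19_reference_atlas_scvi0.16.2"
print("expected: ", expected_version(model_dir))
print("installed:", installed_version())
print("match:    ", versions_match(model_dir, installed_version()))
```

A mismatch does not necessarily mean loading will fail; the check only flags a difference worth ruling out first.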

> Having tracked back through your reproducibility code 20220601_PBMC_scVI.ipynb, I noted an oddity. When you generate the reference model you use categorical_covariate_keys=['assay_ontology_term_id'].
> 2a) My understanding of scArches was that the current iteration cannot take additional categorical covariates other than batch. Usually scvi-tools will throw this error "scArches currently does not support models with extra categorical covariates", and will not instantiate the model.
> 2b) Therefore, I am now unsure how you performed all tasks downstream if you could not instantiate a query model on the reference model provided in the reproducibility code.

As noted above, the model trained in the notebook you referenced was used only for harmonizing annotations, but no query mapping with scArches was performed on this model. The script to train models used for the simulation experiments is here (using wrapper in the python package) and for the COVID-19 experiments is here.

Nusob888 commented 1 year ago

Hi @emdann Thanks for such a swift response! That all makes sense. Apologies, I had assumed the COVID19_reference_atlas was trained on a COVID only dataset.

Thanks so much for re-directing me to the correct scripts. With the clarifications, point 1 is no longer relevant to testing the scArches approach, but I will keep this issue open until you have debugged the old model. Feel free to close sooner.

Nusob888 commented 1 year ago

@emdann,

The COVID reference model worked a treat! The embeddings of our data look fab. Thanks again!

emdann commented 1 year ago

That's so great to hear! Thanks for sharing :)