Query dataset misses genes that were used for reference model training

Xinlei-Gao commented 12 months ago

Dear author,

I successfully ran the Jupyter notebook "tabula_sapiens_tutorial.ipynb" using the sample data it provides. However, when I tried to use my own query data, I encountered the following error:

AssertionError                            Traceback (most recent call last)
Cell In[35], line 3
      1 from popv.preprocessing import Process_Query
----> 3 adata = Process_Query(
      4     query_adata,
      5     ref_adata,
      6     query_labels_key=query_labels_key,
      7     query_batch_key=query_batch_key,
      8     ref_labels_key=ref_labels_key,
      9     ref_batch_key=ref_batch_key,
     10     unknown_celltype_label=unknown_celltype_label,
     11     save_path_trained_models=output_model_fn,
     12     cl_obo_folder="./PopV/ontology/",
     13     prediction_mode="inference",  # 'fast' mode gives fast results (does not include BBKNN and Scanorama and makes more inaccurate predictions)
     14     n_samples_per_label=n_samples_per_label,
     15     use_gpu=0,
     16     compute_embedding=True,
     17     hvg=None,
     18 ).adata

File /lab/BaRC_projects/scRNAseq_cell_type_classification/scRNAseq_cell_type_venv/lib/python3.8/site-packages/popv/preprocessing.py:142, in Process_Query.__init__(self, query_adata, ref_adata, ref_labels_key, ref_batch_key, query_labels_key, query_batch_key, query_layers_key, prediction_mode, cl_obo_folder, unknown_celltype_label, n_samples_per_label, pretrained_scvi_path, save_path_trained_models, hvg, use_gpu, compute_embedding, return_probabilities)
    139     self.genes = list(pretrained_scvi_genes)
    141 if self.genes is not None:
--> 142     assert set(self.genes).issubset(
    143         set(query_adata.var_names)
    144     ), "Query dataset misses genes that were used for reference model training. Retrain reference model, set mode='retrain'"
    145     self.query_adata = query_adata[:, self.genes].copy()
    146     assert (
    147         hvg is None
    148     ), "Highly variable gene selection is not available if using trained reference model."

AssertionError: Query dataset misses genes that were used for reference model training. Retrain reference model, set mode='retrain'

It seems that my query data doesn't contain all the genes used for reference model training.

I checked the gene names of my query data and reference adata:

query_adata.var_names

CategoricalIndex(['UBE2B', 'CNST', 'CYP4F35P', 'KRTAP5-6', 'SLC24A3', 'ROCK1P1', 'MIR8088', 'ZNF518A', 'OLFM2', 'CCDC30', ... 'MIR3621', 'HP', 'EHD4', 'SFRP5', 'SNORD20', 'OR56A1', 'ANKRD62', 'C1orf146', 'SYK', 'B3GNT6'], categories=['A1BG', 'A1BG-AS1', 'A1CF', 'A2M', ..., 'ZYG11B', 'ZYX', 'ZZEF1', 'ZZZ3'], ordered=False, dtype='category', name='feature_name', length=24857)

ref_adata.var_names

Index(['DDX11L1', 'WASH7P', 'MIR6859-1', 'MIR1302-2HG', 'MIR1302-2', 'FAM138A', 'OR4G4P', 'OR4G11P', 'OR4F5', 'RP11-34P13.7', ... 'MT-ND4', 'MT-TH', 'MT-TS2', 'MT-TL2', 'MT-ND5', 'MT-ND6', 'MT-TE', 'MT-CYB', 'MT-TT', 'MT-TP'], dtype='object', name='feature_name', length=58559)

The reference adata has 58559 features while my query data has only 24857, so I cannot proceed.
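Note that the assertion in the traceback compares gene names (the pretrained model's genes against query_adata.var_names), not feature counts. A minimal sketch for listing which genes are absent is shown here. The helper name missing_genes and the toy gene lists are hypothetical stand-ins for the model's gene list and the query's var_names.

```python
def missing_genes(model_genes, query_genes):
    """Return the model genes absent from the query, keeping model order."""
    query_set = set(query_genes)  # set lookup keeps this linear in the number of genes
    return [gene for gene in model_genes if gene not in query_set]


# Toy lists standing in for the pretrained model's genes and query_adata.var_names.
model_genes = ["MT-ND5", "SYK", "HP", "WASH7P"]
query_genes = ["UBE2B", "SYK", "HP", "EHD4"]

absent = missing_genes(model_genes, query_genes)
print(f"{len(absent)} of {len(model_genes)} model genes are missing: {absent}")
```

An empty result would mean the subset check passes. A non-empty one can also point to a naming mismatch (for example different symbol versions) rather than genes that are truly unmeasured.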

How can I deal with this issue? Is there any way I can modify my query data to include all the features in the reference? Or should I modify the reference model to only retain the features in my query data?

Thank you for your help and suggestions!

Best,

Xinlei

canergen commented 12 months ago

To use the pretrained models, all genes indeed have to be present in the input dataset. If you use mode='retrain', as the error message suggests, it will select a new set of overlapping genes and you will not face the problem. Otherwise, you can take the scVI or scANVI model and fill your dataset with zeros for all missing genes using: scvi.model.SCANVI.prepare_query_anndata(query_data, local_dir), where local_dir is the path to the folder with the trained SCANVI model. Afterwards, query_data contains new zero-filled columns for the missing genes. This might impact performance for PopV, though.
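To make the zero-filling idea concrete, here is a conceptual sketch on plain Python lists, with cells as rows and genes as columns as in AnnData. It is not the scvi-tools implementation: the helper pad_to_reference and the toy data are hypothetical, and the sketch assumes that genes found only in the query are dropped so the result matches the reference gene order.

```python
def pad_to_reference(counts, query_genes, reference_genes):
    """Reorder count columns to reference_genes, using 0 for genes the query lacks."""
    column_of = {gene: i for i, gene in enumerate(query_genes)}
    return [
        [row[column_of[gene]] if gene in column_of else 0 for gene in reference_genes]
        for row in counts
    ]


# Two cells measured on three query genes (toy values).
counts = [[5, 0, 2], [1, 3, 0]]
query_genes = ["SYK", "HP", "EHD4"]
# The reference model expects these genes, in this order; MT-ND5 is not in the query.
reference_genes = ["HP", "MT-ND5", "SYK"]

padded = pad_to_reference(counts, query_genes, reference_genes)
print(padded)
```

Every padded gene looks like a gene with no expression in any cell, which is one reason predictions may degrade when many reference genes are missing.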

Laolga commented 11 months ago

Where exactly should mode='retrain' go?

It doesn't look like it goes into the Process_Query function:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/data/olga/xenium_panel/cell_type_anno.ipynb Cell 13 line 3
      1 from popv.preprocessing import Process_Query
----> 3 adata = Process_Query(
      4     query_adata,
      5     ref_adata,
      6     query_labels_key=query_labels_key,
      7     query_batch_key=query_batch_key,
      8     ref_labels_key=ref_labels_key,
      9     ref_batch_key=ref_batch_key,
     10     unknown_celltype_label=unknown_celltype_label,
     11     save_path_trained_models=output_model_fn,
     12     cl_obo_folder="./PopV/resources/ontology/",
     13     prediction_mode="inference",  # 'fast' mode gives fast results (does not include BBKNN and Scanorama and makes more inaccurate predictions)
     14     n_samples_per_label=n_samples_per_label,
     15     accelerator="cuda",
     16     compute_embedding=True,
     17     hvg=None,
     18     mode='retrain'
     19 ).adata

TypeError: __init__() got an unexpected keyword argument 'mode'

canergen commented 11 months ago

My message above was incomplete. The argument needs to be prediction_mode="retrain" in Process_Query.
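As a rough sketch of the corrected call, the wrapper below forces prediction_mode="retrain" and rejects the stray mode keyword before it reaches Process_Query. The wrapper name process_query_retrain is hypothetical. The remaining keyword arguments are assumed to be the ones already used in the calls above, and their exact names depend on the installed PopV version (for example use_gpu versus accelerator).

```python
def process_query_retrain(query_adata, ref_adata, **kwargs):
    """Run PopV's Process_Query with prediction_mode="retrain" and return its adata."""
    if "mode" in kwargs:
        # Process_Query has no 'mode' parameter; the setting is called prediction_mode.
        raise TypeError("use prediction_mode='retrain' instead of mode='retrain'")
    kwargs["prediction_mode"] = "retrain"  # overrides e.g. "inference"

    # Imported lazily so the sketch can be defined without PopV installed.
    from popv.preprocessing import Process_Query

    return Process_Query(query_adata, ref_adata, **kwargs).adata


# Usage, reusing the variables from the notebook cells above:
# adata = process_query_retrain(
#     query_adata,
#     ref_adata,
#     query_labels_key=query_labels_key,
#     query_batch_key=query_batch_key,
#     ref_labels_key=ref_labels_key,
#     ref_batch_key=ref_batch_key,
#     unknown_celltype_label=unknown_celltype_label,
#     save_path_trained_models=output_model_fn,
# )
```

Calling Process_Query directly with prediction_mode="retrain" in place of prediction_mode="inference" does the same thing; the wrapper only guards against repeating the mode typo.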

YosefLab / PopV

Query dataset misses genes that were used for reference model training #32