OmicsML / CellPLM

Official repo for CellPLM: Pre-training of Cell Language Model Beyond Single Cells.
BSD 2-Clause "Simplified" License

Bugs of running own codes for imputation #8

Open HelloWorldLTY opened 7 months ago

HelloWorldLTY commented 7 months ago

Hi, I tried to impute my own spatial dataset (mouse) by following the imputation tutorial. However, the run fails with this error:

ValueError: None of AnnData.var.index found in pre-trained gene set. In case the input gene names are gene symbols, please enable `ensembl_auto_conversion`, or manually convert gene symbols to ensembl ids in the input dataset.

I checked that my dataset is indexed by gene names (all upper-case here, since I tried to use orthologous genes).


wehos commented 7 months ago

Sorry for the inconvenience. Our method uses Ensembl ids as the gene index. We provide an automatic method, based on mygene, to map gene names to Ensembl ids here.
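
As a rough illustration of what such a conversion involves (a sketch only, not CellPLM's actual implementation): the helper `symbols_to_ensembl` below is a hypothetical name, and the `lookup` dictionary stands in for a symbol-to-id table that one might build from a mygene query. Symbols without a match are reported separately rather than kept under a placeholder name.

```python
from typing import Dict, List, Tuple


def symbols_to_ensembl(
    symbols: List[str], symbol_to_id: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Map gene symbols to Ensembl ids.

    Returns (mapped ids, unmapped symbols). Hypothetical helper, not part of CellPLM.
    """
    mapped, unmapped = [], []
    for symbol in symbols:
        ensembl_id = symbol_to_id.get(symbol)
        if ensembl_id is None:
            unmapped.append(symbol)  # no match: report it instead of inventing an id
        else:
            mapped.append(ensembl_id)
    return mapped, unmapped


# Toy lookup table for illustration; a real table would be far larger.
lookup = {"TP53": "ENSG00000141510", "GAPDH": "ENSG00000111640"}
ids, missing = symbols_to_ensembl(["TP53", "NOTAGENE", "GAPDH"], lookup)
print(ids)
print(missing)
```

Keeping the unmapped symbols in their own list makes it easy to see how many genes were lost in the conversion before passing the data on.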

HelloWorldLTY commented 7 months ago

Hi, thanks. After converting the data with this method, I hit a new bug in this function:

pipeline.fit(train_data,  # An AnnData object
             pipeline_config,  # The config dictionary we created previously, optional
             split_field = 'split',  # Specify a column in .obs that contains split information
             train_split = 'train',
             valid_split = 'valid',
             batch_gene_list = batch_gene_list,  # Specify genes that are measured in each batch, see previous section for more details
             device = DEVICE,
             )

     43 g2id = dict(zip(self.gene_list, list(range(len(self.gene_list)))))
     44 for batch in batch_gene_list:
---> 45     idx = torch.LongTensor([g2id[g] for g in batch_gene_list[batch]])
     46     self.batch_gene_mask[batch] = torch.zeros(len(g2id)).bool()
     47     self.batch_gene_mask[batch][idx] = True

KeyError: '0'
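
The traceback shows that every name in `batch_gene_list` is looked up in a dictionary built from the model's gene list, so any name outside that list raises a `KeyError`. A minimal sketch of a check that could be run before `pipeline.fit` follows. `find_unknown_genes` is a hypothetical helper, and `gene_list` here is a toy stand-in for the pre-trained gene list, which is not shown in this thread.

```python
from typing import Dict, List


def find_unknown_genes(
    batch_gene_list: Dict[str, List[str]], gene_list: List[str]
) -> Dict[str, List[str]]:
    """Return, per batch, the gene names absent from gene_list (hypothetical helper)."""
    known = set(gene_list)
    unknown = {}
    for batch, genes in batch_gene_list.items():
        bad = [g for g in genes if g not in known]
        if bad:  # only report batches that actually contain unknown names
            unknown[batch] = bad
    return unknown


# Toy stand-ins: the real gene list comes from the pre-trained model.
gene_list = ["ENSG00000137547", "ENSG00000120992", "ENSG00000168300"]
batches = {
    "batch_a": ["ENSG00000137547", "0", "ENSG00000168300", "0-1"],
    "batch_b": ["ENSG00000120992"],
}
report = find_unknown_genes(batches, gene_list)
print(report)
```

An empty report would mean every lookup of the kind shown on line 45 of the traceback can succeed.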

I think the reason is that, after converting the gene names, some strange gene ids appear:

  'ENSG00000137547',
  'ENSG00000120992',
  'ENSG00000187735',
  'ENSG00000047249',
  'ENSG00000023287',
  '0',
  'ENSG00000168300',
  '0-1',

wehos commented 7 months ago

Generally it is the same issue as here. Did you follow the tutorial? The tutorial should automatically remove gene ids that are not in the pre-trained list.

HelloWorldLTY commented 7 months ago

Yes, I followed the tutorial but used my own datasets. The dataset I used is from tangram: https://github.com/broadinstitute/Tangram/blob/master/tutorial_tangram_with_squidpy.ipynb

I will remove all the genes named '0' or '0-<n>' and then try again 🤔
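
One way to do this kind of cleanup, sketched under the assumption that valid names follow the human Ensembl gene pattern (`ENSG` plus 11 digits). The helper name `is_ensembl_gene_id` is made up for illustration. The resulting boolean mask could then be used to subset the columns of an AnnData object, for example `adata[:, np.array(mask)]`.

```python
import re

# Human Ensembl gene ids: the prefix "ENSG" followed by exactly 11 digits.
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG\d{11}")


def is_ensembl_gene_id(name: str) -> bool:
    """True if the whole string looks like a human Ensembl gene id (hypothetical helper)."""
    return ENSEMBL_HUMAN_GENE.fullmatch(name) is not None


names = ["ENSG00000137547", "0", "ENSG00000168300", "0-1"]
mask = [is_ensembl_gene_id(n) for n in names]
kept = [n for n, keep in zip(names, mask) if keep]
print(mask)
print(kept)
```

Note that this only checks the format of each name. A well-formed id can still be missing from the pre-trained gene set.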

wehos commented 7 months ago

Hello, I have updated the code so that it should now work more smoothly. If you previously installed CellPLM with pip, please run pip install -U cellplm to update it. Thanks!