Error whene gene in of geneset is not contained in anndata

junyho486 commented 1 year ago

Hi, thank you for creating spectra, promising concept! I tried it on CD8 pos TILs with the given genesets, but ran into two problems when initializing: All with the provided genesets.

If the gene in one of the genesets is not contained in the anndata, a Key Error is thrown.
When computing th init scores I get float division by zero error Traceback

ZeroDivisionError                         Traceback (most recent call last)
Cell In [21], line 1
----> 1 model = spc.est_spectra(adata = cd8_ad,  gene_set_dictionary = gene_set_dictionary,  use_highly_variable = False, lam = 0.01,cell_type_key = "Celltype Pred",use_cell_types=True,filter_sets=True)
      3 #save trained model
      4 model.save("test_model")

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/spectra.py:1195, in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, **kwargs)
   1193 if use_cell_types:
   1194     new_gs_dict = {}
-> 1195     init_scores = compute_init_scores(gene_set_dictionary,word2id, torch.Tensor(X))
   1196     for ct in gene_set_dictionary.keys():
   1197         new_gs_dict[ct] = {}

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/initialization.py:47, in compute_init_scores(gs_dict, word2id, W)
     45         gs = gs_dict[key][inner_key]
     46         idxs = [word2id[word] for word in gs]
---> 47         coh = mimno_coherence_2011(idxs, W)
     48         init_scores[key][inner_key] = coh.item() 
     49 else: 

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/initialization.py:25, in mimno_coherence_2011(words, W)
     23         score += mimno_coherence_single(words[j1], words[j2], W)
     24 denom = V*(V-1)/2
---> 25 return(score/denom)

ZeroDivisionError: float division by zero

If you could advise me, I would be very happy!

russellkune commented 1 year ago

Thanks for pointing this out. Working on these issues. Issue 2 is caused by a gene set having fewer than 2 genes so if the dictionary is modified to remove those it should run

junyho486 commented 1 year ago

Thank you for the quick response!

I subsetted all genesets with >3 genes contained in the anndata & only global + CD(_T now I receive this error:

ValueError                                Traceback (most recent call last)

Cell In [45], line 1
----> 1 model = spc.est_spectra(adata = cd8_ad,gene_set_dictionary = cd8_clean,  use_highly_variable = False, lam = 0.01,cell_type_key = "Celltype Pred",use_cell_types=True,filter_sets=True)
      3 #save trained model
      4 model.save("cd8_model")

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/spectra.py:1154, in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, **kwargs)
   1152         for key2 in gene_set_dictionary[key]:
   1153             gene_list = gene_set_dictionary[key][key2] 
-> 1154             lst += gene_list
   1155 else:
   1156     for key in gene_set_dictionary:

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/ops/common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     68             return NotImplemented
     70 other = item_from_zerodim(other)
---> 72 return method(self, other)

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/arraylike.py:106, in OpsMixin.__radd__(self, other)
    104 @unpack_zerodim_and_defer("__radd__")
    105 def __radd__(self, other):
--> 106     return self._arith_method(other, roperator.radd)

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/indexes/base.py:7050, in Index._arith_method(self, other, op)
   7040 if (
   7041     isinstance(other, Index)
   7042     and is_object_dtype(other.dtype)
   (...)
   7046     # a chance to implement ops before we unwrap them.
   7047     # See https://github.com/pandas-dev/pandas/issues/31109
   7048     return NotImplemented
-> 7050 return super()._arith_method(other, op)

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/base.py:1325, in IndexOpsMixin._arith_method(self, other, op)
   1322 rvalues = ensure_wrapped_if_datetimelike(rvalues)
   1324 with np.errstate(all="ignore"):
-> 1325     result = ops.arithmetic_op(lvalues, rvalues, op)
   1327 return self._construct_result(result, name=res_name)

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/ops/array_ops.py:226, in arithmetic_op(left, right, op)
    222     _bool_arith_check(op, left, right)
    224     # error: Argument 1 to "_na_arithmetic_op" has incompatible type
    225     # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]"
--> 226     res_values = _na_arithmetic_op(left, right, op)  # type: ignore[arg-type]
    228 return res_values

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/ops/array_ops.py:165, in _na_arithmetic_op(left, right, op, is_cmp)
    162     func = partial(expressions.evaluate, op)
    164 try:
--> 165     result = func(left, right)
    166 except TypeError:
    167     if not is_cmp and (is_object_dtype(left.dtype) or is_object_dtype(right)):
    168         # For object dtype, fallback to a masked operation (only operating
    169         #  on the non-missing values)
    170         # Don't do this for comparisons, as that will handle complex numbers
    171         #  incorrectly, see GH#32047

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/computation/expressions.py:241, in evaluate(op, a, b, use_numexpr)
    238 if op_str is not None:
    239     if use_numexpr:
    240         # error: "None" not callable
--> 241         return _evaluate(op, op_str, a, b)  # type: ignore[misc]
    242 return _evaluate_standard(op, op_str, a, b)

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/computation/expressions.py:129, in _evaluate_numexpr(op, op_str, a, b)
    126     _store_test_result(result is not None)
    128 if result is None:
--> 129     result = _evaluate_standard(op, op_str, a, b)
    131 return result

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/computation/expressions.py:70, in _evaluate_standard(op, op_str, a, b)
     68 if _TEST_MODE:
     69     _store_test_result(False)
---> 70 return op(a, b)

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/pandas/core/roperator.py:11, in radd(left, right)
     10 def radd(left, right):
---> 11     return right + left

junyho486 commented 1 year ago

Another Bug I encountered is a Value Error following Training:

model = spc.est_spectra(adata = cd8_ad,gene_set_dictionary = cd8_clean, use_highly_variable = False, lam = 0.01,use_cell_types=True,cell_type_key = "Celltype_labels",num_epochs= 1) #1000

ValueError                                Traceback (most recent call last)
Cell In [13], line 1
----> 1 model = spc.est_spectra(adata = cd8_ad,gene_set_dictionary = cd8_clean,  use_highly_variable = False, lam = 0.01,use_cell_types=True,cell_type_key = "Celltype_labels",num_epochs= 1) #1000
      2 #,cell_type_key = "Celltype_labels"
      3 #save trained model
      4 model.save("cd8_model")

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/spectra.py:1223, in est_spectra(adata, gene_set_dictionary, L, use_highly_variable, cell_type_key, use_weights, lam, delta, kappa, rho, use_cell_types, n_top_vals, filter_sets, **kwargs)
   1219 spectra = SPECTRA_Model(X = X, labels = labels,  L = L, vocab = vocab, gs_dict = gene_set_dictionary, use_weights = use_weights, lam = lam, delta=delta,kappa = kappa, rho = rho, use_cell_types = use_cell_types)
   1221 spectra.initialize(gene_set_dictionary, word2id, X, init_scores)
-> 1223 spectra.train(X = X, labels = labels,**kwargs)
   1225 adata.uns["SPECTRA_factors"] = spectra.factors
   1226 adata.obsm["SPECTRA_cell_scores"] = spectra.cell_scores

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/spectra.py:582, in SPECTRA_Model.train(self, X, labels, lr_schedule, num_epochs, verbose)
    579 #add all model parameters as attributes 
    581 if self.use_cell_types:
--> 582     self.__store_parameters(labels)
    583 else:
    584     self.__store_parameters_no_celltypes()

File ~/anaconda3/envs/SCANPY_ENV/lib/python3.9/site-packages/spectra/spectra.py:655, in SPECTRA_Model.__store_parameters(self, labels)
    653 scaled = np.array(lst)
    654 new_factors = scaled/(scaled.sum(axis = 0,keepdims =True) + 1.0)
--> 655 cell_scores = out*scaled.mean(axis = 1).reshape(1,-1) 
    656 self.cell_scores = cell_scores
    657 self.factors = new_factors

ValueError: operands could not be broadcast together with shapes (8468,10) (1,8)

wallet-maker commented 1 year ago

Hi,

thank you for using Spectra and please excuse the delayed response. We realized that our tutorial was very minimal and we have expanded it substantially: https://github.com/dpeerlab/spectra/blob/main/notebooks/example_notebook.ipynb

Perhaps you can try it out and it might help you resolve your issue . I think what might be happening is that your gene set annotation dictionary has the wrong format.

It is important that the cell types in your adata and in the gene set annotation dictionary are identical. Hence, for cell types where you don't have any gene sets you must add an empty dictionary under that cell type in the gene set annotation dictionary. Also, like Russell mentioned above, all gene sets should have >2 genes.

Please let us know if that helps.

Thank you, Thomas

junyho486 commented 1 year ago

Hi Thomas,

Thank you for your response and the valuable information you provided. It has helped me a great deal in understanding the concept and everything is now functioning as expected.

Regarding my next question, I have integrated single-cell datasets and I am curious to know how your algorithm will perform on them. Can it still be utilized effectively in this context?

Thank you again for your time and assistance.

Best regards,

wallet-maker commented 1 year ago

Hi,

that is an excellent question. We haven't tried this on integrated data yet because we find it is often not necessary for non-neoplastic cells (immune, stroma etc). We have tried different normalizations (scran and median library size normalization) which worked well.

My best guess is that Spectra will work on integrated data if the gene expression counts are batch-corrected. Many batch correction/integration methods only give you a batch corrected latent space, which Spectra will not use and leave the gene expression counts untouched. Even if you have strong batch effects I think you can still try to run Spectra. You might get a factor for each batch but you should be able to find some global overarching factors. Another thing you can try especially if you're just looking one cell type, is to use the batch instead of the celltype in Spectra and set all the gene sets as global. This way Spectra will normalize the batches individually and this should facilitate finding overarching programs

dpeerlab / spectra

Error whene gene in of geneset is not contained in anndata #4