bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

ValueError: zero-size array to reduction operation maximum which has no identity when running embed_data #200

Open jmrussell opened 5 months ago

jmrussell commented 5 months ago

Hello,

I am trying to run scg.tasks.embed_data on this dataset: https://storage.googleapis.com/linnarsson-lab-human/human_dev_GRCh38-3.0.0.h5ad from https://github.com/linnarsson-lab/developing-human-brain/

I get to the point where the tqdm progress bar for embedding cells appears, but after 90 or so iterations it fails with this error message:

  scGPT - INFO - match 23336/33538 genes in vocabulary of size 60697.
  /home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/torch/nn/modules/transformer.py:282: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer was not TransformerEncoderLayer
    warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
  Embedding cells:   0%|▏                                                          | 94/26031 [00:14<1:07:04,  6.44it/s]
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/scgpt/tasks/cell_emb.py", line 263, in embed_data
      cell_embeddings = get_batch_cell_embeddings(
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/scgpt/tasks/cell_emb.py", line 122, in get_batch_cell_embeddings
      for data_dict in tqdm(data_loader, desc="Embedding cells"):
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/tqdm/std.py", line 1181, in __iter__
      for obj in iterable:
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
      data = self._next_data()
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
      return self._process_data(data)
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
      data.reraise()
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/torch/_utils.py", line 694, in reraise
      raise exception
  ValueError: Caught ValueError in DataLoader worker process 6.
  Original Traceback (most recent call last):
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
      data = fetcher.fetch(index)
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
      return self.collate_fn(data)
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/scgpt/data_collator.py", line 88, in __call__
      expressions[self.keep_first_n_tokens :] = binning(
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/scgpt/preprocess.py", line 283, in binning
      if row.max() == 0:
    File "/home/jr2396/miniconda3/envs/scgpt-0.2.1/lib/python3.8/site-packages/numpy/core/_methods.py", line 41, in _amax
      return umr_maximum(a, axis, None, out, keepdims, initial, where)
  ValueError: zero-size array to reduction operation maximum which has no identity

I don't think I'm hitting the RAM limit (I have 700GB of RAM) or the GPU RAM limit (I'm on a 40GB A100, and I only see about 8GB of GPU RAM allocated before it crashes).

I have tried subsetting the object down to 10% of its size, and it still fails. I am able to run this successfully on data that we've generated in-house. Any guidance would be appreciated.
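
For reference, the last frame of the traceback can be reproduced with NumPy alone, which suggests the failure is about the shape of the data rather than memory. The sketch below assumes that the row reaching `binning` is empty for some cell; `empty_row` is a hypothetical stand-in, not a value taken from scGPT.

```python
import numpy as np

# Hypothetical stand-in for the expression values of one cell that ends up
# with no entries at all by the time `row.max()` is called.
empty_row = np.array([], dtype=np.float32)

try:
    empty_row.max()
    message = None
except ValueError as err:
    # NumPy cannot take the maximum of a zero-size array without an initial value.
    message = str(err)

print(message)
```

If that assumption holds, the thing to look for is cells whose expression vector is empty after gene matching.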

HelloWorldLTY commented 4 months ago

Same error here, when trying to query the cell embeddings for a group of cells.

rpeys commented 4 months ago

Same error! At first I thought that I included a cell or gene with 0 expression, but I did not. [EDIT: turns out cells with 0 expression indeed were the problem]. My python==3.10.11, torch==2.0.1, numpy==1.24.4. Breaks on one dataset, but not another.

my code:

hlca_adata_hvg = scg.tasks.embed_data(
    hlca_adata_hvg,
    model_dir,
    gene_col="feature_name",
    batch_size=64,
    return_new_adata=False,
)

output:

scGPT - INFO - match 2707/3000 genes in vocabulary of size 60697.
Embedding cells:   2%|█▎                                                                             | 615/35664 [00:32<30:25, 19.20it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 hlca_adata_hvg = scg.tasks.embed_data(
      2     hlca_adata_hvg,
      3     model_dir,
      4     gene_col="feature_name",
      5     batch_size=64,
      6     return_new_adata=False,
      7 )

File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/tasks/cell_emb.py:263, in embed_data(adata_or_file, model_dir, gene_col, max_length, batch_size, obs_to_save, device, use_fast_transformer, return_new_adata)
    260 model.eval()
    262 # get cell embeddings
--> 263 cell_embeddings = get_batch_cell_embeddings(
    264     adata,
    265     cell_embedding_mode="cls",
    266     model=model,
    267     vocab=vocab,
    268     max_length=max_length,
    269     batch_size=batch_size,
    270     model_configs=model_configs,
    271     gene_ids=gene_ids,
    272     use_batch_labels=False,
    273 )
    275 if return_new_adata:
    276     obs_df = adata.obs[obs_to_save] if obs_to_save is not None else None

File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/tasks/cell_emb.py:122, in get_batch_cell_embeddings(adata, cell_embedding_mode, model, vocab, max_length, batch_size, model_configs, gene_ids, use_batch_labels)
    120 with torch.no_grad(), torch.cuda.amp.autocast(enabled=True):
    121     count = 0
--> 122     for data_dict in tqdm(data_loader, desc="Embedding cells"):
    123         input_gene_ids = data_dict["gene"].to(device)
    124         src_key_padding_mask = input_gene_ids.eq(
    125             vocab[model_configs["pad_token"]]
    126         )

File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/tqdm/std.py:1181, in tqdm.__iter__(self)
   1178 time = self._time
   1180 try:
-> 1181     for obj in iterable:
   1182         yield obj
   1183         # Update and possibly print the progressbar.
   1184         # Note: does not call self.update(1) for speed optimisation.

File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:633, in _BaseDataLoaderIter.__next__(self)
    630 if self._sampler_iter is None:
    631     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    632     self._reset()  # type: ignore[call-arg]
--> 633 data = self._next_data()
    634 self._num_yielded += 1
    635 if self._dataset_kind == _DatasetKind.Iterable and \
    636         self._IterableDataset_len_called is not None and \
    637         self._num_yielded > self._IterableDataset_len_called:

File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1345, in _MultiProcessingDataLoaderIter._next_data(self)
   1343 else:
   1344     del self._task_info[idx]
-> 1345     return self._process_data(data)

File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1371, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1369 self._try_put_index()
   1370 if isinstance(data, ExceptionWrapper):
-> 1371     data.reraise()
   1372 return data

File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/_utils.py:644, in ExceptionWrapper.reraise(self)
    640 except TypeError:
    641     # If the exception takes multiple arguments, don't try to
    642     # instantiate since we don't know how to
    643     raise RuntimeError(msg) from None
--> 644 raise exception

ValueError: Caught ValueError in DataLoader worker process 39.
Original Traceback (most recent call last):
  File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/data_collator.py", line 88, in __call__
    expressions[self.keep_first_n_tokens :] = binning(
  File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/preprocess.py", line 283, in binning
    if row.max() == 0:
  File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/numpy/core/_methods.py", line 41, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

rpeys commented 4 months ago

Actually, I solved it in my case -- the issue was that I was subsetting my anndata object to highly variable genes, and once I did that, there were some cells that had 0 expression for all remaining genes. Once I fixed that, the embed function worked!
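
A minimal sketch of that check, assuming a dense cells-by-genes NumPy matrix; `nonempty_cell_mask` is a hypothetical helper, not part of scGPT. With a real AnnData object, `adata.X` is often sparse and would need an equivalent per-row non-zero count, and `scanpy.pp.filter_cells(adata, min_genes=1)` should achieve a similar result.

```python
import numpy as np

def nonempty_cell_mask(X):
    """Return a boolean mask that is True for cells (rows) with any non-zero value."""
    X = np.asarray(X)
    return (X != 0).sum(axis=1) > 0

# Toy cells-by-genes matrix: the second cell is all zeros, as can happen
# after subsetting to highly variable genes.
X = np.array([[0, 3, 1],
              [0, 0, 0],
              [2, 0, 0]])

mask = nonempty_cell_mask(X)
X_kept = X[mask]  # only the cells that still have some expression

print(mask, X_kept.shape)
```

Run the check after the gene subsetting step, since cells that were non-empty in the full matrix can become empty once genes are dropped.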

HelloWorldLTY commented 4 months ago

> Actually, I solved it in my case -- the issue was that I was subsetting my anndata object to highly variable genes, and once I did that, there were some cells that had 0 expression for all remaining genes. Once I fixed that, the embed function worked!

Thanks for sharing. But that does not fit my case, because we think cells with zero expression can also be meaningful, especially in tasks like imputation (these cells' expressed genes may simply not have been detected). I can generate embeddings of my cells with either SCimilarity or CellPLM.
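
One possible way to keep such cells in the dataset while avoiding the crash, sketched under the assumption that only the non-empty cells are passed to the embedding step: write the resulting embeddings back to their original row positions and leave placeholder rows for the empty cells. `scatter_embeddings` is a hypothetical helper and the embedding values are made up.

```python
import numpy as np

def scatter_embeddings(embeddings, mask, fill_value=np.nan):
    """Place embeddings of kept cells back at their original row positions."""
    mask = np.asarray(mask, dtype=bool)
    embeddings = np.asarray(embeddings, dtype=float)
    if embeddings.shape[0] != int(mask.sum()):
        raise ValueError("one embedding row is required per kept cell")
    # Rows for empty cells keep the fill value so downstream code can spot them.
    full = np.full((mask.shape[0], embeddings.shape[1]), fill_value, dtype=float)
    full[mask] = embeddings
    return full

mask = np.array([True, False, True])      # the second cell had zero expression
emb = np.array([[0.1, 0.2], [0.3, 0.4]])  # made-up embeddings for the two kept cells
full = scatter_embeddings(emb, mask)

print(full.shape)
```

This keeps row alignment with the original object, but it does not produce a meaningful embedding for the empty cells; whether a placeholder is acceptable depends on the downstream task.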