Open jmrussell opened 5 months ago
Same error here when trying to query the cell embeddings for a group of cells.
Same error! At first I thought I had included a cell or gene with 0 expression, but I had not. [EDIT: it turns out cells with 0 expression were indeed the problem.] My environment: python==3.10.11, torch==2.0.1, numpy==1.24.4. It breaks on one dataset but not on another.
my code:
hlca_adata_hvg = scg.tasks.embed_data(
    hlca_adata_hvg,
    model_dir,
    gene_col="feature_name",
    batch_size=64,
    return_new_adata=False,
)
output:
scGPT - INFO - match 2707/3000 genes in vocabulary of size 60697.
Embedding cells: 2%|█▎ | 615/35664 [00:32<30:25, 19.20it/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[10], line 1
----> 1 hlca_adata_hvg = scg.tasks.embed_data(
2 hlca_adata_hvg,
3 model_dir,
4 gene_col="feature_name",
5 batch_size=64,
6 return_new_adata=False,
7 )
File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/tasks/cell_emb.py:263, in embed_data(adata_or_file, model_dir, gene_col, max_length, batch_size, obs_to_save, device, use_fast_transformer, return_new_adata)
260 model.eval()
262 # get cell embeddings
--> 263 cell_embeddings = get_batch_cell_embeddings(
264 adata,
265 cell_embedding_mode="cls",
266 model=model,
267 vocab=vocab,
268 max_length=max_length,
269 batch_size=batch_size,
270 model_configs=model_configs,
271 gene_ids=gene_ids,
272 use_batch_labels=False,
273 )
275 if return_new_adata:
276 obs_df = adata.obs[obs_to_save] if obs_to_save is not None else None
File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/tasks/cell_emb.py:122, in get_batch_cell_embeddings(adata, cell_embedding_mode, model, vocab, max_length, batch_size, model_configs, gene_ids, use_batch_labels)
120 with torch.no_grad(), torch.cuda.amp.autocast(enabled=True):
121 count = 0
--> 122 for data_dict in tqdm(data_loader, desc="Embedding cells"):
123 input_gene_ids = data_dict["gene"].to(device)
124 src_key_padding_mask = input_gene_ids.eq(
125 vocab[model_configs["pad_token"]]
126 )
File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/tqdm/std.py:1181, in tqdm.__iter__(self)
1178 time = self._time
1180 try:
-> 1181 for obj in iterable:
1182 yield obj
1183 # Update and possibly print the progressbar.
1184 # Note: does not call self.update(1) for speed optimisation.
File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:633, in _BaseDataLoaderIter.__next__(self)
630 if self._sampler_iter is None:
631 # TODO(https://github.com/pytorch/pytorch/issues/76750)
632 self._reset() # type: ignore[call-arg]
--> 633 data = self._next_data()
634 self._num_yielded += 1
635 if self._dataset_kind == _DatasetKind.Iterable and \
636 self._IterableDataset_len_called is not None and \
637 self._num_yielded > self._IterableDataset_len_called:
File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1345, in _MultiProcessingDataLoaderIter._next_data(self)
1343 else:
1344 del self._task_info[idx]
-> 1345 return self._process_data(data)
File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1371, in _MultiProcessingDataLoaderIter._process_data(self, data)
1369 self._try_put_index()
1370 if isinstance(data, ExceptionWrapper):
-> 1371 data.reraise()
1372 return data
File /opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/_utils.py:644, in ExceptionWrapper.reraise(self)
640 except TypeError:
641 # If the exception takes multiple arguments, don't try to
642 # instantiate since we don't know how to
643 raise RuntimeError(msg) from None
--> 644 raise exception
ValueError: Caught ValueError in DataLoader worker process 39.
Original Traceback (most recent call last):
File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/data_collator.py", line 88, in __call__
expressions[self.keep_first_n_tokens :] = binning(
File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/scgpt/preprocess.py", line 283, in binning
if row.max() == 0:
File "/opt/conda/rpeyser/envs/scgpt_4/lib/python3.10/site-packages/numpy/core/_methods.py", line 41, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
Actually, I solved it in my case -- the issue was that I was subsetting my anndata object to highly variable genes, and once I did that, there were some cells that had 0 expression for all remaining genes. Once I fixed that, the embed function worked!
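For anyone hitting the same thing, here is a minimal sketch of how the all-zero cells could be found and dropped before calling embed_data. The helper name `nonempty_cell_mask` is made up for illustration and is not part of scGPT; it only assumes the expression matrix is a dense NumPy array or a SciPy sparse matrix, as `adata.X` usually is.

```python
import numpy as np


def nonempty_cell_mask(X):
    """Return a boolean mask that is True for rows (cells) with at least one nonzero value.

    Hypothetical helper, not part of scGPT. Accepts a dense NumPy array
    or a SciPy sparse matrix with cells as rows and genes as columns.
    """
    # Count nonzero entries per row; np.asarray(...).ravel() flattens the
    # (n, 1) matrix that sparse inputs return from sum(axis=1).
    nonzero_per_cell = np.asarray((X != 0).sum(axis=1)).ravel()
    return nonzero_per_cell > 0


# Toy matrix: 3 cells x 3 genes, the second cell has no expression at all.
X = np.array([
    [0, 3, 1],
    [0, 0, 0],
    [2, 0, 0],
])
mask = nonempty_cell_mask(X)
n_empty = int((~mask).sum())
print(f"{n_empty} cell(s) with zero expression across all genes")

# Assumed usage on an AnnData object, run after subsetting to highly variable genes:
# adata = adata[nonempty_cell_mask(adata.X)].copy()
```

Running the check after the gene subsetting step matters, because a cell can have counts in the full matrix and still be empty once only the highly variable genes remain.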
Thanks for sharing. It does not fit my case, though, because we think cells with zero expression can also be meaningful, especially in tasks like imputation (the genes these cells express may simply not have been detected). I can generate embeddings for my cells with either SCimilarity or CellPLM.
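If the zero-expression cells have to stay in the dataset, one possible workaround (a sketch only, not something scGPT provides) is to embed just the non-empty cells and write the results back into an array aligned with the original cell order, leaving NaN rows for the empty cells. Both `embed_keeping_empty_cells` and `toy_embed` below are hypothetical names; `toy_embed` merely stands in for whatever embedding call is actually used, and the input is assumed to be a dense array.

```python
import numpy as np


def embed_keeping_empty_cells(X, embed_fn):
    """Embed only non-empty rows of X and return embeddings aligned with all rows.

    Hypothetical helper. Rows of X that are entirely zero get NaN embeddings,
    so the output keeps one row per original cell. Assumes X is dense.
    """
    X = np.asarray(X)
    mask = (X != 0).any(axis=1)  # True for cells with at least one nonzero gene
    emb_nonempty = np.asarray(embed_fn(X[mask]), dtype=float)
    out = np.full((X.shape[0], emb_nonempty.shape[1]), np.nan)
    out[mask] = emb_nonempty  # scatter results back to their original positions
    return out, mask


def toy_embed(X):
    """Stand-in for a real embedding model: returns 2 features per cell."""
    return np.stack([X.sum(axis=1), X.max(axis=1)], axis=1)


X = np.array([
    [0, 3, 1],
    [0, 0, 0],  # empty cell, kept in the output as a NaN row
    [2, 0, 0],
])
embeddings, mask = embed_keeping_empty_cells(X, toy_embed)
print(embeddings)
```

Downstream code then has to decide how to treat the NaN rows, for example by imputing them or by excluding them from neighbor graphs.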
Hello,
I am trying to run scg.tasks.embed_data on this dataset: https://storage.googleapis.com/linnarsson-lab-human/human_dev_GRCh38-3.0.0.h5ad from https://github.com/linnarsson-lab/developing-human-brain/
I get to the point where the tqdm progress bar for embedding cells shows up, but after 90 or so it fails with this error message.
I don't think I'm hitting the RAM limit (I have 700GB of RAM) or the GPU RAM limit (I'm on a 40GB A100, and I only see about 8GB of GPU RAM allocated before it crashes).
I have tried subsetting the object down to 10% of its size, and it still fails. I am able to run this successfully on data that we've generated in house. Any guidance would be appreciated.