CSOgroup / cellcharter

A Python package for the identification, characterization and comparison of spatial clusters from spatial -omics data.
https://cellcharter.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

failed to run cc.tl.ClusterAutoK and cc.tl.Cluster #32

Closed zhengrongbin closed 1 month ago

zhengrongbin commented 3 months ago

Report

Hey there, thank you for developing CellCharter, which is a very useful tool. I'd like to try it on my own MERFISH sample. However, I consistently get an error pointing to a problem with the SLURM cluster node. Do you have any idea how to fix it?

The detailed error report is shown below:


RuntimeError                              Traceback (most recent call last)
Cell In[4], line 13
      1 # cc.gr.remove_long_links(adata1)
      2 # cc.gr.aggregate_neighbors(adata1, n_layers=3, use_rep='X_umap', out_key='X_cellcharter', sample_key='sample')
      3 autok = cc.tl.ClusterAutoK(
      4     n_clusters=(2,10),
      5     max_runs = 10,
   (...)
     10     )
     11 )
---> 13 autok.fit(adata1, use_rep='X_cellcharter')
     15 # gmm = cc.tl.Cluster(
     16 #     n_clusters=15,
     17 #     random_state=12345,
   (...)
     21 # gmm.fit(adata1, use_rep='X_umap')
     22 # adata1.obs['spatial_cluster'] = gmm.predict(adata1, use_rep='X_cellcharter')

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/cellcharter/tl/_autok.py:109, in ClusterAutoK.fit(self, adata, use_rep)
    107 for k in tqdm(self.n_clusters, disable=(len(self.n_clusters) == 1)):
    108     clustering = self.model_class(n_clusters=k, random_state=i + random_state, **self.model_params)
--> 109     clustering.fit(X)
    110     new_labels[k] = clustering.predict(X)
    112     if (k not in self.bestmodels.keys()) or (clustering.nll < self.bestmodels[k].nll):

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/cellcharter/tl/_gmm.py:115, in GaussianMixture.fit(self, data)
    113 kmeans.fit(data)
    114 self.init_means = torch.tensor(kmeans.clustercenters).float()
--> 115 return self._fit(data)

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/cellcharter/tl/_gmm.py:119, in GaussianMixture._fit(self, data)
    117 def _fit(self, data) -> GaussianMixture:
    118     try:
--> 119         return super().fit(data)
    120     except torch._C._LinAlgError:
    121         self.covariance_regularization *= 10

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/pycave/bayes/gmm/estimator.py:142, in GaussianMixture.fit(self, data)
    136 # Setup the data loading
    137 loader = DataLoader(
    138     dataset_from_tensors(data),
    139     batch_size=self.batch_size or len(data),
    140     collate_fn=collate_tensor,
    141 )
--> 142 is_batch_training = self._num_batches_per_epoch(loader) == 1
    144 # Run k-means if required or copy means
    145 if self.init_means is not None:

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/lightkit/estimator/base.py:347, in BaseEstimator._num_batches_per_epoch(self, loader)
    340 def _num_batches_per_epoch(self, loader: DataLoader[Any]) -> int:
    341     """Returns the number of batches that are run for the given data loader
    342     across all processes when using the trainer provided by the
    343     :meth:trainer method. If n processes run.
    344
    345     k batches each, this method returns k * n.
    346     """
--> 347     trainer = self.trainer()
    348     num_batches = len(loader)  # type: ignore
    349     kwargs = trainer.distributed_sampler_kwargs

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/lightkit/estimator/base.py:110, in BaseEstimator.trainer(self, **kwargs)
     93 def trainer(self, **kwargs: Any) -> pl.Trainer:
     94     """
     95     Returns the trainer as configured by the estimator. Typically, this method is only called
     96     by functions in the estimator.
   (...)
    108     introduced in the future.
    109     """
--> 110     return pl.Trainer(**{**self.trainer_params, **kwargs})

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py:348, in _defaults_from_env_vars.<locals>.insert_env_defaults(self, *args, **kwargs)
    345 kwargs = dict(list(env_variables.items()) + list(kwargs.items()))
    347 # all args were already moved to kwargs
--> 348 return fn(self, **kwargs)

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:420, in Trainer.__init__(self, logger, enable_checkpointing, callbacks, default_root_dir, gradient_clip_val, gradient_clip_algorithm, num_nodes, num_processes, devices, gpus, auto_select_gpus, tpu_cores, ipus, enable_progress_bar, overfit_batches, track_grad_norm, check_val_every_n_epoch, fast_dev_run, accumulate_grad_batches, max_epochs, min_epochs, max_steps, min_steps, max_time, limit_train_batches, limit_val_batches, limit_test_batches, limit_predict_batches, val_check_interval, log_every_n_steps, accelerator, strategy, sync_batchnorm, precision, enable_model_summary, num_sanity_val_steps, resume_from_checkpoint, profiler, benchmark, deterministic, reload_dataloaders_every_n_epochs, auto_lr_find, replace_sampler_ddp, detect_anomaly, auto_scale_batch_size, plugins, amp_backend, amp_level, move_metrics_to_cpu, multiple_trainloader_mode, inference_mode)
    417 # init connectors
    418 self._data_connector = DataConnector(self, multiple_trainloader_mode)
--> 420 self._accelerator_connector = AcceleratorConnector(
    421     num_processes=num_processes,
    422     devices=devices,
    423     tpu_cores=tpu_cores,
    424     ipus=ipus,
    425     accelerator=accelerator,
    426     strategy=strategy,
    427     gpus=gpus,
    428     num_nodes=num_nodes,
    429     sync_batchnorm=sync_batchnorm,
    430     benchmark=benchmark,
    431     replace_sampler_ddp=replace_sampler_ddp,
    432     deterministic=deterministic,
    433     auto_select_gpus=auto_select_gpus,
    434     precision=precision,
    435     amp_type=amp_backend,
    436     amp_level=amp_level,
    437     plugins=plugins,
    438 )
    439 self._logger_connector = LoggerConnector(self)
    440 self._callback_connector = CallbackConnector(self)

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:207, in AcceleratorConnector.__init__(self, devices, num_nodes, accelerator, strategy, plugins, precision, amp_type, amp_level, sync_batchnorm, benchmark, replace_sampler_ddp, deterministic, auto_select_gpus, num_processes, tpu_cores, ipus, gpus)
    204 self._set_parallel_devices_and_init_accelerator()
    206 # 3. Instantiate ClusterEnvironment
--> 207 self.cluster_environment: ClusterEnvironment = self._choose_and_init_cluster_environment()
    209 # 4. Instantiate Strategy - Part 1
    210 if self._strategy_flag in (None, "auto"):

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:612, in AcceleratorConnector._choose_and_init_cluster_environment(self)
    604 for env_type in (
    605     SLURMEnvironment,
    606     BaguaEnvironment,
   (...)
    609     LSFEnvironment,
    610 ):
    611     if env_type.detect():
--> 612         return env_type()
    613 return LightningEnvironment()

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:48, in SLURMEnvironment.__init__(self, auto_requeue, requeue_signal)
     46 self.requeue_signal = requeue_signal
     47 self._validate_srun_used()
---> 48 self._validate_srun_variables()

File /lab-share/Cardio-Chen-e2/Public/rongbinzheng/anaconda3/envs/spatial2/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:181, in SLURMEnvironment._validate_srun_variables()
    179 ntasks = int(os.environ.get("SLURM_NTASKS", "1"))
    180 if ntasks > 1 and "SLURM_NTASKS_PER_NODE" not in os.environ:
--> 181     raise RuntimeError(
    182         f"You set --ntasks={ntasks} in your SLURM bash script, but this variable is not supported."
    183         f" HINT: Use --ntasks-per-node={ntasks} instead."
    184     )

RuntimeError: You set --ntasks=4 in your SLURM bash script, but this variable is not supported. HINT: Use --ntasks-per-node=4 instead.

Version information

No response

marcovarrone commented 3 months ago

Hi @zhengrongbin, could you please share the line in the SLURM script that you use to run the python script? If you are running srun python ... it may be that you have to remove the srun before despite what lightning says :)

zhengrongbin commented 3 months ago

Thank you for your response. I have figured it out. I was running it in a Jupyter notebook on a node requested with srun, and I also ran it from an sbatch file with --ntasks set. Both gave me this error. I worked around it either by running the CellCharter Python script directly with srun python, or by keeping the sbatch file but removing --ntasks.
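For anyone hitting the same error from a notebook, here is a minimal sketch that mirrors the check visible at the end of the traceback (SLURM_NTASKS greater than 1 without SLURM_NTASKS_PER_NODE). The helper names slurm_ntasks_conflict and clear_slurm_ntasks are hypothetical and are not part of CellCharter or Lightning. Unsetting the variable inside the Python process is an untested assumption meant to approximate "removing --ntasks"; it was not suggested in this thread.

```python
import os


def slurm_ntasks_conflict(env=None):
    """Return True when the environment would trip the check shown in the traceback.

    Hypothetical helper: it re-implements the condition from
    SLURMEnvironment._validate_srun_variables as quoted above.
    """
    env = os.environ if env is None else env
    ntasks = int(env.get("SLURM_NTASKS", "1"))
    return ntasks > 1 and "SLURM_NTASKS_PER_NODE" not in env


def clear_slurm_ntasks(env=None):
    """Remove SLURM_NTASKS from the mapping and return its previous value (or None).

    Assumption: dropping the variable in-process has the same effect as
    omitting --ntasks from the sbatch script.
    """
    env = os.environ if env is None else env
    return env.pop("SLURM_NTASKS", None)


if __name__ == "__main__":
    if slurm_ntasks_conflict():
        print("SLURM_NTASKS > 1 without SLURM_NTASKS_PER_NODE; the fit is likely to fail.")
    else:
        print("No SLURM_NTASKS conflict detected.")
```

In a notebook, the check would be run before autok.fit(...). If it reports a conflict, the safer route is the one described above: resubmit without --ntasks.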

zhengrongbin commented 3 months ago

Hi, sorry, I have a new question. Following the tutorial at https://cellcharter.readthedocs.io/en/latest/notebooks/codex_mouse_spleen.html, I'd like to perform differential neighborhood analysis with diff_nhood_enrichment, but I get an error saying that this function does not exist. My cellcharter version is 0.1.2.

marcovarrone commented 2 months ago

diff_nhood_enrichment was implemented in 0.2.0; I encourage you to update to that version.
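As a rough way to confirm which version is installed before following the tutorial, the sketch below compares the installed cellcharter version against 0.2.0 using only the standard library. The helper names are hypothetical, and the version parsing is deliberately simple (it reads only the leading digits of the first three components), so treat it as an illustration rather than a robust version comparison.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a string such as '0.1.2' into a tuple such as (0, 1, 2).

    Simplified on purpose: non-numeric suffixes (e.g. 'rc1') are ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="0.2.0"):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package="cellcharter"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("cellcharter is not installed in this environment.")
    elif meets_minimum(version):
        print(f"cellcharter {version} is at least 0.2.0.")
    else:
        print(f"cellcharter {version} is older than 0.2.0; consider upgrading.")
```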