Closed gromajus closed 1 year ago
Hi, thanks for your report! This seems to be a bug indeed. Unfortunately I cannot try to reproduce it right now, but my first guess is that it has to do with the fact that you're using 8 GPUs. The code has only been tested on a single GPU and might require some changes in order to work on multiple GPUs (see e.g. https://lightning.ai/docs/pytorch/1.9.4/common/lightning_module.html#validating-with-dataparallel).
Could you try to rerun your code on 1 GPU (for instance by setting CUDA_VISIBLE_DEVICES=0)?
Hi Felix! Indeed that was the case, thank you:) When I run the following code:
experiment = Experiment("conll2003_expe4", model="bert-base-cased", dataset="conll2003", max_epochs=1, device="cuda:0")
experiment.run()
The model is trained, I can see the results, I can load the trained model and run inference.
The error I got previously didn't help me to find out what was going on:) Thanks for your work!
Actually, this code used the CPU, not the GPU. To use only one GPU with nerblackbox, I used:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
experiment = Experiment("conll2003_expe4", model="bert-base-cased", dataset="conll2003")
experiment.run()
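A note for anyone reproducing this: CUDA_VISIBLE_DEVICES is read when CUDA is initialized in the process, so it has to be set before that point (in practice, at the top of the script or notebook, before any training code runs). Below is a minimal sketch of that pattern. The helper name `restrict_to_single_gpu` is hypothetical and not part of nerblackbox.

```python
import os


def restrict_to_single_gpu(gpu_index: int = 0) -> None:
    """Hypothetical helper (not part of nerblackbox): expose only one GPU.

    Must run before CUDA is initialized in this process, otherwise the
    setting has no effect on the already-initialized runtime.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)


# Call this first, before creating and running the experiment.
restrict_to_single_gpu(0)
print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

If PyTorch is installed, `torch.cuda.device_count()` can be used afterwards to confirm that only one device is visible.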
That's great! The changes in the linked PR make sure that a meaningful exception is raised if an experiment / training run is initiated on multiple GPUs. Thanks again!
Now, the message is clear:)
Additionally, there is another error, but an import should fix it: from sys import exit
> found 8 GPUs. nerblackbox currently only works on a CPU or a single GPU. Try for instance os.environ['CUDA_VISIBLE_DEVICES'] = '0'.
stopped.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[6], line 2
1 #experiment = Experiment("conll2003_expe3", model="bert-base-cased", dataset="conll2003", max_epochs=1, device="cuda:0")
----> 2 experiment = Experiment("conll2003_expe6", model="bert-base-cased", dataset="conll2003", max_epochs=1)
File /nerblackbox/nerblackbox/api/experiment.py:63, in Experiment.__init__(self, experiment_name, from_config, model, dataset, from_preset, pytest, verbose, **kwargs_optional)
59 self.from_config = from_config
60 self.kwargs, self.hparams = self._parse_arguments(
61 model, dataset, self.from_preset, **kwargs_optional
62 )
---> 63 self._checks()
64 self.results = None
65 print(
66 f"> experiment = {experiment_name} not found, create new experiment."
67 )
File /nerblackbox/nerblackbox/api/experiment.py:351, in Experiment._checks(self)
348 if nr_gpus > 1:
349 msg = f"> found {nr_gpus} GPUs. nerblackbox currently only works on a CPU or a single GPU. " \
350 f"Try for instance os.environ['CUDA_VISIBLE_DEVICES'] = '0'."
--> 351 self._exit_gracefully(msg)
File /nerblackbox/nerblackbox/api/experiment.py:357, in Experiment._exit_gracefully(message)
355 print(message)
356 print("stopped.")
--> 357 exit(0)
NameError: name 'exit' is not defined
Fixed!
This is the output now:
> found 8 GPUs. nerblackbox currently only works on a CPU or a single GPU. Try for instance os.environ['CUDA_VISIBLE_DEVICES'] = '0'.
stopped.
An exception has occurred, use %tb to see the full traceback.
SystemExit: 0
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3516: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
I am not an expert Python developer but for me it seems fine:)
exit should be used in the interpreter, sys.exit in production code (see e.g. https://www.geeksforgeeks.org/python-exit-commands-quit-exit-sys-exit-and-os-_exit/). I think it's ok the way it is now.
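For reference, the difference can be sketched as follows. The bare `exit` name is a convenience added by Python's `site` module for interactive use and is not guaranteed to be available in every environment, which is consistent with the NameError in the traceback above; `sys.exit` is always importable and simply raises SystemExit. The function names below are illustrative only and are not the actual nerblackbox implementation.

```python
import sys


def exit_gracefully(message: str) -> None:
    """Print a message and stop via sys.exit, which raises SystemExit."""
    print(message)
    print("stopped.")
    sys.exit(0)


def check_gpu_count(nr_gpus: int) -> None:
    """Illustrative check: refuse to continue if more than one GPU is visible."""
    if nr_gpus > 1:
        exit_gracefully(
            f"> found {nr_gpus} GPUs. Only a CPU or a single GPU is supported. "
            "Try for instance os.environ['CUDA_VISIBLE_DEVICES'] = '0'."
        )


if __name__ == "__main__":
    check_gpu_count(1)  # passes silently
    check_gpu_count(8)  # prints the message and raises SystemExit(0)
```

In a notebook, SystemExit shows up as the "An exception has occurred" notice seen above rather than killing the kernel.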
System Info
v0.0.15 python3.8 ubuntu20.04: docker image: nvidia/cuda:12.1.1-runtime-ubuntu20.04
🐛 Describe the bug
Hi! I find your tool interesting - I also work on NER. However, I tried to run a very basic example and it seems the code is not working for me:
I see the store folder in there with: datasets, experiment_configs, pretrained_models and results subfolders. Then, the conll data is successfully imported, I see the report. When I run the experiment, I receive the following log and error:
I see that the problem here is with _model_best.epoch_metrics["val"][0][metric], actually I debugged it a bit and _model_best.epoch_metrics["val"] is {}. I have added some checks to avoid similar errors but at the end when I run
experiment.get_result(metric="f1", level="entity", phase="test")
I receive: ATTENTION! no results found.
Am I doing something wrong? This code should work without any problems?
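The kind of check mentioned above can be sketched like this. It is only an illustration of a defensive lookup into a nested dict shaped like epoch_metrics["val"][0][metric]; the function name `safe_metric` and the example metric key are assumptions, not nerblackbox code. Note that such a check only hides the symptom: an empty "val" entry means no validation metrics were recorded in the first place.

```python
from typing import Any, Dict, Optional


def safe_metric(
    epoch_metrics: Dict[str, Dict[int, Dict[str, Any]]],
    phase: str,
    epoch: int,
    metric: str,
) -> Optional[Any]:
    """Return epoch_metrics[phase][epoch][metric], or None if any level is missing."""
    return epoch_metrics.get(phase, {}).get(epoch, {}).get(metric)


# An empty "val" entry, as observed in the report, yields None instead of a KeyError.
print(safe_metric({"val": {}}, "val", 0, "f1"))
```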