flxst / nerblackbox

A High-level Library for Named Entity Recognition in Python.
https://github.com/flxst/nerblackbox
Apache License 2.0

Cannot run experiment #5

Closed · gromajus closed this issue 1 year ago

gromajus commented 1 year ago

System Info

v0.0.15 python3.8 ubuntu20.04: docker image: nvidia/cuda:12.1.1-runtime-ubuntu20.04

🐛 Describe the bug

Hi! I find your tool interesting, as I also work on NER. However, when I tried to run a very basic example, the code did not work for me:

from nerblackbox import AnnotationTool, Store, Dataset, Experiment, Model

# download and set up the CoNLL-2003 dataset from the Hugging Face hub
dataset = Dataset(name="conll2003", source="HF")
dataset.set_up()

# fine-tune bert-base-cased on conll2003 for one epoch
experiment = Experiment("conll2003_expe", model="bert-base-cased", dataset="conll2003", max_epochs=1)
experiment.run()

I see the store folder with the datasets, experiment_configs, pretrained_models and results subfolders. The CoNLL data is then successfully imported, and I see the report. When I run the experiment, I receive the following log and error:

2023/08/17 12:48:02 INFO mlflow.projects: 'conll2003_expe' does not exist. Creating a new experiment
2023/08/17 12:48:02 INFO mlflow.projects.utils: === Created directory /tmp/tmp180r92zf for downloading remote URIs passed to arguments of type 'path' ===
2023/08/17 12:48:02 INFO mlflow.projects.backend.local: === Running command 'python modules/scripts/script_run_experiment.py \
--experiment_name conll2003_expe \
--from_config 0 \
--run_name '' \
--device gpu \
--fp16 0
' in run with ID '58b11c6582554984a7fad842db2e91e3' === 
Global seed set to 43
INFO ------ >>> NER BLACK BOX VERSION: 0.0.15
INFO ------ - PARAMS -----------------------------------------
INFO ------ > experiment_name: conll2003_expe
INFO ------ > from_config:     False
INFO ------ > run_name_nr:     runA-1
INFO ------ ..
INFO ------ > available GPUs: 8
INFO ------ > device:         cuda
INFO ------ > fp16:           False
INFO ------ ..
INFO ------ > dataset_name:          conll2003
INFO ------ > annotation_scheme:     auto
INFO ------ > prune_ratio_train:     1.0
INFO ------ > prune_ratio_val:       1.0
INFO ------ > prune_ratio_test:      1.0
INFO ------ > train_on_val:          False
INFO ------ > train_on_test:         False
INFO ------ ..
INFO ------ > pretrained_model_name: bert-base-cased
INFO ------ > uncased:               False
INFO ------ ..
INFO ------ > checkpoints:           True
INFO ------ > logging_level:         info
INFO ------ > multiple_runs:         1
INFO ------ > seed:                  43
INFO ------ 
INFO ------ - HPARAMS ----------------------------------------
INFO ------ > batch_size:           16
INFO ------ > max_seq_length:       128
INFO ------ > max_epochs:           1
INFO ------ > early_stopping:       True
INFO ------ > monitor:              val_loss
INFO ------ > min_delta:            0.0
INFO ------ > patience:             0
INFO ------ > mode:                 min
INFO ------ > lr_max:               2e-05
INFO ------ > lr_warmup_epochs:     2
INFO ------ > lr_cooldown_epochs:   7
INFO ------ > lr_cooldown_restarts: True
INFO ------ > lr_schedule:          constant
INFO ------ > lr_num_cycles:        4
INFO ------ 
INFO ------ > read encoding: {}
Downloading (…)okenizer_config.json: 100%|██████████| 29.0/29.0 [00:00<00:00, 3.19kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 570/570 [00:00<00:00, 77.7kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 1.96MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 436k/436k [00:00<00:00, 1.34MB/s]
Downloading model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]
INFO ------ > annotation scheme found: bio
Downloading model.safetensors: 100%|██████████| 436M/436M [00:04<00:00, 107MB/s]  
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
INFO ------ [before preprocessing] train data: 14041 examples
INFO ------ [after  preprocessing] train data: 14042 examples
INFO ------ [before preprocessing] val   data: 3250 examples
INFO ------ [after  preprocessing] val   data: 3254 examples
INFO ------ [before preprocessing] test  data: 3453 examples
INFO ------ [after  preprocessing] test  data: 3455 examples
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
Missing logger folder: /nerblackbox_tests/store/results/tensorboard/conll2003_expe/runA-1
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | BertForTokenClassification | 107 M 
-----------------------------------------------------
107 M     Trainable params
0         Non-trainable params
107 M     Total params
430.906   Total estimated model params size (MB)
Epoch 0:   0%|          | 0/136 [00:00<?, ?it/s]                           
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0:  81%|████████  | 110/136 [00:21<00:05,  5.15it/s, loss=1.76, v_num=0]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/26 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/26 [00:00<?, ?it/s]
Epoch 0:  82%|████████▏ | 111/136 [00:21<00:04,  5.19it/s, loss=1.76, v_num=0]
Epoch 0:  82%|████████▏ | 112/136 [00:21<00:04,  5.23it/s, loss=1.76, v_num=0]
Epoch 0:  83%|████████▎ | 113/136 [00:21<00:04,  5.28it/s, loss=1.76, v_num=0]
Epoch 0:  84%|████████▍ | 114/136 [00:21<00:04,  5.32it/s, loss=1.76, v_num=0]
Epoch 0:  85%|████████▍ | 115/136 [00:21<00:03,  5.37it/s, loss=1.76, v_num=0]
Epoch 0:  85%|████████▌ | 116/136 [00:21<00:03,  5.41it/s, loss=1.76, v_num=0]
Epoch 0:  86%|████████▌ | 117/136 [00:21<00:03,  5.45it/s, loss=1.76, v_num=0]
Epoch 0:  87%|████████▋ | 118/136 [00:21<00:03,  5.49it/s, loss=1.76, v_num=0]
Epoch 0:  88%|████████▊ | 119/136 [00:21<00:03,  5.53it/s, loss=1.76, v_num=0]
Epoch 0:  88%|████████▊ | 120/136 [00:21<00:02,  5.57it/s, loss=1.76, v_num=0]
Epoch 0:  89%|████████▉ | 121/136 [00:21<00:02,  5.61it/s, loss=1.76, v_num=0]
Epoch 0:  90%|████████▉ | 122/136 [00:21<00:02,  5.65it/s, loss=1.76, v_num=0]
Epoch 0:  90%|█████████ | 123/136 [00:21<00:02,  5.69it/s, loss=1.76, v_num=0]
Epoch 0:  91%|█████████ | 124/136 [00:21<00:02,  5.72it/s, loss=1.76, v_num=0]
Epoch 0:  92%|█████████▏| 125/136 [00:21<00:01,  5.76it/s, loss=1.76, v_num=0]
Epoch 0:  93%|█████████▎| 126/136 [00:21<00:01,  5.80it/s, loss=1.76, v_num=0]
Epoch 0:  93%|█████████▎| 127/136 [00:21<00:01,  5.84it/s, loss=1.76, v_num=0]
Epoch 0:  94%|█████████▍| 128/136 [00:21<00:01,  5.88it/s, loss=1.76, v_num=0]
Epoch 0:  95%|█████████▍| 129/136 [00:21<00:01,  5.92it/s, loss=1.76, v_num=0]
Epoch 0:  96%|█████████▌| 130/136 [00:21<00:01,  5.96it/s, loss=1.76, v_num=0]
Epoch 0:  96%|█████████▋| 131/136 [00:21<00:00,  6.00it/s, loss=1.76, v_num=0]
Epoch 0:  97%|█████████▋| 132/136 [00:21<00:00,  6.03it/s, loss=1.76, v_num=0]
Epoch 0:  98%|█████████▊| 133/136 [00:21<00:00,  6.07it/s, loss=1.76, v_num=0]
Epoch 0:  99%|█████████▊| 134/136 [00:21<00:00,  6.11it/s, loss=1.76, v_num=0]
Epoch 0:  99%|█████████▉| 135/136 [00:21<00:00,  6.15it/s, loss=1.76, v_num=0]
Epoch 0: 100%|██████████| 136/136 [00:22<00:00,  6.17it/s, loss=1.76, v_num=0]

[rank: 7] Metric val_loss improved. New best score: 1.641
[rank: 3] Metric val_loss improved. New best score: 1.659
[rank: 4] Metric val_loss improved. New best score: 1.643
[rank: 0] Metric val_loss improved. New best score: 1.647
[rank: 1] Metric val_loss improved. New best score: 1.643
[rank: 6] Metric val_loss improved. New best score: 1.654
[rank: 5] Metric val_loss improved. New best score: 1.648
[rank: 2] Metric val_loss improved. New best score: 1.642
Epoch 0, global step 110: 'val_loss' reached 1.64708 (best 1.64708), saving model to '/nerblackbox_tests/store/results/checkpoints/conll2003_expe/runA-1/epoch=0.ckpt' as top 1
Epoch 0: 100%|██████████| 136/136 [00:22<00:00,  6.11it/s, loss=1.76, v_num=0]
`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 136/136 [00:24<00:00,  5.54it/s, loss=1.76, v_num=0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
INFO ------ ---
LOAD BEST CHECKPOINT /nerblackbox_tests/store/results/checkpoints/conll2003_expe/runA-1/epoch=0.ckpt
FOR TESTING AND DETAILED RESULTS
---
INFO ------ > read encoding: {}
INFO ------ > annotation scheme found: bio
INFO ------ [before preprocessing] train data: 14041 examples
INFO ------ [after  preprocessing] train data: 14042 examples
INFO ------ [before preprocessing] val   data: 3250 examples
INFO ------ [after  preprocessing] val   data: 3254 examples
INFO ------ [before preprocessing] test  data: 3453 examples
INFO ------ [after  preprocessing] test  data: 3455 examples
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Validation DataLoader 0: 100%|██████████| 26/26 [00:01<00:00, 16.25it/s]
INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.65
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.06
INFO ------ entity fil f1 (micro):   0.02
INFO ------ -----------------------
Validation DataLoader 0: 100%|██████████| 26/26 [00:03<00:00,  8.17it/s]
INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.66
INFO ------ token  all acc:          0.77
INFO ------ token  all f1 (micro):   0.77
INFO ------ token  fil f1 (micro):   0.04
INFO ------ entity fil f1 (micro):   0.01
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.64
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.05
INFO ------ entity fil f1 (micro):   0.02
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.64
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.04
INFO ------ entity fil f1 (micro):   0.02
INFO ------ -----------------------
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpo_nmlfc1'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpux4eirff'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpoo57wz9l'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpiiqgyhnc'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp__6r9_hj'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpm8955znq'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpeetr0afe'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpr423e92t'>
  _warnings.warn(warn_message, ResourceWarning)

INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.64
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.06
INFO ------ entity fil f1 (micro):   0.01
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.65
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.04
INFO ------ entity fil f1 (micro):   0.01
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.65
INFO ------ token  all acc:          0.77
INFO ------ token  all f1 (micro):   0.77
INFO ------ token  fil f1 (micro):   0.05
INFO ------ entity fil f1 (micro):   0.01
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 VAL   ----
INFO ------ token  all loss:         1.64
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.05
INFO ------ entity fil f1 (micro):   0.01
INFO ------ -----------------------

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Testing DataLoader 0: 100%|██████████| 27/27 [00:01<00:00, 16.51it/s]
INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.63
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.08
INFO ------ entity fil f1 (micro):   0.03
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.64
INFO ------ token  all acc:          0.77
INFO ------ token  all f1 (micro):   0.77
INFO ------ token  fil f1 (micro):   0.06
INFO ------ entity fil f1 (micro):   0.02
INFO ------ -----------------------
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp32pf9l7r'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp_zwmb2gi'>
  _warnings.warn(warn_message, ResourceWarning)

INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.63
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.06
INFO ------ entity fil f1 (micro):   0.03
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.63
INFO ------ token  all acc:          0.77
INFO ------ token  all f1 (micro):   0.77
INFO ------ token  fil f1 (micro):   0.08
INFO ------ entity fil f1 (micro):   0.03
INFO ------ -----------------------
Testing DataLoader 0: 100%|██████████| 27/27 [00:06<00:00,  3.91it/s]
INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.62
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.09
INFO ------ entity fil f1 (micro):   0.03
INFO ------ -----------------------
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpm12xsxhe'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpdmnm3tg3'>
  _warnings.warn(warn_message, ResourceWarning)

INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.62
INFO ------ token  all acc:          0.78
INFO ------ token  all f1 (micro):   0.78
INFO ------ token  fil f1 (micro):   0.07
INFO ------ entity fil f1 (micro):   0.03
INFO ------ -----------------------

INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.63
INFO ------ token  all acc:          0.77
INFO ------ token  all f1 (micro):   0.77
INFO ------ token  fil f1 (micro):   0.08
INFO ------ entity fil f1 (micro):   0.03
INFO ------ -----------------------
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpuoogmiyq'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpis9ojqyd'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpgsc4tgdh'>
  _warnings.warn(warn_message, ResourceWarning)

INFO ------ 
INFO ------ --- Epoch #0 TEST  ----
INFO ------ token  all loss:         1.65
INFO ------ token  all acc:          0.75
INFO ------ token  all f1 (micro):   0.75
INFO ------ token  fil f1 (micro):   0.05
INFO ------ entity fil f1 (micro):   0.01
INFO ------ -----------------------
/usr/lib/python3.8/tempfile.py:957: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpovlaz6ym'>
  _warnings.warn(warn_message, ResourceWarning)
INFO ------ epoch_best: 0
INFO ------ epochs: 0
Traceback (most recent call last):
  File "modules/scripts/script_run_experiment.py", line 161, in <module>
    main(_params, _log_dirs)
  File "modules/scripts/script_run_experiment.py", line 48, in main
    execute_single_run(params, hparams, log_dirs, experiment=True)
  File "/nerblackbox/nerblackbox/modules/ner_training/single_run.py", line 92, in execute_single_run
    logging_end(
  File "/nerblackbox/nerblackbox/modules/ner_training/single_run.py", line 305, in logging_end
    _model_best.epoch_metrics["val"][0][metric],
KeyError: 0
---------------------------------------------------------------------------
ExecutionException                        Traceback (most recent call last)
Cell In[6], line 1
----> 1 experiment.run()

File /nerblackbox/nerblackbox/api/experiment.py:133, in Experiment.run(self)
    130     assert self.hparams is not None, f"ERROR! self.hparams is None."
    131     self._write_config_file(hparams=self.hparams)
--> 133 mlflow.projects.run(
    134     uri=resource_filename(Requirement.parse("nerblackbox"), "nerblackbox"),
    135     entry_point="run_experiment",
    136     experiment_name=self.experiment_name,
    137     parameters=_parameters,
    138     env_manager="local",
    139 )
    141 experiment_exists, self.results = Store.get_experiment_results_single(
    142     self.experiment_name,
    143     verbose=self.verbose,
    144 )
    145 assert (
    146     experiment_exists
    147 ), f"ERROR! experiment = {self.experiment_name} does not exist."

File /usr/local/lib/python3.8/dist-packages/mlflow/projects/__init__.py:354, in run(uri, entry_point, version, parameters, docker_args, experiment_name, experiment_id, backend, backend_config, storage_dir, synchronous, run_id, run_name, env_manager, build_image, docker_auth)
    337 submitted_run_obj = _run(
    338     uri=uri,
    339     experiment_id=experiment_id,
   (...)
    351     docker_auth=docker_auth,
    352 )
    353 if synchronous:
--> 354     _wait_for(submitted_run_obj)
    355 return submitted_run_obj

File /usr/local/lib/python3.8/dist-packages/mlflow/projects/__init__.py:371, in _wait_for(submitted_run_obj)
    369     else:
    370         _maybe_set_run_terminated(active_run, "FAILED")
--> 371         raise ExecutionException("Run (ID '%s') failed" % run_id)
    372 except KeyboardInterrupt:
    373     _logger.error("=== Run (ID '%s') interrupted, cancelling run ===", run_id)

ExecutionException: Run (ID '58b11c6582554984a7fad842db2e91e3') failed

I see that the problem here is with `_model_best.epoch_metrics["val"][0][metric]`. I debugged it a bit, and `_model_best.epoch_metrics["val"]` is `{}`. I added some checks to avoid similar errors, but in the end, when I run `experiment.get_result(metric="f1", level="entity", phase="test")`, I receive: ATTENTION! no results found.
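
For illustration, a minimal sketch of the failure and the kind of guard I added locally (with a hypothetical metric key, not the exact code from single_run.py):

# hypothetical reconstruction of the failing lookup in logging_end():
# with 8 DDP processes, epoch_metrics["val"] stayed empty, so indexing epoch 0 raises KeyError
epoch_metrics = {"val": {}, "test": {}}  # state observed while debugging
metric = "entity_fil_f1_micro"           # hypothetical metric key

try:
    best_val = epoch_metrics["val"][0][metric]  # raises KeyError: 0
except KeyError:
    best_val = None  # guard to avoid the crash; the results simply stay empty

print(best_val)  # None, which is why get_result() later reports "no results found"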

Am I doing something wrong, or should this code work without any problems?

flxst commented 1 year ago

Hi, thanks for your report! This seems to be a bug indeed. Unfortunately I cannot try to reproduce it right now, but my first guess is that it has to do with the fact that you're using 8 GPUs. The code has only been tested on a single GPU and might require some changes in order to work on multiple GPUs (see e.g. https://lightning.ai/docs/pytorch/1.9.4/common/lightning_module.html#validating-with-dataparallel).

Could you try to rerun your code on 1 GPU (for instance by setting CUDA_VISIBLE_DEVICES=0)?
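
Something along these lines should work (just a sketch; the environment variable has to be set before anything initializes CUDA):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # restrict the process to a single GPU

import torch
print(torch.cuda.device_count())  # should now report 1 instead of 8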

gromajus commented 1 year ago

Hi Felix! Indeed, that was the case, thank you :) When I run the following code:

experiment = Experiment("conll2003_expe4", model="bert-base-cased", dataset="conll2003", max_epochs=1, device="cuda:0") 
experiment.run()

The model is trained, I can see the results, and I can load the trained model and run inference.

The error I got previously didn't help me figure out what was going on :) Thanks for your work!

gromajus commented 1 year ago

Actually, the code in my previous comment used the CPU, not the GPU. To use only one GPU with nerblackbox, I used:

import os

# restrict the run to a single GPU; this must be set before anything initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

experiment = Experiment("conll2003_expe4", model="bert-base-cased", dataset="conll2003")
experiment.run()

flxst commented 1 year ago

That's great! The changes in the linked PR make sure that a meaningful exception is raised if an experiment / training run is initiated on multiple GPUs. Thanks again!

gromajus commented 1 year ago

Now the message is clear :) However, there is another error; adding the import `from sys import exit` should help:

> found 8 GPUs. nerblackbox currently only works on a CPU or a single GPU. Try for instance os.environ['CUDA_VISIBLE_DEVICES'] = '0'.
stopped.
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[6], line 2
      1 #experiment = Experiment("conll2003_expe3", model="bert-base-cased", dataset="conll2003", max_epochs=1, device="cuda:0") 
----> 2 experiment = Experiment("conll2003_expe6", model="bert-base-cased", dataset="conll2003", max_epochs=1) 

File /nerblackbox/nerblackbox/api/experiment.py:63, in Experiment.__init__(self, experiment_name, from_config, model, dataset, from_preset, pytest, verbose, **kwargs_optional)
     59 self.from_config = from_config
     60 self.kwargs, self.hparams = self._parse_arguments(
     61     model, dataset, self.from_preset, **kwargs_optional
     62 )
---> 63 self._checks()
     64 self.results = None
     65 print(
     66     f"> experiment = {experiment_name} not found, create new experiment."
     67 )

File /nerblackbox/nerblackbox/api/experiment.py:351, in Experiment._checks(self)
    348 if nr_gpus > 1:
    349     msg = f"> found {nr_gpus} GPUs. nerblackbox currently only works on a CPU or a single GPU. " \
    350           f"Try for instance os.environ['CUDA_VISIBLE_DEVICES'] = '0'."
--> 351     self._exit_gracefully(msg)

File /nerblackbox/nerblackbox/api/experiment.py:357, in Experiment._exit_gracefully(message)
    355 print(message)
    356 print("stopped.")
--> 357 exit(0)

NameError: name 'exit' is not defined
flxst commented 1 year ago

Fixed!

gromajus commented 1 year ago

This is the output now:

> found 8 GPUs. nerblackbox currently only works on a CPU or a single GPU. Try for instance os.environ['CUDA_VISIBLE_DEVICES'] = '0'.
stopped.
An exception has occurred, use %tb to see the full traceback.

SystemExit: 0

/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3516: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

I am not an expert Python developer, but it seems fine to me :)

flxst commented 1 year ago

`exit` should be used in the interpreter and `sys.exit` in production code (see e.g. https://www.geeksforgeeks.org/python-exit-commands-quit-exit-sys-exit-and-os-_exit/). I think it's OK the way it is now.
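
For reference, a minimal sketch of the distinction (simplified, not necessarily the exact code in nerblackbox):

import sys

def _exit_gracefully(message: str) -> None:
    # in a script or library module, use sys.exit, which raises SystemExit
    print(message)
    print("stopped.")
    sys.exit(0)

# the builtins exit() and quit() are injected by the site module for interactive use
# and are not guaranteed to be available in every environment, which is why relying
# on them in module code can fail with a NameError, as seen above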