Babelscape / rebel

REBEL is a seq2seq model that simplifies Relation Extraction (EMNLP 2021).

KeyError: 'labels' in generate_samples.py; doc_red #39

Closed · l0renor closed this 2 years ago

l0renor commented 2 years ago

Hi, I am trying to train the model on the DocRED dataset in order to test the effects of labeling the entities with an additional special token.

At the moment I am still trying to get the code to run with the original dataset.

In the first epoch, after 56%, I get the KeyError: 'labels' at line 48 of generate_samples.py, in on_train_batch_end: labels = batch.pop("labels")

I checked the dataset for empty labels and found 27 empty arrays in the DocRED data. Deleting those data points didn't solve the problem. I also tested using only the first 50% of the dataset; the error still occurred at 56%.
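
For illustration, the kind of check I mean looks roughly like this (a minimal sketch, assuming the standard DocRED JSON layout where each document carries a top-level "labels" list of relation triples; the path is illustrative):

import json

# Count DocRED documents whose relation annotation list is empty
# (sketch only; assumes the standard DocRED JSON layout and an illustrative path).
with open("data/doc_red/train_annotated.json") as f:
    docs = json.load(f)

empty = [i for i, doc in enumerate(docs) if not doc.get("labels")]
print(f"{len(empty)} of {len(docs)} documents have no relation labels")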

Full console output, with print(batch) added before the error:

(azureml_py38_PT_TF) azureuser@rebelgpu:/mnt/batch/tasks/shared/LS_root/mounts/clusters/rebelgpu/code/Users/leon.lukas/rebel-main/src$ python train.py 
Extension horovod.torch has not been built: /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still available.
Global seed set to 42
Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
[2022-08-29 11:47:49,710][datasets.builder][WARNING] - Using custom data configuration default-3b456a334ae5426f
[2022-08-29 11:47:49,710][datasets.builder][WARNING] - Reusing dataset doc_red (/home/azureuser/.cache/huggingface/datasets/doc_red/default-3b456a334ae5426f/0.0.0/2cc6999b276b6aa2b2af5101b416c33155e5f19e6f0b26864a2312d1aa57b175)
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
[2022-08-29 11:47:50,688][datasets.arrow_dataset][WARNING] - Loading cached processed dataset at /mnt/batch/tasks/shared/LS_root/mounts/clusters/rebelgpu/code/Users/leon.lukas/rebel-main/data/doc_red/train_annotated.jsondocred_typed.cache
[2022-08-29 11:47:51,828][datasets.arrow_dataset][WARNING] - Loading cached processed dataset at /mnt/batch/tasks/shared/LS_root/mounts/clusters/rebelgpu/code/Users/leon.lukas/rebel-main/data/doc_red/dev.jsondocred_typed.cache
wandb: Currently logged in as: llukas (use `wandb login --relogin` to force relogin)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
wandb: wandb version 0.13.2 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.26
wandb: Syncing run bart-large
wandb: ⭐️ View project at https://wandb.ai/llukas/docred_typed
wandb: πŸš€ View run at https://wandb.ai/llukas/docred_typed/runs/3b8sf4s1
wandb: Run data is saved locally in /mnt/batch/tasks/shared/LS_root/mounts/clusters/rebelgpu/code/Users/leon.lukas/rebel-main/src/outputs/2022-08-29/11-47-34/wandb/run-20220829_114753-3b8sf4s1
wandb: Run `wandb offline` to turn off syncing.

  | Name    | Type                         | Params
---------------------------------------------------------
0 | model   | BartForConditionalGeneration | 406 M 
1 | loss_fn | CrossEntropyLoss             | 0     
---------------------------------------------------------
406 M     Trainable params
0         Non-trainable params
406 M     Total params
/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|                                                                                                                                                                                                              | 0/2 [00:00<?, ?it/s]/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/transformers/generation_utils.py:1777: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  next_indices = next_tokens // vocab_size
Validation sanity check: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:11<00:00,  5.61s/it]RE Evaluation in *** STRICT *** mode
processed 16 sentences with 233 relations; found: 0 relations; correct: 0.
        ALL      TP: 0; FP: 0;  FN: 231
                (m avg): precision: 0.00;       recall: 0.00;   f1: 0.00 (micro)
                (M avg): precision: 0.00;       recall: 0.00;   f1: 0.00 (Macro)

        head of government:     TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        country:        TP: 0;  FP: 0;  FN: 64; precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        place of birth:         TP: 0;  FP: 0;  FN: 4;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        place of death:         TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        father:         TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        mother:         TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        spouse:         TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        country of citizenship:         TP: 0;  FP: 0;  FN: 10; precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        continent:      TP: 0;  FP: 0;  FN: 2;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        instance of:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        head of state:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        capital:        TP: 0;  FP: 0;  FN: 3;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        official language:      TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        position held:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        child:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        author:         TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        member of sports team:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        director:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        screenwriter:   TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        educated at:    TP: 0;  FP: 0;  FN: 5;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        composer:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        member of political party:      TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        employer:       TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        founded by:     TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        league:         TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        publisher:      TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        owned by:       TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        located in the administrative territorial entity:       TP: 0;  FP: 0;  FN: 33; precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        genre:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        operator:       TP: 0;  FP: 0;  FN: 3;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        religion:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        contains administrative territorial entity:     TP: 0;  FP: 0;  FN: 27; precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        follows:        TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        followed by:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        headquarters location:  TP: 0;  FP: 0;  FN: 2;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        cast member:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        producer:       TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        award received:         TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        creator:        TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        parent taxon:   TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        ethnic group:   TP: 0;  FP: 0;  FN: 3;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        performer:      TP: 0;  FP: 0;  FN: 6;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        manufacturer:   TP: 0;  FP: 0;  FN: 14; precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        developer:      TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        series:         TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        sister city:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        legislative body:       TP: 0;  FP: 0;  FN: 4;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        basin country:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        located in or next to body of water:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
        military branch:        TP: 0;  FP: 0;  FN: 2;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        record label:   TP: 0;  FP: 0;  FN: 2;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        production company:     TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        location:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        subclass of:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        subsidiary:     TP: 0;  FP: 0;  FN: 2;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        part of:        TP: 0;  FP: 0;  FN: 2;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        original language of work:      TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        platform:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        mouth of the watercourse:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        original network:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        member of:      TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        chairperson:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        country of origin:      TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        has part:       TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        residence:      TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        date of birth:  TP: 0;  FP: 0;  FN: 4;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        date of death:  TP: 0;  FP: 0;  FN: 4;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        inception:      TP: 0;  FP: 0;  FN: 3;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        dissolved, abolished or demolished:     TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        publication date:       TP: 0;  FP: 0;  FN: 4;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        start time:     TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        end time:       TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        point in time:  TP: 0;  FP: 0;  FN: 1;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        conflict:       TP: 0;  FP: 0;  FN: 4;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        characters:     TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        lyrics by:      TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        located on terrain feature:     TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        participant:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        influenced by:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        location of formation:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        parent organization:    TP: 0;  FP: 0;  FN: 3;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        notable work:   TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        separated from:         TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        narrative location:     TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        work location:  TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        applies to jurisdiction:        TP: 0;  FP: 0;  FN: 3;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        product or material produced:   TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        unemployment rate:      TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        territory claimed by:   TP: 0;  FP: 0;  FN: 3;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        participant of:         TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        replaces:       TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        replaced by:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        capital of:     TP: 0;  FP: 0;  FN: 2;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        languages spoken, written or signed:    TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        present in work:        TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
        sibling:        TP: 0;  FP: 0;  FN: 0;  precision: 0.00;        recall: 0.00;   f1: 0.00;       0
Epoch 0:   1%|β–ˆβ–‹                                                                                                                                                                                           | 8/889 [00:02<04:25,  3.32it/s, loss=8.86, v_num=f4s1]/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 0:  56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰                                                                                  | 499/889 [02:39<02:04,  3.12it/s, loss=5.25, v_num=f4s1]------------------------------------------------
{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'), 'input_ids': tensor([[    0, 29161,  2897,  ...,     1,     1,     1],
        [    0,   133,   494,  ...,     1,     1,     1],
        [    0, 47001,   329,  ...,     1,     1,     1],
        [    0,   113,  1890,  ...,   347,     4,     2]], device='cuda:0'), 'decoder_input_ids': tensor([[    0, 50267,  2897,  ...,     1,     1,     1],
        [    0, 50267,   496,  ...,     1,     1,     1],
        [    0, 50267, 18775,  ...,     1,     1,     1],
        [    0, 50267,  1890,  ..., 13034,  1437,     2]], device='cuda:0')}
<bound method BatchEncoding.keys of {'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'), 'input_ids': tensor([[    0, 29161,  2897,  ...,     1,     1,     1],
        [    0,   133,   494,  ...,     1,     1,     1],
        [    0, 47001,   329,  ...,     1,     1,     1],
        [    0,   113,  1890,  ...,   347,     4,     2]], device='cuda:0'), 'decoder_input_ids': tensor([[    0, 50267,  2897,  ...,     1,     1,     1],
        [    0, 50267,   496,  ...,     1,     1,     1],
        [    0, 50267, 18775,  ...,     1,     1,     1],
        [    0, 50267,  1890,  ..., 13034,  1437,     2]], device='cuda:0')}>
Saving latest checkpoint...
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 556, in run_training_epoch
    self.on_train_batch_end(epoch_output, batch_end_outputs, batch, batch_idx, dataloader_idx)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 226, in on_train_batch_end
    self.trainer.call_hook('on_train_batch_end', batch_end_outputs, batch, batch_idx, dataloader_idx)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in call_hook
    trainer_hook(*args, **kwargs)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 147, in on_train_batch_end
    callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rebelgpu/code/Users/leon.lukas/rebel-main/src/generate_samples.py", line 48, in on_train_batch_end
    labels = batch.pop("labels")
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/_collections_abc.py", line 795, in pop
    value = self[key]
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 230, in __getitem__
    return self.data[item]
KeyError: 'labels'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 111, in <module>
    main()
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
    _run_hydra(
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/hydra/core/utils.py", line 127, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 107, in main
    train(conf)
  File "train.py", line 103, in train
    trainer.fit(pl_module, datamodule=pl_data_module)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 592, in train
    self.train_loop.on_train_end()
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 156, in on_train_end
    self.check_checkpoint_callback(should_save=True, is_last=True)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 190, in check_checkpoint_callback
    callback.on_validation_end(self.trainer, model)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 239, in save_checkpoint
    self._validate_monitor_key(trainer)
  File "/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 517, in _validate_monitor_key
    raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val_F1_micro') not found in the returned metrics: ['loss']. HINT: Did you call self.log('val_F1_micro', tensor) in the LightningModule?

wandb: Waiting for W&B process to finish, PID 13914
wandb: Program failed with code 1.  Press ctrl-c to abort syncing.
wandb:                                                                                
wandb: Find user logs for this run at: /mnt/batch/tasks/shared/LS_root/mounts/clusters/rebelgpu/code/Users/leon.lukas/rebel-main/src/outputs/2022-08-29/11-47-34/wandb/run-20220829_114753-3b8sf4s1/logs/debug.log
wandb: Find internal logs for this run at: /mnt/batch/tasks/shared/LS_root/mounts/clusters/rebelgpu/code/Users/leon.lukas/rebel-main/src/outputs/2022-08-29/11-47-34/wandb/run-20220829_114753-3b8sf4s1/logs/debug-internal.log
wandb: Run summary:
wandb:   lr-AdamW/pg1 0.0
wandb:   lr-AdamW/pg2 0.0
wandb:           loss 5.46171
wandb:          epoch 0
wandb:       _runtime 174
wandb:     _timestamp 1661773847
wandb:          _step 49
wandb: Run history:
wandb:   lr-AdamW/pg1 ▁
wandb:   lr-AdamW/pg2 ▁
wandb:           loss ▁
wandb:          epoch ▁
wandb:       _runtime ▁
wandb:     _timestamp ▁
wandb:          _step ▁
wandb: 
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
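
For what it's worth, the crash itself can be sidestepped by guarding the failing line in generate_samples.py (just an illustrative sketch of a workaround, not a proper fix; it only skips sample generation for batches that arrive without a labels key):

# In on_train_batch_end, pop with a default instead of raising KeyError
# (illustrative workaround only; batches that carry no labels are simply skipped).
labels = batch.pop("labels", None)
if labels is None:
    return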
LittlePea13 commented 2 years ago

Same issue as https://github.com/Babelscape/rebel/issues/29

Please check there.

Best, Pere-Lluis.

l0renor commented 2 years ago

Thanks for the quick reply. Unfortunately, this doesn't fix the issue for me. In #29 the problem was resolved by pinning pytorch-lightning to version 1.1.7, but I was already using that version, as specified in the requirements.txt.

Since I ran into the same issue as #28, I changed the requirements.txt. For me, training only starts with the following file:

omegaconf==2.0.6
pytorch-lightning==1.1.7
hydra-core==1.0.6
transformers
neptune-client==0.5.1
psutil==5.8.0
datasets==1.3.0
rouge-score==0.0.4
sacrebleu==1.5.0
wandb==0.10.26
streamlit==0.82.0

Which versions would you recommend, @LittlePea13?

LittlePea13 commented 2 years ago

With the latest commit, the issue with the labels should be fixed. I will update the requirements with a newer version of datasets, but that requires a small change to the dataset files.

l0renor commented 2 years ago

Thank you!