JulianSampels / OntoMatch

A new ontology matcher.
GNU General Public License v3.0

dh-benchmark #21

Open FelixFrizzy opened 16 hours ago

FelixFrizzy commented 16 hours ago

I'm trying to use the DH benchmark from this year's OAEI, but I get the error below. Do you have any idea what is going wrong? I've also included my config.json and configMatcher.json. For testing, I only added two ontologies and one reference alignment to the data/dh folders.

/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
finished imports

computing similarities for defc X pactols:   0%|          | 0/1 [00:00<?, ?it/s]
computing similarities for defc X pactols: 100%|██████████| 1/1 [00:03<00:00,  3.03s/it]
computing similarities for defc X pactols: 100%|██████████| 1/1 [00:03<00:00,  3.03s/it]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/pytorch_lightning/utilities/migration/migration.py:208: You have multiple `ModelCheckpoint` callback states in this checkpoint, but we found state keys that would end up colliding with each other after an upgrade, which means we can't differentiate which of your checkpoint callbacks needs which states. At least one of your `ModelCheckpoint` callbacks will not be able to reload the state.
Lightning automatically upgraded your loaded checkpoint from v0.9.0 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint verbalizer/graph2text/outputs/t5-base_13881/val_avg_bleu=68.1000-step_count=5.ckpt`
/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
We have added 3 tokens
parameters Namespace(logger=True, checkpoint_callback=True, early_stop_callback=False, default_root_dir=None, gradient_clip_val=0, process_position=0, num_nodes=1, num_processes=1, gpus=1, auto_select_gpus=False, log_gpu_memory=None, progress_bar_refresh_rate=1, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=1, max_epochs=100, min_epochs=1, max_steps=None, min_steps=None, limit_train_batches=1.0, limit_val_batches=1.0, limit_test_batches=1.0, val_check_interval=1.0, log_save_interval=100, row_log_interval=50, distributed_backend=None, sync_batchnorm=False, precision=32, weights_summary='top', weights_save_path=None, num_sanity_val_steps=2, truncated_bptt_steps=None, resume_from_checkpoint=None, profiler=None, benchmark=False, deterministic=False, reload_dataloaders_every_epoch=False, auto_lr_find=False, replace_sampler_ddp=True, terminate_on_nan=False, auto_scale_batch_size=False, prepare_data_per_node=True, amp_backend='native', amp_level='O2', val_percent_check=None, test_percent_check=None, train_percent_check=None, overfit_pct=None, model_name_or_path='t5-base', config_name='', tokenizer_name=None, cache_dir='', encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, learning_rate=3e-05, lr_scheduler='linear', weight_decay=0.0, adam_epsilon=1e-08, warmup_steps=0, num_workers=4, train_batch_size=4, eval_batch_size=4, adafactor=False, output_dir='./verbalizer/graph2text/outputs/test_model', fp16=False, fp16_opt_level='O2', do_train=True, do_predict=True, seed=42, data_dir='./verbalizer/graph2text/data/webnlg', max_source_length=384, max_target_length=384, val_max_target_length=384, test_max_target_length=384, freeze_encoder=False, freeze_embeds=False, sortish_sampler=False, max_tokens_per_batch=None, logger_name='default', n_train=-1, n_val=-1, n_test=-1, task='graph2text', label_smoothing=0.0, src_lang='', tgt_lang='', eval_beams=3, checkpoint=None, val_metric=None, eval_max_gen_length=384, save_top_k=1, early_stopping_patience=15, git_sha='')
/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['model.decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
exported similarities for (defc X pactols) to  ../results/result_similarities/dh/defc-pactols.json
exported exact matches to ../results/result_exactMatches/dh/defc-pactols.json
exported random walk triples of defc to ../results/result_triples/dh/triples_randomWalk_defc.json
exported random walk triples of pactols to ../results/result_triples/dh/triples_randomWalk_pactols.json
exported random tree triples of defc to ../results/result_triples/dh/triples_randomTree_defc.json
exported random tree triples of pactols to ../results/result_triples/dh/triples_randomTree_pactols.json
CUDA NOT AVAILABLE
Graph2Text hparams are: Namespace(logger=True, checkpoint_callback=True, early_stop_callback=False, default_root_dir=None, gradient_clip_val=0, process_position=0, num_nodes=1, num_processes=1, gpus=1, auto_select_gpus=False, log_gpu_memory=None, progress_bar_refresh_rate=1, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=1, max_epochs=100, min_epochs=1, max_steps=None, min_steps=None, limit_train_batches=1.0, limit_val_batches=1.0, limit_test_batches=1.0, val_check_interval=1.0, log_save_interval=100, row_log_interval=50, distributed_backend=None, sync_batchnorm=False, precision=32, weights_summary='top', weights_save_path=None, num_sanity_val_steps=2, truncated_bptt_steps=None, resume_from_checkpoint=None, profiler=None, benchmark=False, deterministic=False, reload_dataloaders_every_epoch=False, auto_lr_find=False, replace_sampler_ddp=True, terminate_on_nan=False, auto_scale_batch_size=False, prepare_data_per_node=True, amp_backend='native', amp_level='O2', val_percent_check=None, test_percent_check=None, train_percent_check=None, overfit_pct=None, model_name_or_path='t5-base', config_name='', tokenizer_name=None, cache_dir='', encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, learning_rate=3e-05, lr_scheduler='linear', weight_decay=0.0, adam_epsilon=1e-08, warmup_steps=0, num_workers=4, train_batch_size=4, eval_batch_size=4, adafactor=False, output_dir='./verbalizer/graph2text/outputs/test_model', fp16=False, fp16_opt_level='O2', do_train=True, do_predict=True, seed=42, data_dir='./verbalizer/graph2text/data/webnlg', max_source_length=384, max_target_length=384, val_max_target_length=384, test_max_target_length=384, freeze_encoder=False, freeze_embeds=False, sortish_sampler=False, max_tokens_per_batch=None, logger_name='default', n_train=-1, n_val=-1, n_test=-1, task='graph2text', label_smoothing=0.0, src_lang='', tgt_lang='', eval_beams=3, checkpoint=None, val_metric=None, eval_max_gen_length=384, save_top_k=1, early_stopping_patience=15, git_sha='')
start generating "../results/result_triplesVerbalized/dh/verbalized_triples_randomWalk_defc.json"

Verbalizing: 0item [00:00, ?item/s]
Verbalizing: 0item [00:00, ?item/s]
saved ../results/result_triplesVerbalized/dh/verbalized_triples_randomWalk_defc.json
start generating "../results/result_triplesVerbalized/dh/verbalized_triples_randomWalk_pactols.json"

Verbalizing: 0item [00:00, ?item/s]
Verbalizing: 0item [00:00, ?item/s]
saved ../results/result_triplesVerbalized/dh/verbalized_triples_randomWalk_pactols.json
start generating "../results/result_triplesVerbalized/dh/verbalized_triples_randomTree_pactols.json"

Verbalizing:   0%|          | 0/1 [00:00<?, ?item/s]/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:4252: FutureWarning: 
`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.

  warnings.warn(formatted_warning, FutureWarning)

Verbalizing:   0%|          | 0/1 [00:00<?, ?item/s]
['translate Graph to English: ']
ERROR VERBALISING translate Graph to English: 
Traceback (most recent call last):
  File "/Users/username/git/oaei/Matcher/OtherMatcher/OntoMatch/src/run_matcher.py", line 362, in <module>
    main()
  File "/Users/username/git/oaei/Matcher/OtherMatcher/OntoMatch/src/run_matcher.py", line 223, in main
    tripleVerbalizer.verbaliseFile(tripleFilePath, tripleVerbalizedFilePath)
  File "/Users/username/git/oaei/Matcher/OtherMatcher/OntoMatch/src/verbalizer/tripleVerbalizer.py", line 33, in verbaliseFile
    verbalised_text = verbalise(triples, verb_module)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/git/oaei/Matcher/OtherMatcher/OntoMatch/src/verbalizer/tripleVerbalizer.py", line 16, in verbalise
    return verbModule.verbalise(ans)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/git/oaei/Matcher/OtherMatcher/OntoMatch/src/verbalizer/verbalisation_module.py", line 143, in verbalise
    return self.verbalise_sentence(input)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/git/oaei/Matcher/OtherMatcher/OntoMatch/src/verbalizer/verbalisation_module.py", line 103, in verbalise_sentence
    gen_output = self.__generate_verbalisations_from_inputs(inputs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/git/oaei/Matcher/OtherMatcher/OntoMatch/src/verbalizer/verbalisation_module.py", line 46, in __generate_verbalisations_from_inputs
    gen_output = self.g2t_module.model.generate(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/transformers/generation/utils.py", line 1713, in generate
    self._prepare_special_tokens(generation_config, kwargs_has_attention_mask, device=device)
  File "/Users/username/.virtualenvs/ontomatch/lib/python3.12/site-packages/transformers/generation/utils.py", line 1556, in _prepare_special_tokens
    raise ValueError(
ValueError: `decoder_start_token_id` or `bos_token_id` has to be defined for encoder-decoder generation.
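
In case it helps: the ValueError comes from transformers' `generate()`, which for encoder-decoder models expects `decoder_start_token_id` (or `bos_token_id`) to be set on the model/generation config. The sketch below is only my guess at where to look; it runs against plain `t5-base` rather than the fine-tuned Lightning checkpoint from the log, and I haven't verified it against OntoMatch itself:

```python
# Hypothetical workaround sketch, NOT verified against OntoMatch:
# make sure the T5 config carries a decoder_start_token_id before generate() is called.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# For T5 the decoder start token is the pad token; if a loaded checkpoint lost
# that setting, restoring it avoids the ValueError seen above.
if model.config.decoder_start_token_id is None:
    model.config.decoder_start_token_id = model.config.pad_token_id

# Dummy input, just to exercise generate() with the same prompt prefix as in the log.
inputs = tokenizer("translate Graph to English: example input", return_tensors="pt")
output = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

(Setting `TOKENIZERS_PARALLELISM=false` in the environment before the run also silences the fork warnings, as the log itself suggests.)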

config.json

{
    "General": {
        "track": "dh",
        "general_fine_tuned_path": "./store/general_fine_tune/",
        "general_pair_tuned_path": "./store/general_pair_tune/",
        "general_pair_tuned_path_2": "./store/general_path_2/",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "///model": "gsarti/biobert-nli",
        "//model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",

        "//tokenizer": "emilyalsentzer/Bio_ClinicalBERT",

        "metrics_folder": "./OntologyAlignment/tensorboard/"
    },
    "dh": {
            "ontologies_folder" : "../data/dh/ontologies",
            "alignments_folder" : "../data/dh/alignments",

            "parsing_parameters":
            {
            "use_label": 1,
            "use_synonyms": 1,
            "autocorrect": 0,
            "synonym_extension": 0,
            "subclass_of_properties": ["UNDEFINED_part_of"] 
            }
        }            
}
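
For completeness, this is roughly how I sanity-checked the structure of config.json on my side (my own snippet, not OntoMatch code); the values in the comments are the ones from the file above:

```python
# My own sanity check of config.json (not OntoMatch code): load it and
# look up the section for the configured track.
import json

with open("config.json") as f:
    cfg = json.load(f)

track = cfg["General"]["track"]           # "dh"
print(cfg["General"]["model"])            # sentence-transformers/all-MiniLM-L6-v2
print(cfg[track]["ontologies_folder"])    # ../data/dh/ontologies
print(cfg[track]["alignments_folder"])    # ../data/dh/alignments
print(cfg[track]["parsing_parameters"])   # use_label, use_synonyms, ...
```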

configMatcher.json

{
    "reformatThisFile": false,
    "resetThisFile": false,
    "importOntologies": true,
    "computeSimilarities": true,
    "matchExactMatches": true,
    "thresholdForExactMatches": 0.95,
    "runRandomWalkAlgorithm": true,
    "runRandomTreeAlgorithm": true,
    "randomTreeConfig": {
        "breadth": 2,
        "path_depth": 3,
        "parent_prob": 28,
        "child_prob": 28,
        "equivalent_prob": 28,
        "object_prob": 16
    },
    "verbalizeAvailableTriples": true,
    "promptVersions": [
        0,
        1,
        2,
        3
    ],
    "generateWalkPrompts": true,
    "generateTreePrompts": true,
    "runAllPromptsOnLLM": false,
    "runMissingPromptsOnDemandAndMatch": true,
    "thresholdForConsideration": 0.4,
    "neighborhoodRange": 2,
    "exportFinalMatchingsToRDF": true,
    "track": "dh",
    "similarityPath": "../results/result_similarities/dh/",
    "exactMatchPath": "../results/result_exactMatches/dh/",
    "triplesPath": "../results/result_triples/dh/",
    "triplesVerbalizedPath": "../results/result_triplesVerbalized/dh/",
    "promptsPath": "../results/result_prompts/dh/",
    "llmOutcomePath": "../results/result_llmOutcome/dh/",
    "bipartiteMatchingPath": "../results/result_bipartiteMatching/dh/",
    "rdfPath": "../results/result_RDF/dh/"
}
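
And one small check I did on configMatcher.json (again my own snippet; I'm only assuming the four randomTreeConfig values are meant to sum to 100):

```python
# Quick check of configMatcher.json (my snippet, not part of OntoMatch):
# I assume the four randomTreeConfig probabilities are meant to sum to 100.
import json

with open("configMatcher.json") as f:
    cfg = json.load(f)

tree = cfg["randomTreeConfig"]
print(tree["parent_prob"] + tree["child_prob"]
      + tree["equivalent_prob"] + tree["object_prob"])  # 28 + 28 + 28 + 16 = 100
```
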
FelixFrizzy commented 16 hours ago

You can also send me the results of the dh and archmultiling benchmarks if it works on your machine.