KRR-Oxford / DeepOnto

A package for ontology engineering with deep learning and language models.
https://krr-oxford.github.io/DeepOnto/
Apache License 2.0

Tokenizer error "list index out of range" during mapping extension #10

Closed Danysan1 closed 1 year ago

Danysan1 commented 1 year ago

Describe the bug
Under some circumstances, the tokenizer throws IndexError: list index out of range during the mapping extension stage. The error originates at bert_classifier.py line 185. It is the same error, at the same location inside the tokenizer, as https://github.com/huggingface/tokenizers/issues/993, which was caused by the data passed to the tokenizer.
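For reference, the same opaque failure can be reproduced in isolation by passing an empty batch to a fast Hugging Face tokenizer (a minimal sketch assuming an empty class-name list is indeed the trigger; bert-base-uncased merely stands in for whatever checkpoint the pipeline loads):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# An empty batch makes the fast tokenizer fail deep inside _batch_encode_plus
# with "IndexError: list index out of range" instead of a descriptive message.
tokenizer([], padding=True, truncation=True, max_length=256)
```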

To Reproduce
I have reproduced this error with these settings:

| Logs & stack trace | max_length_for_input | batch_size_for_training | Source ontology | Target ontology |
|---|---|---|---|---|
| link | 256 | 16 | music-representation.owl | musicClasses.owl @ 2ebb641 |
| link | 128 | 8 | core.owl | musicClasses.owl @ ebc2d09 |

Expected behavior
The stage and the pipeline should complete successfully.

Platform:

Lawhy commented 1 year ago

Hi @Danysan1, thanks for reporting this. Before I conduct a thorough check, may I ask whether every class in your ontologies has at least one label available? I suspect this issue is caused by an empty list of class names.
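One quick way to check is to list the classes that have no rdfs:label, e.g. with rdflib (a rough sketch that assumes labels are plain rdfs:label annotations and does not go through DeepOnto's own ontology loading):

```python
from rdflib import Graph, URIRef, RDF, RDFS, OWL

g = Graph().parse("musicClasses.owl")  # path from your reproduction settings

# Named classes that have no rdfs:label at all.
unlabelled = [
    cls
    for cls in g.subjects(RDF.type, OWL.Class)
    if isinstance(cls, URIRef) and (cls, RDFS.label, None) not in g
]
print(f"{len(unlabelled)} classes without an rdfs:label")
for cls in unlabelled:
    print(" ", cls)
```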

Also, the default BERTMap configuration might not include all of the annotation properties used in your ontologies, so some labels could be missed even if they exist.

Danysan1 commented 1 year ago

Yes, there were some classes without a label. After I fixed them, the pipeline completed successfully.

It would be ideal to check for this condition (an empty list of class labels) and raise an explicit error before calling the tokenizer.
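Something along these lines would already make the failure much easier to diagnose (a hypothetical guard; the function and argument names are illustrative, not DeepOnto's actual API):

```python
def encode_class_labels(tokenizer, labels, max_length):
    """Tokenize a batch of class labels, failing early with a clear message."""
    if not labels or any(not label for label in labels):
        raise ValueError(
            "Encountered an empty class label list (or an empty label); make sure "
            "every class has at least one annotation covered by the configured "
            "annotation properties."
        )
    return tokenizer(labels, padding=True, truncation=True, max_length=max_length)
```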

Similarly, an unclear error is thrown at text_semantics.py line 232 if the given ontology has no subClassOf relationships. An explicit error would be ideal there as well.
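The same pattern would help there, e.g. checking up front that the ontology declares at least one subclass axiom (again a rough sketch using rdflib rather than DeepOnto's internal representation):

```python
from rdflib import Graph, RDFS

g = Graph().parse("core.owl")  # ontology from the second reproduction row

if next(g.triples((None, RDFS.subClassOf, None)), None) is None:
    raise ValueError(
        "Ontology declares no rdfs:subClassOf axioms; subsumption-based "
        "text semantics cannot be built from it."
    )
```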

Lawhy commented 1 year ago

Sure, I will update this in the next release. Thanks for your feedback.