[x] gottbert-base
Uses uklfr/gottbert-base; implemented, but it requires a custom tokenizer/processor.
Update: fixed by inheriting from the RoBERTa tokenizer, instantiated with add_prefix_space=True as the assertion in the traceback below requires.
Error executing job with overrides: ['dataset=smartdata', 'encoder=gottbert-base', 'dataset_processor=gottbert-base', 'evaluation/dataset=nway_kshot_5_1']
Traceback (most recent call last):
File "evaluate.py", line 20, in evaluate
evaluation_results = evaluate_config(cfg)
File "/opt/conda/lib/python3.8/site-packages/fewie/eval.py", line 37, in evaluate_config
processed_dataset = dataset_processor(dataset)
File "/opt/conda/lib/python3.8/site-packages/fewie/dataset_processors/gottbert.py", line 36, in __call__
return dataset.map(self.tokenize_and_align_labels, batched=True)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1665, in map
return self._map_single(
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2016, in _map_single
batch = apply_function_on_filtered_inputs(
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1906, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "/opt/conda/lib/python3.8/site-packages/fewie/dataset_processors/gottbert.py", line 39, in tokenize_and_align_labels
tokenized_inputs = self.tokenizer(
File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2368, in __call__
return self.batch_encode_plus(
File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2553, in batch_encode_plus
return self._batch_encode_plus(
File "/opt/conda/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 158, in _batch_encode_plus
assert self.add_prefix_space or not is_split_into_words, (
AssertionError: You need to instantiate RobertaTokenizerFast with add_prefix_space=True to use it with pretokenized inputs.
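The processor's tokenize_and_align_labels step maps word-level NER labels onto subword tokens via word_ids. The sketch below illustrates that alignment in isolation (function name and sample values are hypothetical, not from the fewie source); it follows the standard HuggingFace convention of assigning -100 to special tokens and continuation subwords so the loss ignores them:

```python
def align_labels_with_word_ids(labels, word_ids, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    Special tokens (word_id is None) and continuation subwords
    receive `ignore_index` so the loss function skips them.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:                # special token such as <s> or </s>
            aligned.append(ignore_index)
        elif word_id != previous_word_id:  # first subword of a word
            aligned.append(labels[word_id])
        else:                              # continuation subword
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned

# Hypothetical example: second word split into two subwords,
# with <s> ... </s> wrapping the sentence.
word_ids = [None, 0, 1, 1, 2, None]
labels = [3, 5, 0]  # word-level tags for a three-word sentence
print(align_labels_with_word_ids(labels, word_ids))
# [-100, 3, 5, -100, 0, -100]
```

The word_ids list itself comes from the (fast) tokenizer, which is exactly why the assertion above fires: a RobertaTokenizerFast must be built with add_prefix_space=True before it accepts pretokenized (is_split_into_words=True) input.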
[x] xlm-clm-ende-1024
Implemented.
Update: skip this encoder; its tokenizer is Python-based, so word_ids() is unavailable (see traceback below) and assigning word IDs manually is troublesome.
Error executing job with overrides: ['dataset=smartdata', 'encoder=xlm-ende', 'dataset_processor=xlm-ende', 'evaluation/dataset=nway_kshot_5_1']
Traceback (most recent call last):
File "evaluate.py", line 20, in evaluate
evaluation_results = evaluate_config(cfg)
File "/opt/conda/lib/python3.8/site-packages/fewie/eval.py", line 37, in evaluate_config
processed_dataset = dataset_processor(dataset)
File "/opt/conda/lib/python3.8/site-packages/fewie/dataset_processors/xlm-ende.py", line 36, in __call__
return dataset.map(self.tokenize_and_align_labels, batched=True)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1665, in map
return self._map_single(
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2016, in _map_single
batch = apply_function_on_filtered_inputs(
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1906, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "/opt/conda/lib/python3.8/site-packages/fewie/dataset_processors/xlm-ende.py", line 50, in tokenize_and_align_labels
word_ids = tokenized_inputs.word_ids(batch_index=i)
File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 353, in word_ids
raise ValueError("word_ids() is not available when using Python-based tokenizers")
ValueError: word_ids() is not available when using Python-based tokenizers
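Since only fast (Rust-backed) tokenizers expose word_ids(), a slow XLM tokenizer would need the word indices reconstructed by hand. A minimal sketch of that bookkeeping, assuming XLM's BPE convention where a "</w>" suffix marks the last subword of each word (function name and sample tokens are illustrative, not from the fewie source):

```python
def manual_word_ids(tokens, specials=("<s>", "</s>")):
    """Reconstruct word indices for an XLM-style BPE tokenization,
    where a '</w>' suffix marks the final subword of each word.

    Sketch only: unknown pieces, punctuation splits, and other edge
    cases make this fragile in practice, which is why dropping the
    encoder is the pragmatic choice.
    """
    word_ids = []
    current = 0
    for token in tokens:
        if token in specials:
            word_ids.append(None)  # special tokens carry no word index
            continue
        word_ids.append(current)
        if token.endswith("</w>"):
            current += 1           # word boundary reached
    return word_ids

# Hypothetical tokenization of a three-word German sentence,
# with the third word split into two subwords.
tokens = ["<s>", "das</w>", "Auto</w>", "f\u00e4h", "rt</w>", "</s>"]
print(manual_word_ids(tokens))
# [None, 0, 1, 2, 2, None]
```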