ebanalyse / NERDA

Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

getting error while trying to train NERDA on conll_data #22

Closed subhadip10 closed 2 years ago

subhadip10 commented 3 years ago

transformers.__version__ == '4.1.0.dev0'
torch.__version__ == '1.7.0+cu92'
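(A quick way to confirm these, using the standard `__version__` attributes both packages expose:)

import transformers
import torch
print(transformers.__version__)  # 4.1.0.dev0
print(torch.__version__)         # 1.7.0+cu92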

Code

import numpy as np
from NERDA.datasets import get_conll_data, download_conll_data
download_conll_data()
training = get_conll_data('train')
validation = get_conll_data('valid')

transformer = 'bert-base-uncased'

from NERDA.models import NERDA
model = NERDA(dataset_training = training,
              dataset_validation = validation,
              tag_scheme = tag_scheme,
              tag_outside = 'O',
              transformer = transformer,
              dropout = dropout,
              hyperparameters = training_hyperparameters)

model.train()


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in <module>()
----> 1 model.train()

/libs/project/NERDA/models.py in train(self)
    203             device = self.device,
    204             num_workers = self.num_workers,
--> 205             **self.hyperparameters)
    206 
    207         # attach as attributes to class

/libs/project/NERDA/training.py in train_model(network, tag_encoder, tag_outside, transformer_tokenizer, transformer_config, dataset_training, dataset_validation, max_len, train_batch_size, validation_batch_size, epochs, warmup_steps, learning_rate, device, fixed_seed, num_workers)
    154         print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    155 
--> 156         train_loss = train(network, dl_train, optimizer, device, scheduler, n_tags)
    157         train_losses.append(train_loss)
    158         valid_loss = validate(network, dl_validate, device, n_tags)

/libs/project/NERDA/training.py in train(model, data_loader, optimizer, device, scheduler, n_tags)
     13     final_loss = 0.0
     14 
---> 15     for dl in tqdm(data_loader, total=len(data_loader)):
     16 
     17         optimizer.zero_grad()

~/libraries/nb_env/lib64/python3.6/site-packages/tqdm/std.py in __iter__(self)
   1165 
   1166         try:
-> 1167             for obj in iterable:
   1168                 yield obj
   1169                 # Update and possibly print the progressbar.

~/libraries/nb_env/lib64/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433             if self._sampler_iter is None:
    434                 self._reset()
--> 435             data = self._next_data()
    436             self._num_yielded += 1
    437             if self._dataset_kind == _DatasetKind.Iterable and \

~/libraries/nb_env/lib64/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
   1083             else:
   1084                 del self._task_info[idx]
-> 1085                 return self._process_data(data)
   1086 
   1087     def _try_put_index(self):

~/libraries/nb_env/lib64/python3.6/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1109         self._try_put_index()
   1110         if isinstance(data, ExceptionWrapper):
-> 1111             data.reraise()
   1112         return data
   1113 

~/libraries/nb_env/lib64/python3.6/site-packages/torch/_utils.py in reraise(self)
    426             # have message field
    427             raise self.exc_type(message=msg)
--> 428         raise self.exc_type(msg)
    429 
    430 

TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/jupyter/libraries/nb_env/lib64/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/jupyter/libraries/nb_env/lib64/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jupyter/libraries/nb_env/lib64/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/libs/project/NERDA/preprocessing.py", line 81, in __getitem__
    input_ids = self.transformer_tokenizer.encode(tokens)
  File "/home/jupyter/libraries/nb_env/lib64/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2127, in encode
    **kwargs,
  File "/home/jupyter/libraries/nb_env/lib64/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2452, in encode_plus
    **kwargs,
  File "/home/jupyter/libraries/nb_env/lib64/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 465, in _encode_plus
    **kwargs,
  File "/home/jupyter/libraries/nb_env/lib64/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 378, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
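For context, this tokenizers TypeError is typically raised when a fast tokenizer's encode() receives something other than a clean list of strings, for example an empty sentence or a None/non-string token. A minimal diagnostic sketch (not from the original report; it assumes get_conll_data() returns a dict with a 'sentences' key, as in NERDA's datasets module):

# Hypothetical sanity check: look for empty sentences or non-string tokens,
# which are common triggers of this tokenizers TypeError.
from NERDA.datasets import get_conll_data

training = get_conll_data('train')
suspicious = [i for i, sentence in enumerate(training['sentences'])
              if not sentence
              or any(not isinstance(token, str) for token in sentence)]
print(len(suspicious), 'suspicious sentences; first indices:', suspicious[:5])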
smaakage85 commented 2 years ago

Hi @subhadip10

Thanks for the feedback.

Your code is not 100% complete, so I have tried to fill in the gaps in the code snippet below.

With the latest release of NERDA (0.9.7) I am not able to reproduce the error.

Please try upgrading NERDA and check if the error persists. If so, please get back to me, and I will take a look at it. Until then, I will close this issue.

Best, Lars

from NERDA.datasets import get_conll_data, download_conll_data
download_conll_data()
training = get_conll_data('train')
validation = get_conll_data('valid')

transformer = 'bert-base-uncased'

tag_scheme = ['B-PER',
              'I-PER', 
              'B-ORG', 
              'I-ORG', 
              'B-LOC', 
              'I-LOC', 
              'B-MISC', 
              'I-MISC']

hyperparameters = {'epochs' : 3,
                   'warmup_steps' : 10,
                   'train_batch_size': 5,
                   'learning_rate': 0.0001}

from NERDA.models import NERDA
model = NERDA(dataset_training = training,
              dataset_validation = validation,
              tag_scheme = tag_scheme,
              tag_outside = 'O',
              transformer = transformer,
              hyperparameters = hyperparameters)

model.train()
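As a follow-up, once training completes the fitted model can be applied to new text via NERDA's predict_text, e.g.:

# Illustrative usage (not from the original thread); output format follows
# NERDA's prediction API.
print(model.predict_text('Old MacDonald had a farm'))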