ebanalyse / NERDA

Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks
MIT License

tiny bug in NERDADataSetReader! #1

Closed meti-94 closed 3 years ago

meti-94 commented 3 years ago

Hi there! In some cases, an error is raised while iterating over the DataLoader's batches, and I believe it happens because of the length of the offsets list. The error looks like this:

```
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [150] at entry 0 and [151] at entry 1
```

A quick and unprincipled fix is to add an extra line of code that truncates the list in the NERDADataSetReader() class; this worked for me! :)

offsets = offsets[:self.max_len]
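
To illustrate what the traceback above is complaining about, here is a minimal, self-contained sketch (a toy dataset, not NERDA's actual NERDADataSetReader) showing that default_collate can only stack per-sample tensors of identical length, and that capping the offsets list at max_len is what makes the batch collate again:

```python
# Toy illustration (not NERDA code): default_collate calls torch.stack on the
# per-sample tensors, so one offsets list of length 151 next to lists of
# length 150 reproduces the "stack expects each tensor to be equal size" error.
import torch
from torch.utils.data import Dataset, DataLoader

class ToyOffsetsDataset(Dataset):
    def __init__(self, lengths, max_len=150):
        self.lengths = lengths      # simulated offsets-list lengths per sentence
        self.max_len = max_len

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        offsets = list(range(self.lengths[idx]))
        # Without this truncation, the sample with 151 offsets yields a [151]
        # tensor next to [150] tensors and collation fails as in the traceback.
        offsets = offsets[:self.max_len]
        return {"offsets": torch.tensor(offsets)}

loader = DataLoader(ToyOffsetsDataset([150, 151, 150]), batch_size=3)
batch = next(iter(loader))
print(batch["offsets"].shape)  # torch.Size([3, 150]) once the lists are truncated
```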

smaakage85 commented 3 years ago

Hi @meti-94

Thank you so much for your feedback. I will look into it asap :)

Can you provide the sentence + tags that triggered the error? Or, even better, some code that reproduces the error? /L

smaakage85 commented 3 years ago

@meti-94 can you help with some code to reproduce this error? Please :S

meti-94 commented 3 years ago

Hi, so sorry for the late answer. I am attaching a text file that contains several records, plus a link to my implementation. Change the data source and you should see the error :)

```
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in elem}
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [150] at entry 0 and [151] at entry 6
```

Implementation: https://github.com/meti-94/TextClassification/blob/main/BERT_NER.ipynb Sample data: sample.txt

liuhh02 commented 3 years ago

> Hi there! In some cases, an error is raised while iterating over the DataLoader's batches, and I believe it happens because of the length of the offsets list. The error looks like this:
>
> RuntimeError: Caught RuntimeError in DataLoader worker process 0. Original Traceback (most recent call last): File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop data = fetcher.fetch(index) File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch return self.collate_fn(data) File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 73, in default_collate return {key: default_collate([d[key] for d in batch]) for key in elem} File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 73, in <dictcomp> return {key: default_collate([d[key] for d in batch]) for key in elem} File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate return torch.stack(batch, 0, out=out) RuntimeError: stack expects each tensor to be equal size, but got [150] at entry 0 and [151] at entry 1
>
> A quick and unprincipled fix is to add an extra line of code that truncates the list in the NERDADataSetReader() class; this worked for me! :)
>
> offsets = offsets[:self.max_len]

Hello! I'm facing the same error as well. May I know where you are adding the line offsets = offsets[:self.max_len] to fix the error? Is it after line 96 https://github.com/ebanalyse/NERDA/blob/main/src/NERDA/preprocessing.py#L96?

meti-94 commented 3 years ago

As I said before, that was not the best solution :)

liuhh02 commented 3 years ago

> As I said before, that was not the best solution :)

Yeah, I understand. Do you have any idea why this is happening? And do you perhaps know of a better solution? (And just as a quick fix, it is to add the line of code after line 96, right?)

meti-94 commented 3 years ago

As a quick fix, I suggest adding a simple truncation statement after line 96! I will analyze the code to find a better solution as soon as possible; this repo could help me a lot! :)
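
For readers looking for the exact spot: the snippet below is only a hypothetical illustration (it does not reproduce preprocessing.py, and the helper name is made up) of the idea behind the quick fix, namely capping every per-token list at max_len before padding so that all fields returned by the dataset collate to the same size:

```python
# Hypothetical helper, not NERDA's actual code: truncate first, then pad,
# so every per-token list handed back by the dataset has exactly max_len items.
def truncate_and_pad(values, max_len, pad_value=0):
    values = values[:max_len]                      # the quick fix: hard truncation
    return values + [pad_value] * (max_len - len(values))

# An offsets list that overflowed by one element vs. input_ids that did not.
offsets = list(range(151))
input_ids = list(range(150))
print(len(truncate_and_pad(offsets, 150)),
      len(truncate_and_pad(input_ids, 150)))       # 150 150 -> equal-size tensors
```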

smaakage85 commented 3 years ago

Hi @meti-94 and @liuhh02. Sorry to have kept you hanging for so long! Can you help by providing just an example of a sentence that causes this error?

smaakage85 commented 3 years ago

... and if you have an idea for a solution, I would love it if you would make a Pull Request! :)

NicholasJallan commented 3 years ago

Hi there! I have the very same issue with a custom training/validation set, even though it seems to have a proper size. I would be very pleased if this problem could be fixed.

meti-94 commented 3 years ago

Hi again :) I came up with an explanation and a simple solution for the issue; you can check it out here!

smaakage85 commented 3 years ago

pull request merged! : )