dmis-lab / BERN2

BERN2: an advanced neural biomedical named-entity recognition and normalization tool
http://bern2.korea.ac.kr
BSD 2-Clause "Simplified" License

Longer than 512 tokens #48

Closed jinniulema closed 1 year ago

jinniulema commented 1 year ago

Hello, your API sets a limit of 3000 characters on the plain text. If the input is longer than 512 tokens after tokenization, how is that handled? I didn't find the corresponding code snippet in your repository; can you share your solution?

mjeensung commented 1 year ago

Hi,

  1. We first split the input text into sentences. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L439

  2. Then, the sentences are tokenized and truncated to max_sequence_length. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L302

  3. The tokenized sentences are fed into the model as a batch. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L732

In a nutshell, we handle long texts by splitting them into sentences and feeding the sentences into the model as a batch. Does this help?
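The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual BERN2 code: the regex sentence splitter, the whitespace tokenizer, and the `MAX_SEQ_LENGTH` constant are placeholder stand-ins for BERN2's own splitter and BERT subword tokenizer in `multi_ner/main.py`.

```python
import re

MAX_SEQ_LENGTH = 512  # stand-in for BERN2's max_sequence_length

def split_sentences(text):
    # Step 1: naive sentence splitter (BERN2 uses its own splitter)
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize_and_truncate(sentence, max_len=MAX_SEQ_LENGTH):
    # Step 2: placeholder whitespace tokenizer, truncated to max_len
    # (the real code uses a BERT subword tokenizer)
    return sentence.split()[:max_len]

def batch_for_model(text):
    # Step 3: tokenize each sentence and return them together as one batch,
    # so no single model input exceeds the 512-token limit
    return [tokenize_and_truncate(s) for s in split_sentences(text)]

batch = batch_for_model("First sentence here. Second sentence follows! A third?")
print(len(batch))  # → 3 (one tokenized sentence per batch entry)
```

The key design point is that the 512-token limit applies per sentence, not per document, so arbitrarily long input stays within the model's window as long as individual sentences fit.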

jinniulema commented 1 year ago

That helps. Thank you for your reply.