dmis-lab / BERN2

BERN2: an advanced neural biomedical named-entity recognition and normalization tool
http://bern2.korea.ac.kr
BSD 2-Clause "Simplified" License

Longer than 512 tokens #48

Closed jinniulema closed 1 year ago

jinniulema commented 1 year ago

Hello, your API sets a limit of 3000 characters on the plain text. If the input is longer than 512 tokens after tokenization, how is that handled? I didn't find the corresponding code snippet in your repository; can you share your solution?

mjeensung commented 1 year ago

Hi,

  1. We first split the input text into sentences. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L439

  2. Then, the sentences are tokenized and truncated to max_sequence_length. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L302

  3. The tokenized sentences are fed into the model as a batch. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L732

In a nutshell, we handle long texts by splitting them into sentences and feeding the sentences into the model as a batch. Does this help?
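The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual BERN2 code: the regex sentence splitter, the whitespace tokenizer, and the `MAX_SEQ_LENGTH` constant are placeholder stand-ins for BERN2's own splitter and BERT subword tokenizer in `multi_ner/main.py`.

```python
import re

MAX_SEQ_LENGTH = 512  # stand-in for BERN2's max_sequence_length

def split_sentences(text):
    # Step 1: naive sentence splitter (BERN2 uses its own splitter)
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize_and_truncate(sentence, max_len=MAX_SEQ_LENGTH):
    # Step 2: placeholder whitespace tokenizer, truncated to max_len
    # (the real code uses a BERT subword tokenizer)
    return sentence.split()[:max_len]

def batch_for_model(text):
    # Step 3: tokenize each sentence and return them together as one batch,
    # so no single model input exceeds the 512-token limit
    return [tokenize_and_truncate(s) for s in split_sentences(text)]

batch = batch_for_model("First sentence here. Second sentence follows! A third?")
print(len(batch))  # → 3 (one tokenized sentence per batch entry)
```

The key design point is that the 512-token limit applies per sentence, not per document, so arbitrarily long input stays within the model's window as long as individual sentences fit.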

jinniulema commented 1 year ago

That helps. Thank you for your reply.