Hi,
We first split the input text into sentences. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L439
Then, the sentences are tokenized and truncated to max_sequence_length. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L302
The tokenized sentences are fed into the model as a batch. https://github.com/dmis-lab/BERN2/blob/main/multi_ner/main.py#L732
In a nutshell, we handle long texts by splitting them into sentences and feeding the sentences into the model as a batch. Does this help?
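Here is a minimal sketch of that flow (sentence split, tokenize with truncation, batched forward pass). The helper names, the placeholder checkpoint, and the naive sentence splitter are illustrative assumptions, not the actual BERN2 code, which lives in multi_ner/main.py:

```python
# Illustrative sketch only: BERN2's real pipeline is in multi_ner/main.py.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.2"  # placeholder checkpoint
MAX_SEQ_LENGTH = 512

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

def split_into_sentences(text):
    # Naive splitter for illustration; BERN2 uses its own sentence splitter.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def predict(text):
    sentences = split_into_sentences(text)
    # Each sentence is tokenized and truncated to MAX_SEQ_LENGTH tokens,
    # so no single input exceeds the model's 512-token limit.
    batch = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
        return_tensors="pt",
    )
    # All sentences from the document are fed to the model as one batch.
    with torch.no_grad():
        logits = model(**batch).logits
    return logits.argmax(dim=-1)
```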
That helps. Thank you for your reply.
Hello, your API sets a limit of 3,000 characters on the plain text input. I'm wondering what happens if the tokenized input is longer than 512 tokens; how is that handled? I couldn't find the corresponding code snippet in your repository. Could you share your solution?