deeppavlov / DeepPavlov

An open source library for deep learning end-to-end dialog systems and chatbots.
https://deeppavlov.ai

NER - “input sequence after bert tokenization shouldn’t exceed 512 tokens” (ner_bert_base) #1686

Open · ghnp5 opened this issue 4 months ago

ghnp5 commented 4 months ago

DeepPavlov version: The latest docker container deeppavlov/deeppavlov, published last month

Python version: 3.10

Operating system: CentOS/AlmaLinux (the host running the Docker container)

Issue:

I’m trying to understand how to prevent this crash:

input sequence after bert tokenization shouldn’t exceed 512 tokens.

I’m using the REST API, so I’m calling ner_bert_base like this:

{
  "x": [
    "A huge text. Blah blah blah... No line breaks. I'm a 28 year-old person called John Smith, etc..."
  ]
}
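
In other words, an HTTP POST like this (Python here just for illustration; the /model path and port 5000 are the defaults of a standard riseapi deployment, so adjust if your container maps them differently):

import requests

# POST the batch to the NER model; "x" holds one or more input strings.
resp = requests.post(
    "http://localhost:5000/model",
    json={"x": ["I'm a 28 year-old person called John Smith."]},
)
resp.raise_for_status()
# The response is a nested list with, per input string, the tokens and their tags.
print(resp.json())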

While researching this error, I found: https://github.com/deeppavlov/DeepPavlov/issues/839#issuecomment-492209665

which says:

Sorry, but the BERT model has positional embeddings only for first 512 subtokens. So, the model can’t work with longer sequences. It is a deliberate architecture restriction. Subtokens are produced by WordPiece tokenizer (BPE). 512 subtokens correspond approximately to 300-350 regular tokens for multilingual model. Make sure that you performed sentence tokenization before dumping the data. Every sentence in the dumped data should be separated by an empty line.

But I don’t fully understand what I need to do to resolve the problem.

What does “Make sure that you performed sentence tokenization before dumping the data” mean? Is it some function I need to call first that returns the list of tokens? Is it something I can call via the REST API from my application/code?
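
From what I can tell, the practical equivalent for the REST API is to split the text into sentences on the client side and send each sentence as its own element of "x", so no single element comes near the 512-subtoken limit. A minimal sketch of the idea (Python only for illustration, since my application isn’t Python; nltk and the default riseapi endpoint are assumptions here):

import nltk
import requests

nltk.download("punkt", quiet=True)  # one-time download of the sentence splitter

long_text = "A huge text. Blah blah blah... I'm a 28 year-old person called John Smith."

# One sentence per batch element; the model tags each element independently.
sentences = nltk.sent_tokenize(long_text)
resp = requests.post("http://localhost:5000/model", json={"x": sentences})
print(resp.json())

A single pathologically long sentence could of course still exceed the limit, so a length check or chunking fallback would still be needed on top of this.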

I also looked into having my application (the caller) tokenize the words and punctuation itself and send only the first 512 tokens, but it’s hard to preserve the spacing, and even when I send exactly 512 tokens, the model somehow still exceeds the limit and crashes anyway.
I feel like I’m trying to reinvent the wheel.
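
For the record, the closest I got to a safe client-side cut is to let a HuggingFace fast tokenizer report character offsets for each subtoken and slice the original string at the last kept subtoken, which preserves spacing exactly. A sketch, assuming bert-base-cased matches the vocabulary ner_bert_base loads (that part is a guess) and leaving headroom below 512 for the [CLS]/[SEP] tokens the server adds:

from transformers import AutoTokenizer

# Assumption: swap in whichever checkpoint your ner_bert_base config actually loads.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def truncate_to_subtokens(text: str, max_subtokens: int = 500) -> str:
    """Cut `text` so its WordPiece length stays under the limit,
    slicing the original string so spacing and punctuation survive."""
    enc = tokenizer(text, add_special_tokens=True, return_offsets_mapping=True)
    if len(enc["input_ids"]) <= max_subtokens:
        return text
    # Special tokens map to (0, 0) offsets; keep only real subtokens.
    offsets = [o for o in enc["offset_mapping"] if o != (0, 0)]
    # Keep max_subtokens - 2 content subtokens, reserving room for [CLS]/[SEP].
    last_kept = offsets[max_subtokens - 3]
    return text[: last_kept[1]]

Anything past the cut simply goes untagged, which is why sentence splitting seems like the better fix for long documents.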

Can’t the API and/or the model just truncate input past 512 tokens, either silently or when a flag/parameter is set in the request?

(Note that my application is not written in Python.)

Thank you very much!

ghnp5 commented 4 months ago

This actually resolves my issues: https://github.com/deeppavlov/DeepPavlov/pull/1657/files#diff-c2feefe4ebd288d44761cad4fbe6c29d43da997a00c597f6281e89ceca3a57d2