explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Does spaCy have any sentence limit for named entity recognition? #7094

Closed. user06039 closed this issue 3 years ago.

user06039 commented 3 years ago

I am trying to do named entity recognition with spaCy v3, and this is my config file. Under [corpora.train] I found a setting called max_length = 2000. Does this mean it will truncate sentences that are longer than 2000 words?

In my dataset each document is 1000-5000 words long, and I don't want to truncate anything. Do I have to change any parameters in the config file to get better results on such long documents when doing NER?

There are no proper examples of use-case-based config file changes, so please help me out.

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 50
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = "en_core_web_lg"
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
before_init = null
after_init = null

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[initialize.components]

[initialize.tokenizer]

adrianeboyd commented 3 years ago

Here are the docs for spacy.Corpus.v1: https://spacy.io/api/corpus

You can also write a custom corpus loader if you need different options.
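
For illustration, a minimal custom reader might look like the sketch below. It assumes the annotated training data is stored as a .spacy DocBin file, and the registered name "custom_docbin_corpus.v1" is just a placeholder:

from pathlib import Path
from typing import Callable, Iterable

import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


@spacy.registry.readers("custom_docbin_corpus.v1")
def create_docbin_reader(path: Path) -> Callable[[Language], Iterable[Example]]:
    def read_corpus(nlp: Language) -> Iterable[Example]:
        # Load the annotated docs and yield every one of them as-is,
        # with no length-based filtering or truncation.
        doc_bin = DocBin().from_disk(path)
        for reference in doc_bin.get_docs(nlp.vocab):
            predicted = nlp.make_doc(reference.text)
            yield Example(predicted, reference)
    return read_corpus

In the config, [corpora.train] would then use @readers = "custom_docbin_corpus.v1" instead of "spacy.Corpus.v1".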

user06039 commented 3 years ago

@adrianeboyd If I set max_length = 0, does it affect model accuracy? If the corpus reader splits my document into sentences, will it later concatenate the embeddings of each sentence back into one document for better NER predictions?

I am not able to understand the advantages or disadvantages of max_length = 0.

honnibal commented 3 years ago

We generally avoid truncating the inputs at all costs, preferring pretty much any other solution. Truncated inputs aren't real text, which is especially bad for the parser, but also bad for other components.

The main reason the max_length option exists is to avoid memory problems, which is especially relevant for transformer models on GPU. The max_length setting allows you to prevent long inputs from blowing up your training.
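
As a sketch of what that means for the config in this issue, you could set max_length = 0 under [corpora.train] to keep whole documents, and only bring back a finite limit if training starts running out of memory:

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null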

user06039 commented 3 years ago

@honnibal An example: say max_length = 4 and the annotated text is

Tokens:  My  name  is  John    Mat     and  I  work  at  Google
Tags:    O   O     O   B-NAME  I-NAME  O    O  O     O   U-COMPANY

Then the split happens:

My name is John    ->  O O O B-NAME
Mat and I work at  ->  I-NAME O O O O
Google             ->  U-COMPANY

Does this kind of problem happen if a document is split up based on max_length? Can a named entity end up split across two different pieces? Does spaCy do something to take care of such an issue?

adrianeboyd commented 3 years ago

You can probably use max_length = 0 with your data without any issues. However if it runs out of memory while training, you might need to come back to this setting.

max_length does not split up or truncate sentences, because we do not think this is a sensible thing to do. Instead, if the doc is too long, it tries to use the individual sentences from the doc, and if those are too long, it skips them entirely. Please try it out with your own corpus to see! For reference, the relevant code is here:

https://github.com/explosion/spaCy/blob/4188beda871b1e40eb8d02b8a787b6878c89717e/spacy/training/corpus.py#L150-L163

Be aware that the NER component does not predict entities across sentence boundaries, either. If your pipeline has a component that sets sentence boundaries before ner (sentencizer, senter, parser, etc.), this can affect the results.

(As a side note in case you decide to implement your own corpus reader, is_sentenced is deprecated and should be replaced with has_annotation("SENT_START").)
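
To make that behaviour concrete, here is a rough Python paraphrase of the filtering step in the linked corpus.py (a simplified sketch, not the exact implementation):

from typing import Iterable, Iterator

from spacy.tokens import Doc


def filter_by_length(docs: Iterable[Doc], max_length: int = 0) -> Iterator[Doc]:
    for doc in docs:
        if max_length == 0 or len(doc) < max_length:
            # No limit set, or short enough: use the whole document.
            yield doc
        elif doc.has_annotation("SENT_START"):
            # Too long, but sentence boundaries are annotated: fall back to
            # individual sentences, skipping any that are still too long.
            for sent in doc.sents:
                if len(sent) < max_length:
                    yield sent.as_doc()
        # Otherwise the document is skipped entirely; nothing is ever truncated.

With max_length = 0, the first branch always applies and every document is used whole.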

ginward commented 3 years ago

> You can probably use max_length = 0 with your data without any issues. However if it runs out of memory while training, you might need to come back to this setting.
>
> max_length does not split up or truncate sentences, because we do not think this is a sensible thing to do. Instead, if the doc is too long, it tries to use the individual sentences from the doc, and if those are too long, it skips them entirely. Please try it out with your own corpus to see! For reference, the relevant code is here:
>
> https://github.com/explosion/spaCy/blob/4188beda871b1e40eb8d02b8a787b6878c89717e/spacy/training/corpus.py#L150-L163
>
> Be aware that the NER component does not predict entities across sentence boundaries, either. If your pipeline has a component that sets sentence boundaries before ner (sentencizer, senter, parser, etc.), this can affect the results.
>
> (As a side note in case you decide to implement your own corpus reader, is_sentenced is deprecated and should be replaced with has_annotation("SENT_START").)

Does this apply to the en_core_web_trf model as well?

polm commented 3 years ago

The NER component is the same regardless of whether you use Transformers or not, so yes.
