Closed · user06039 closed this issue 3 years ago

I am trying to do entity recognition with spaCy v3. In my config file, under `[corpora.train]`, I found a setting called `max_length = 2000`. Does this mean it will truncate a sentence if it is longer than 2000 words? In my dataset, each document is 1000-5000 words and I don't want to truncate anything. Do I have to change any parameter in the config file to get better results when doing NER on such long documents?

There are no proper examples of use-case-based config file changes, please help me out.
Here are the docs for `spacy.Corpus.v1`: https://spacy.io/api/corpus
You can also write a custom corpus loader if you need different options.
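For what a custom loader could look like, here is a minimal sketch of a registered reader (the name `my_corpus.v1`, the `path` argument, and the use of a serialized `.spacy` `DocBin` file are all assumptions for illustration, not anything prescribed by this issue):

```python
from typing import Callable, Iterable, Iterator

import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


@spacy.registry.readers("my_corpus.v1")  # hypothetical name for this sketch
def create_my_corpus(path: str) -> Callable[[Language], Iterable[Example]]:
    def read_corpus(nlp: Language) -> Iterator[Example]:
        # Load the gold annotations from a serialized DocBin (.spacy) file.
        doc_bin = DocBin().from_disk(path)
        for reference in doc_bin.get_docs(nlp.vocab):
            # Pair a fresh, unannotated doc (the prediction) with the gold reference.
            yield Example(nlp.make_doc(reference.text), reference)

    return read_corpus
```

The config would then reference it under `[corpora.train]` with `@readers = "my_corpus.v1"` instead of `spacy.Corpus.v1`.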
@adrianeboyd If I set `max_length = 0`, does it affect model accuracy? If the corpus reader splits my document into sentences, will it later concatenate the embeddings of each sentence back into one document for better NER predictions? I am not able to understand the advantages or disadvantages of `max_length = 0`.
We generally avoid truncating the inputs at all costs, preferring pretty much any other solution. Truncated inputs aren't real text, which is especially bad for the parser, but also bad for other components.

The main reason the `max_length` option exists is to avoid memory problems, which is especially relevant for transformer models on GPU. The `max_length` setting allows you to prevent long inputs from blowing up your training.
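For reference, the setting lives in the training config under `[corpora.train]`; a representative excerpt (the values shown are the standard defaults from `init config`, with `max_length` set to 0 as discussed here, not anything specific to this issue):

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
limit = 0
# 0 disables the length filter entirely; a positive value is a token count
max_length = 0
```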
@honnibal An example: if `max_length = 4` and the annotated text is

```
O  O    O  B-NAME I-NAME O   O O    O  U-COMPANY
My name is John   Mat    and I work at Google
```

then the split happens like this:

```
O  O    O  B-NAME
My name is John

I-NAME O   O O    O
Mat    and I work at

U-COMPANY
Google
```

Does this kind of problem happen if we decide to split a document into sentences based on `max_length`? Can it split a named entity across two different sentences? Does spaCy do something to take care of this issue?
You can probably use `max_length = 0` with your data without any issues. However, if it runs out of memory while training, you might need to come back to this setting.

`max_length` does not split up or truncate sentences, because we do not think this is a sensible thing to do. Instead, if the doc is too long, it tries to use individual sentences from the doc instead, and if those are too long, it skips them entirely. Please try it out with your own corpus to see! For reference, the relevant code is here:
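As a rough illustration of that behavior (a simplified sketch, not spaCy's actual implementation; `filter_docs` is a made-up name):

```python
from typing import Iterable, Iterator

from spacy.tokens import Doc


def filter_docs(docs: Iterable[Doc], max_length: int) -> Iterator[Doc]:
    for doc in docs:
        if max_length == 0 or len(doc) < max_length:
            # Short enough (or the filter is disabled): keep the whole doc.
            yield doc
        elif doc.has_annotation("SENT_START"):
            # Too long: fall back to the doc's individual sentences...
            for sent in doc.sents:
                if len(sent) < max_length:
                    yield sent.as_doc()
            # ...and any sentence that is still too long is skipped entirely.
```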
Be aware that the NER component does not predict entities across sentence boundaries, either. If your pipeline has a component that sets sentence boundaries before `ner` (`sentencizer`, `senter`, `parser`, etc.), this can affect the results.

(As a side note, in case you decide to implement your own corpus reader: `is_sentenced` is deprecated and should be replaced with `has_annotation("SENT_START")`.)
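For example (a minimal, self-contained snippet; the example text is reused from above):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # sets sentence boundaries

doc = nlp("My name is John Mat. I work at Google.")

# Deprecated: `if doc.is_sentenced:`
if doc.has_annotation("SENT_START"):
    print([sent.text for sent in doc.sents])
```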
Does this apply to the `en_core_web_trf` model as well?
The NER component is the same regardless of whether you use Transformers or not, so yes.
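One way to see this yourself (an illustrative check, assuming both `en_core_web_sm` and `en_core_web_trf` are installed; the non-transformer pipeline is my addition for comparison):

```python
import spacy

for name in ("en_core_web_sm", "en_core_web_trf"):
    nlp = spacy.load(name)
    # Both pipelines use the same EntityRecognizer component;
    # only the underlying token-to-vector model differs.
    print(name, type(nlp.get_pipe("ner")).__name__)
```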
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.