Closed juliamakogon closed 1 year ago
Thanks for reporting this @juliamakogon! Is it possible for you to send me an example so that I can reproduce the error? Then I will check whether I handle it.
If you can't, I would love to get the full stacktrace (error message) to see whether it is tok_anno["DEP"] that is missing, or where the index error happens (DEP, as I understand it, should always be there).
Hi @KennethEnevoldsen, I will try to be as useful as possible in solving the problem. The augmenter worked on a subset of data before the crash. It's a Kedro project that uses the model with a custom tokenizer, so "M. Jones" => ["M", ".", "Jones"]. That's why in ent_dict I use the name "M. Jones" as a whole string. Trying to use the list of token ORTHs resulted in "M . Jones" in the resulting examples. The augmented data are planned to be saved as a .jsonl, so the person's name as one token shouldn't be a problem.
I have a Kedro node with the following code:
def augment_ner_documents(model,
                          docbin: DocBin,
                          ents_as_str: Iterable[str],
                          n_documents: int = None,
                          n_repeat: int = 1,
                          level: Union[int, float] = 1,
                          label: str = None,
                          ) -> DocBin:
    n_repeat = n_repeat if n_repeat >= 1 else 1
    if n_documents is not None and n_documents <= 0:
        n_documents = None
    docbin = ensure_docbin(model, docbin)
    docs = list(
        islice([doc for doc in docbin.get_docs(model.vocab)
                if any(e[0].ent_type_ == label for e in doc.ents)],
               n_documents))
    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = augmenty.docs(docs, repeated_augmenter, model)
    result = DocBin(attrs=["ENT_IOB", "ENT_TYPE"], docs=augmented_docs)
    return result
The function ensure_docbin here converts text or JSON input files to a DocBin, as a workaround for Kedro. Its main block returns
DocBin(store_user_data=True, docs=docs)
where docs are constructed with make_doc + doc.ents = some_ner_spans.
docbin attrs field contains [65, 67, 73, 74, 75, 76, 77, 78, 79, 80, 452, 453, 454] => ['ORTH', 'NORM', 'LEMMA', 'POS', 'TAG', 'DEP', 'ENT_IOB', 'ENT_TYPE', 'HEAD', 'SENT_START', 'ENT_KB_ID', 'MORPH', 'ENT_ID']
When the error occurs in ent_augmenter_v1, tok_anno["DEP"][i] is an empty list:
i = slice(14, 16, None)
tok_anno = {'ORTH': ['Melvin R. Brown', 'and', 'Thairo', 'Kristiina Mäkelä', 'concussion', 'protocol', ')', 'are', 'each', 'progressing', '.', 'SS', 'Brandon', 'Crawford', 'Mark Folkard'], 'SPACY': [True, True, True, False, True, False, True, True, True, False, True, True, True, False], 'TAG': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'LEMMA': ['Melvin R. Brown', '', '', 'Kristiina Mäkelä', '', '', '', '', '', '', '', '', '', '', 'Mark Folkard'], 'POS': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'MORPH': ['', '', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'DEP': ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]}
ent = "Brandon Crawford"
ents = ['U-pers', 'O', 'B-pers', 'U-pers', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-pers', 'L-pers']
example="Joc Pederson and Thairo Estrada (concussion protocol) are each progressing. SS Brandon Crawford"
example_dict={'doc_annotation': {'cats': {}, 'entities': ['U-pers', 'O', 'B-pers', 'U-pers', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-pers', 'L-pers'], 'spans': {}, 'links': {}}, 'token_annotation': {'ORTH': ['Melvin R. Brown', 'and', 'Thairo', 'Kristiina Mäkelä', 'concussion', 'protocol', ')', 'are', 'each', 'progressing', '.', 'SS', 'Brandon', 'Crawford', 'Mark Folkard'], 'SPACY': [True, True, True, False, True, False, True, True, True, False, True, True, True, False], 'TAG': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'LEMMA': ['Melvin R. Brown', '', '', 'Kristiina Mäkelä', '', '', '', '', '', '', '', '', '', '', 'Mark Folkard'], 'POS': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'MORPH': ['', '', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'DEP': ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]}}
len_ent = 1
level = 1
new_value = ['Mark Folkard']
offset = 0
replaced_ents = {'Brandon Crawford': ['Mark Folkard'], 'Joc Pederson': ['Melvin R. Brown'], 'Thairo Estrada': ['Kristiina Mäkelä']}
Seems like you repeat the augmenter only once. However, it should already replace 100% of the entities, so doing it twice shouldn't do anything. Though it shouldn't cause a problem either (!).
I tried to reproduce the error:
import spacy
from spacy.training import Example
import augmenty
# pipeline without a parser
nlp = spacy.load("en_core_web_sm", disable=["parser"])
doc = nlp("My name is Kenneth. This is a test.")
e = Example(doc, doc)
print(e.to_dict()["token_annotation"]["DEP"])
# ['', '', '', '', '', '', '', '', '', '']
# does contain DEP as a list
print(doc.ents)
# (Kenneth,)
doc.ents[0].label_
# 'PERSON'
aug = augmenty.load(
"ents_replace_v1",
level=1,
ent_dict={"PERSON": [["Mr", ".", "Black"], ["t"]]},
replace_consistency=True,
resolve_dependencies=True,
)
aug = augmenty.repeat(augmenter=aug, n=1)
augmented_docs = augmenty.docs([doc], augmenter=aug, nlp=nlp)
aug_doc = list(augmented_docs)[0]
# recreate the example
e = Example(aug_doc, aug_doc)
print(e.to_dict()["token_annotation"]["DEP"]) # works just fine
# ["", "", "", "", "flat", "flat", "", "", "", "", "", ""]
But I can't reproduce any case where DEP is not a list.
I have updated the function in augmenty, so you might want to try the new version; it might solve your problem (but will probably just give us a better error message).
If you can give me a full example which fails I can try to run that as well.
Thanks! I'll try the new version of the function. I hope I'll have time today to make the case more reproducible. Line 101 in https://github.com/KennethEnevoldsen/augmenty/blob/main/src/augmenty/span/entities.py bothers me, because of the custom tokenizer we have in the project, from the point of view of reproducibility:
text = make_text_from_orth(example_dict)
doc = nlp.make_doc(text)
Looks like I should check on the standard en_core_web_sm too.
Could you please clarify why you think it's a double use of the augmenter in my code?
augmenter = augmenty.load("ents_replace_v1",
                          level=level,
                          ent_dict={"pers": [[s] for s in ents_as_str]},
                          replace_consistency=True,
                          resolve_dependencies=True
                          )
repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
augmented_docs = augmenty.docs(docs, repeated_augmenter, model)
Line 101 in https://github.com/KennethEnevoldsen/augmenty/blob/main/src/augmenty/span/entities.py bothers me, because of the custom tokenizer we have in the project, from the point of view of reproducibility:
But shouldn't this use the tokenizer in the specified model? Isn't that the desired behaviour, or am I missing something?
Could you please clarify why you think it's a double use of the augmenter in my code?
Sorry, I seem to have confused myself (the docstring for the augmenty.repeat function was slightly misleading; I have fixed it now). Hopefully it is clearer now.
What I thought it did was repeatedly apply the same augmenter to the same documents, i.e.
aug_doc = augment.docs([doc], augmenter, model)
aug_aug_doc = augment.docs([aug_doc], augmenter, model)
but what it actually does is (which it should; it was simply a brainfart from my side):
aug_doc = augment.docs([doc], augmenter, model)
one_more_aug_doc = augment.docs([doc], augmenter, model)
Which makes it great for upsampling certain entities.
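The difference can be sketched in plain Python (repeat_sketch and the lambda are hypothetical names for illustration, not augmenty's implementation): every pass re-reads the original docs, so n passes yield n independently augmented copies rather than augmentations of augmentations.

```python
# Hypothetical sketch of the behaviour of augmenty.repeat(augmenter, n=n):
# each pass starts from the ORIGINAL docs, so the output is n independent
# augmented copies, not the result of chaining the augmenter n times.
def repeat_sketch(docs, augment_fn, n):
    out = []
    for _ in range(n):
        out.extend(augment_fn(d) for d in docs)  # always from the originals
    return out

docs = ["doc1", "doc2"]
mark = lambda d: d + "*"
print(repeat_sketch(docs, mark, 2))
# ['doc1*', 'doc2*', 'doc1*', 'doc2*']
```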
I tried the new version of augmenty. It looks like the problem is now in the next line:
tok_anno["SENT_START"][i] = [tok_anno["SENT_START"][i][0]] + [0] * (
    len_ent - 1
)
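The crash mechanics can be reproduced with plain lists: in the dump above, SENT_START has 14 entries while the entity occupies token positions 14-16, so the slice is silently empty and indexing it with [0] raises the IndexError.

```python
# Plain-Python illustration of the failure mode seen in the debug dump.
sent_start = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # only 14 entries
i = slice(14, 16)           # the entity sits at token positions 14-15
print(sent_start[i])        # [] -- slicing past the end returns an empty list
try:
    sent_start[i][0]        # taking [0] of the empty slice is what crashes
except IndexError as err:
    print(err)              # list index out of range
```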
I made a unit test; both tests with the augmenter crash:
import pytest
import spacy
import augmenty
from spacy.tokens import DocBin, Span


@pytest.fixture
def nlp():
    return spacy.blank("en")


@pytest.fixture()
def sentencizer(nlp):
    return nlp.create_pipe("sentencizer")


@pytest.fixture
def docbin_no_dep(nlp) -> DocBin:
    text = "Joc Pederson and Thairo Estrada (concussion protocol) are each progressing. SS Brandon Crawford"
    doc = nlp.make_doc(text)
    doc.ents = [Span(doc, 0, 2, "pers"), Span(doc, 3, 5, "pers"), Span(doc, 14, 16, "pers")]
    docbin_no_dep_ = DocBin(store_user_data=True, docs=[doc])
    return docbin_no_dep_


def test_smoke_docbin_no_dep(nlp, docbin_no_dep: DocBin):
    doc = list(docbin_no_dep.get_docs(nlp.vocab))[0]
    assert [t.text for t in doc] == ["Joc", "Pederson", "and", "Thairo", "Estrada", "(", "concussion", "protocol", ")", "are", "each", "progressing", ".", "SS", "Brandon", "Crawford"]
    assert [e.text for e in doc.ents] == ["Joc Pederson", "Thairo Estrada", "Brandon Crawford"]


def test_smoke_docbin_no_dep_sent(nlp, sentencizer, docbin_no_dep: DocBin):
    doc = list(docbin_no_dep.get_docs(nlp.vocab))[0]
    doc = sentencizer(doc)
    assert len(list(doc.sents)) == 2


def test_augmenty_dependency_bug(nlp, docbin_no_dep: DocBin):
    level = 1.
    n_repeat = 3
    docs = list(docbin_no_dep.get_docs(nlp.vocab))
    ents_as_str = ['Melvin R. Brown']
    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = list(augmenty.docs(docs, repeated_augmenter, nlp))
    assert augmented_docs


def test_augmenty_dependency_bug_with_sent(nlp, sentencizer, docbin_no_dep: DocBin):
    level = 1.
    n_repeat = 3
    docs = [sentencizer(doc) for doc in docbin_no_dep.get_docs(nlp.vocab)]
    # ents_as_str = ['Mark Folkard', 'Melvin R. Brown', 'Kristiina Mäkelä']
    ents_as_str = ['Melvin R. Brown']
    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = list(augmenty.docs(docs, repeated_augmenter, nlp))
    assert augmented_docs
Thanks for supplying the test @juliamakogon, it makes it very easy to fix the errors.
It turned out that the main problem was that the offset (you have to keep track of it, since you replace multiple entities at the same time) was only calculated if the entities had the "HEAD" annotation (i.e. if the pipeline includes dependency parsing).
There is still the problem of setting multi-token entities. E.g. let's say you want to set "Melvin R. Brown": you can either set it as one token, "Melvin R. Brown", which is not really what you want, or set it as multiple tokens, in which case it becomes "Melvin R . Brown" (with the extra space).
I was considering fixing this by simply using the nlp object to tokenise the string. That could be problematic, as it might tokenise the string differently depending on context (though I can't think of an edge case), so it is probably a reasonable assumption. If the user wants more control, they could supply the entity as a span?
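The stray space is simply a consequence of rebuilding the text from a token list with a space after every token; a toy illustration (not augmenty code):

```python
# Supplying the entity as separate tokens loses the whitespace information,
# so reconstructing the text with a space after every token gives:
tokens = ["Melvin", "R", ".", "Brown"]
print(" ".join(tokens))  # Melvin R . Brown  <- stray space before the period
```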
Actually just added that functionality. So now you can simply do:
augmenter = augmenty.load("ents_replace_v1",
                          level=level,
                          ent_dict={"pers": ['Melvin R. Brown']},
                          replace_consistency=True,
                          resolve_dependencies=True
                          )
Then "Melvin R. Brown" will be tokenised using your specified tokenizer. If you pass it in as a Span, it will also transfer everything except the dependency tree.
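For example, a whole-doc Span built with your own tokenizer can preserve the original surface form; a minimal sketch (the ent_dict line assumes the Span support just described):

```python
import spacy

nlp = spacy.blank("en")
# A Span covering the whole doc keeps the tokenizer's token boundaries and
# whitespace, so the entity's surface form survives intact:
span = nlp.make_doc("Melvin R. Brown")[:]
print(span.text)  # Melvin R. Brown
# ent_dict = {"pers": [span]}   # hypothetical usage with Span support
```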
Thanks for fixing the issue! Now all's well.
The last, somewhat nitpicky, proposition is to allow entities in ent_dict to be a Doc, not only a Span. I just don't see a scenario where one has a ready-made Span there; it looks to me like ents = [nlp.make_doc(s)[:] for s in ents_as_str]
is the most common use. (OK, a surname is a Span, but then we probably have a Doc for the full name too.)
Are you sure it does not allow for Docs as is? (I know it is not type-hinted for it, but I am not sure I use any specific features of the span.)
Edit: Ahh no, it does not. I will add that support.
Edit: it is now added.
IndexError: list index out of range for the documents reconstructed from DocBin without dependency, in kedro pipeline