KennethEnevoldsen / augmenty

Augmenty is an augmentation library based on spaCy for augmenting texts.
https://kennethenevoldsen.github.io/augmenty/
MIT License

IndexError: list index out of range for the documents reconstructed from DocBin #170

Closed: juliamakogon closed this issue 1 year ago

juliamakogon commented 1 year ago

IndexError: list index out of range for documents reconstructed from a DocBin without dependency annotations, in a Kedro pipeline

.../lib/python3.9/site-packages/augmenty/span/entities.py:56 in ent_augmenter_v1          
     53 │   │   │   tok_anno["POS"][i] = ["PROPN"] * len_ent 
     54 │   │   │                                                                             
     55 │   │   │   tok_anno["MORPH"][i] = [""] * len_ent     
❱  56 │   │   │   tok_anno["DEP"][i] = [tok_anno["DEP"][i][0]] + ["flat"] * (len_ent - 1) 
     57 │   │   │    
     58 │   │   │   tok_anno["SENT_START"][i] = [tok_anno["SENT_START"][i][0]] + [0] * (    
     59 │   │   │   │   len_ent - 1   

The augmenter is defined as

    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True, # True or False doesn't change the behaviour
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = augmenty.docs(docs, repeated_augmenter, model)


KennethEnevoldsen commented 1 year ago

Thanks for reporting this @juliamakogon. Is it possible for you to send me an example so that I can reproduce the error? Then I will check whether I handle it.

If you can't, I would love to get the full stack trace (error message) to see whether it is tok_anno["DEP"] that is missing, or where exactly the index error happens (DEP, as I understand it, should always be there).

juliamakogon commented 1 year ago

Hi @KennethEnevoldsen, I will try to be as helpful as possible in solving the problem. The augmenter worked on a subset of the data before the crash. It's a Kedro project that uses the model with a custom tokenizer, so "M. Jones" => ["M", ".", "Jones"]. That's why in ent_dict I use the name "M. Jones" as a single string: trying to use the list of token ORTHs resulted in "M . Jones" in the resulting examples. The augmented data are meant to be saved as .jsonl, so having the person's name as one token shouldn't be a problem.
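As an aside, a minimal sketch of that spacing effect (whether augmenty rebuilds the tokens exactly this way internally is a guess, but spaCy's default Doc construction reproduces it):

    # Rebuilding ["M", ".", "Jones"] into a Doc with spaCy's default
    # whitespace flags (all True) yields the unwanted "M . Jones" spacing:
    from spacy.tokens import Doc
    from spacy.vocab import Vocab

    doc = Doc(Vocab(), words=["M", ".", "Jones"])  # spaces default to True
    print(doc.text)  # "M . Jones "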

I have a Kedro node with the following code:

from itertools import islice
from typing import Iterable, Optional, Union

import augmenty
from spacy.tokens import DocBin


def augment_ner_documents(model,
                          docbin: DocBin,
                          ents_as_str: Iterable[str],
                          n_documents: Optional[int] = None,
                          n_repeat: int = 1,
                          level: Union[int, float] = 1,
                          label: Optional[str] = None,
                          ) -> DocBin:
    n_repeat = n_repeat if n_repeat >= 1 else 1
    if n_documents is not None and n_documents <= 0:
        n_documents = None  # islice(..., None) means "no limit"
    docbin = ensure_docbin(model, docbin)  # project helper, described below
    # keep only docs containing at least one entity with the requested label
    docs = list(
        islice((doc for doc in docbin.get_docs(model.vocab)
                if any(e[0].ent_type_ == label for e in doc.ents)),
               n_documents))

    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = augmenty.docs(docs, repeated_augmenter, model)

    result = DocBin(attrs=["ENT_IOB", "ENT_TYPE"], docs=augmented_docs)

    return result

The function ensure_docbin here converts text or JSON input files to a DocBin, as a workaround for Kedro. Its main block returns

DocBin(store_user_data=True, docs=docs)

where the docs are constructed with make_doc followed by doc.ents = some_ner_spans.
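A minimal sketch of that construction (the text and span offsets here are made up for illustration):

    import spacy
    from spacy.tokens import DocBin, Span

    nlp = spacy.blank("en")
    doc = nlp.make_doc("SS Brandon Crawford")    # tokenize only, no pipeline
    doc.ents = [Span(doc, 1, 3, "pers")]         # stand-in for some_ner_spans
    docbin = DocBin(store_user_data=True, docs=[doc])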

The docbin attrs field contains [65, 67, 73, 74, 75, 76, 77, 78, 79, 80, 452, 453, 454] => ['ORTH', 'NORM', 'LEMMA', 'POS', 'TAG', 'DEP', 'ENT_IOB', 'ENT_TYPE', 'HEAD', 'SENT_START', 'ENT_KB_ID', 'MORPH', 'ENT_ID']
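(As a side note, those numeric IDs can be mapped back to names with spaCy's spacy.attrs.IDS table; a small sketch:)

    from spacy.attrs import IDS  # spaCy's name -> attribute-ID mapping

    id_to_name = {attr_id: name for name, attr_id in IDS.items()}
    print([id_to_name[a] for a in docbin.attrs])
    # ['ORTH', 'NORM', 'LEMMA', 'POS', 'TAG', 'DEP', ...]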

When the error occurs in ent_augmenter_v1, tok_anno["DEP"][i] is an empty list:

i = slice(14, 16, None)

tok_anno = {'ORTH': ['Melvin R. Brown', 'and', 'Thairo', 'Kristiina Mäkelä', 'concussion', 'protocol', ')', 'are', 'each', 'progressing', '.', 'SS', 'Brandon', 'Crawford', 'Mark Folkard'], 'SPACY': [True, True, True, False, True, False, True, True, True, False, True, True, True, False], 'TAG': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'LEMMA': ['Melvin R. Brown', '', '', 'Kristiina Mäkelä', '', '', '', '', '', '', '', '', '', '', 'Mark Folkard'], 'POS': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'MORPH': ['', '', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'DEP': ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]}

ent = "Brandon Crawford"

ents = ['U-pers', 'O', 'B-pers', 'U-pers', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-pers', 'L-pers']

example="Joc Pederson and Thairo Estrada (concussion protocol) are each progressing. SS Brandon Crawford"

example_dict={'doc_annotation': {'cats': {}, 'entities': ['U-pers', 'O', 'B-pers', 'U-pers', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-pers', 'L-pers'], 'spans': {}, 'links': {}}, 'token_annotation': {'ORTH': ['Melvin R. Brown', 'and', 'Thairo', 'Kristiina Mäkelä', 'concussion', 'protocol', ')', 'are', 'each', 'progressing', '.', 'SS', 'Brandon', 'Crawford', 'Mark Folkard'], 'SPACY': [True, True, True, False, True, False, True, True, True, False, True, True, True, False], 'TAG': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'LEMMA': ['Melvin R. Brown', '', '', 'Kristiina Mäkelä', '', '', '', '', '', '', '', '', '', '', 'Mark Folkard'], 'POS': ['PROPN', '', '', 'PROPN', '', '', '', '', '', '', '', '', '', '', 'PROPN'], 'MORPH': ['', '', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'DEP': ['', '', '', '', '', '', '', '', '', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]}}

len_ent = 1

level = 1

new_value = ['Mark Folkard']

offset = 0

replaced_ents = {'Brandon Crawford': ['Mark Folkard'], 'Joc Pederson': ['Melvin R. Brown'], 'Thairo Estrada': ['Kristiina Mäkelä']}
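Note, from the dump above: ORTH has 15 entries, while DEP and SENT_START have only 14, and i = slice(14, 16). Slicing past the end of a list silently yields an empty list, and indexing [0] on that is what raises the IndexError. A minimal plain-Python reproduction:

    dep = [""] * 14      # DEP in the dump has 14 entries; ORTH has 15
    i = slice(14, 16)    # the entity's token range, as reported above
    print(dep[i])        # [] -- an out-of-range slice does not raise
    dep[i][0]            # IndexError: list index out of range
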
KennethEnevoldsen commented 1 year ago

It seems like you repeat the augmenter one time. However, it should already replace 100% of the entities, so doing it twice shouldn't do anything. It shouldn't cause a problem either, though (!).

I tried to reproduce the error:

import spacy
from spacy.training import Example

import augmenty

# pipeline without a parser
nlp = spacy.load("en_core_web_sm", disable=["parser"])
doc = nlp("My name is Kenneth. This is a test.")

e = Example(doc, doc)
print(e.to_dict()["token_annotation"]["DEP"])
# ['', '', '', '', '', '', '', '', '', '']
# does contain DEP as a list

print(doc.ents)
# (Kenneth,)
doc.ents[0].label_
# 'PERSON'

aug = augmenty.load(
    "ents_replace_v1",
    level=1,
    ent_dict={"PERSON": [["Mr", ".", "Black"], ["t"]]},
    replace_consistency=True,
    resolve_dependencies=True,
)
aug = augmenty.repeat(augmenter=aug, n=1)
augmented_docs = augmenty.docs([doc], augmenter=aug, nlp=nlp)

aug_doc = list(augmented_docs)[0]

# recreate the example
e = Example(aug_doc, aug_doc)

print(e.to_dict()["token_annotation"]["DEP"])  # works just fine
# ["", "", "", "", "flat", "flat", "", "", "", "", "", ""]

But I can't reproduce any case where DEP is not a list.

I have updated the function in augmenty, so you might want to try the new version. It might solve your problem (but will probably just give us a better error message).

If you can give me a full example which fails I can try to run that as well.

juliamakogon commented 1 year ago

Thanks! I'll try the new version of the function. I hope I'll have time today to make the case more reproducible. Line 101 in https://github.com/KennethEnevoldsen/augmenty/blob/main/src/augmenty/span/entities.py bothers me, from a reproducibility standpoint, because of the custom tokenizer we have in the project:

    text = make_text_from_orth(example_dict)
    doc = nlp.make_doc(text)

Looks like I should check on the standard en_core_web_sm too.

Could you please clarify why you think it's a double use of the augmenter in my code?

    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = augmenty.docs(docs, repeated_augmenter, model)
KennethEnevoldsen commented 1 year ago

Line 101 in https://github.com/KennethEnevoldsen/augmenty/blob/main/src/augmenty/span/entities.py bothers me, from a reproducibility standpoint, because of the custom tokenizer we have in the project:

But shouldn't this use the tokenizer in the specified model? Isn't that the desired behaviour, or am I missing something?

Could you please clarify why you think it's a double use of the augmenter in my code?

Sorry, I seem to have confused myself (the docstring for the augmenty.repeat function was slightly misleading; I have fixed it now). Hopefully it is clearer now.

What I was thinking it did was repeatedly apply the same augmenter, each time to the output of the previous pass, i.e.

aug_doc = augmenty.docs([doc], augmenter, model)
aug_aug_doc = augmenty.docs([aug_doc], augmenter, model)

but what it actually does (which is what it should do; it was simply a brainfart on my side) is:

aug_doc = augmenty.docs([doc], augmenter, model)
one_more_aug_doc = augmenty.docs([doc], augmenter, model)

This makes it great for upsampling certain entities.
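A quick sketch of those semantics (upper_case_v1 is used here only as a stand-in augmenter): repeat(n=2) yields two independently augmented variants of each input doc, not a doubly augmented doc.

    import spacy
    import augmenty

    nlp = spacy.blank("en")
    aug = augmenty.load("upper_case_v1", level=1.0)
    docs = [nlp("A small example document.")]
    out = list(augmenty.docs(docs, augmenty.repeat(augmenter=aug, n=2), nlp))
    assert len(out) == 2  # one input doc -> two augmented variants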

juliamakogon commented 1 year ago

I tried the new version of augmenty. It looks like the problem is now in the following line:

tok_anno["SENT_START"][i] = [tok_anno["SENT_START"][i][0]] + [0] * (
                    len_ent - 1
                )

I made unit tests; both tests with the augmenter crash:

import pytest
import spacy
import augmenty
from spacy.tokens import DocBin, Span

@pytest.fixture
def nlp():
    return spacy.blank("en")

@pytest.fixture()
def sentencizer(nlp):
    return nlp.create_pipe("sentencizer")

@pytest.fixture
def docbin_no_dep(nlp) -> DocBin:
    text = "Joc Pederson and Thairo Estrada (concussion protocol) are each progressing. SS Brandon Crawford"
    doc = nlp.make_doc(text)
    doc.ents = [Span(doc, 0, 2, "pers"), Span(doc, 3, 5, "pers"), Span(doc, 14, 16, "pers")]
    docbin_no_dep_ = DocBin(store_user_data=True, docs=[doc])
    return docbin_no_dep_

def test_smoke_docbin_no_dep(nlp, docbin_no_dep: DocBin):
    doc = list(docbin_no_dep.get_docs(nlp.vocab))[0]
    assert [t.text for t in doc] == ["Joc", "Pederson", "and", "Thairo", "Estrada", "(", "concussion", "protocol", ")", "are", "each", "progressing", ".", "SS", "Brandon", "Crawford"]
    assert [e.text for e in doc.ents] == ["Joc Pederson", "Thairo Estrada", "Brandon Crawford"]

def test_smoke_docbin_no_dep_sent(nlp, sentencizer, docbin_no_dep: DocBin):
    doc = list(docbin_no_dep.get_docs(nlp.vocab))[0]
    doc = sentencizer(doc)
    assert len(list(doc.sents)) == 2

def test_augmenty_dependency_bug(nlp, docbin_no_dep: DocBin):
    level = 1.
    n_repeat = 3
    docs = list(docbin_no_dep.get_docs(nlp.vocab))
    ents_as_str = ['Melvin R. Brown']
    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = list(augmenty.docs(docs, repeated_augmenter, nlp))
    assert augmented_docs

def test_augmenty_dependency_bug_with_sent(nlp, sentencizer, docbin_no_dep: DocBin):
    level = 1.
    n_repeat = 3
    docs = list([sentencizer(doc) for doc in docbin_no_dep.get_docs(nlp.vocab)])
    # ents_as_str = ['Mark Folkard', 'Melvin R. Brown', 'Kristiina Mäkelä']
    ents_as_str = ['Melvin R. Brown']
    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": [[s] for s in ents_as_str]},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )
    repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
    augmented_docs = list(augmenty.docs(docs, repeated_augmenter, nlp))
    assert augmented_docs
KennethEnevoldsen commented 1 year ago

Thanks for supplying the tests @juliamakogon; they make it very easy to fix the errors.

It turned out that the main problem was that the offset (you have to keep track of it, since you replace multiple entities at the same time) was only calculated if the entities had the "HEAD" annotation (i.e. if the pipeline includes dependency parsing).

There is still the problem of setting multi-token entities. E.g. say you want to set "Melvin R. Brown": you can either set it as one token, "Melvin R. Brown", which is not really what you want, or set it as multiple tokens, in which case it becomes "Melvin R . Brown" (with the space).

I was considering fixing this by simply using the nlp object to tokenise the string. That could be problematic, as the string might be tokenised differently depending on context (though I can't think of an edge case), so it is probably a reasonable assumption. If the user wants more control, they could supply the entity as a Span?

KennethEnevoldsen commented 1 year ago

I was considering fixing this by simply using the nlp object to tokenise the string. That could be problematic, as the string might be tokenised differently depending on context (though I can't think of an edge case), so it is probably a reasonable assumption. If the user wants more control, they could supply the entity as a Span?

I actually just added that functionality, so now you can simply do:

    augmenter = augmenty.load("ents_replace_v1",
                              level=level,
                              ent_dict={"pers": ['Melvin R. Brown']},
                              replace_consistency=True,
                              resolve_dependencies=True
                              )

Then "Melvin R. Brown" will be tokenised using your specified tokenizer. If you pass it in as a Span it will also transfer everything except for the dependency tree.

juliamakogon commented 1 year ago

Thanks for fixing the issue! Now all's well. The last, somewhat nitpicky, suggestion is to allow entities in ent_dict to be a Doc, not only a Span. I just don't see a scenario where one has a ready-made Span there; it looks to me like ents = [nlp.make_doc(s)[:] for s in ents_as_str] is the most common use. (OK, a surname is a Span, but then we probably have a Doc for the full name too.)

KennethEnevoldsen commented 1 year ago

Are you sure it does not allow Docs as is? (I know it is not type-hinted for it, but I am not sure I use any Span-specific features.)

Edit: Ah, no it does not. I will add that support. Edit 2: it is now added.
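With that in place, the usage described above presumably becomes (a sketch, assuming the newly added Doc support mirrors the Span variant):

    # ent_dict entries as Docs, created straight from strings with the
    # project's own tokenizer:
    ents_as_str = ["Melvin R. Brown", "Kristiina Mäkelä"]
    augmenter = augmenty.load(
        "ents_replace_v1",
        level=1.0,
        ent_dict={"pers": [nlp.make_doc(s) for s in ents_as_str]},
    )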