chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Wrong span for compounds? #264

Open ghost opened 5 years ago

ghost commented 5 years ago

Hey there, I was wondering whether get_span_for_compound_noun(noun) is correctly implemented. Iterating over the lefts won't work for triple (or longer) compounds, because the tokens aren't all rooted at the same noun but attach in a cascading fashion; for example, it would not capture the compound noun "pool table stick". This is what I ended up doing:

def get_span_for_compound_noun(noun):
    """
    Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end_i = noun.i
    start_i = end_i
    lefts = list(noun.lefts)
    # Walk down the cascade: the next compound token is the
    # last left child of the current one.
    while lefts and lefts[-1].dep_ == "compound":
        start_i -= 1
        lefts = list(lefts[-1].lefts)
    return (start_i, end_i)

Technically, for it to be a proper span, it should probably return end_i + 1 so that the end index is exclusive?
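It should, if callers are meant to slice with the result: Python slicing, like spaCy's doc[start:end], is end-exclusive, so an inclusive end index has to be bumped by one. A quick illustration with plain strings standing in for tokens (the sentence and indexes are illustrative only):

```python
# Plain strings standing in for spaCy tokens.
tokens = ["The", "pool", "table", "stick", "broke"]
start_i, end_i = 1, 3  # inclusive indexes of "pool table stick"

# Slicing is end-exclusive, so an inclusive end must be bumped by one.
compound = tokens[start_i:end_i + 1]
print(compound)  # ['pool', 'table', 'stick']
```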

bdewilde commented 5 years ago

Hey @ardeego, thanks for reporting! I'll fix this. Just to be sure I get this right, though, could you provide a few example sentences containing both 2- and 3+ word compound nouns? (I'm having a hard time coming up with good examples to test...)

bdewilde commented 5 years ago

Another reason for more examples is that I don't think I'm seeing the issue you mentioned in the first example I came up with:

>>> doc = nlp("The cherry blossom trees were in full bloom.")
>>> for tok in doc:
...     print(tok, "=>", [(l, l.dep_) for l in tok.lefts])
The => []
cherry => []
blossom => []
trees => [(The, 'det'), (cherry, 'compound'), (blossom, 'compound')]
were => [(trees, 'nsubj')]
in => []
full => []
bloom => [(full, 'amod')]
. => []

You can see that both "cherry" and "blossom" are in the lefts of "trees".

ghost commented 5 years ago

Hi, I think for the most part it gets it right, but I've seen it fail here:

doc = nlp(u"The microsoft browser program window shadow isn't looking very good on the screen")
for tok in doc:
    print(tok, "=>", [(l, l.dep_) for l in tok.lefts])
The => []
microsoft => []
browser => [(microsoft, 'compound')]
program => []
window => [(browser, 'compound'), (program, 'compound')]
shadow => [(The, 'det'), (window, 'compound')]
is => []
n't => []
looking => [(shadow, 'nsubj'), (is, 'aux'), (n't, 'neg')]
very => []
good => [(very, 'advmod')]
on => []
the => []
screen => [(the, 'det')]

I used this to better visualize the issue:

from spacy import displacy
displacy.render(doc, style="dep")

Most consistently, it happens when a proper noun is part of the compound. Does this parse the same way for you? Otherwise, this might be an issue with spaCy.

ghost commented 5 years ago

By the way, the code I posted earlier is not a fix either, because it misses the cases where your code correctly skipped over non-compound lefts.

bdewilde commented 5 years ago

Mine actually parses differently. (I'm using the small English 2.1 model; which one are you using?)

The => []
microsoft => []
browser => [(microsoft, 'amod')]
program => [(browser, 'compound')]
window => [(program, 'compound')]
shadow => [(The, 'det'), (window, 'compound')]
is => []
n't => []
looking => [(shadow, 'nsubj'), (is, 'aux'), (n't, 'neg')]
very => []
good => [(very, 'advmod')]
on => []
the => []
screen => [(the, 'det')]

I'm honestly not sure how to proceed. If the parser were 100% correct, would we expect all nouns in a single compound noun to be in the .lefts of the primary noun? If not, what's the expected form of the parse?

ghost commented 5 years ago

I was using the large English model.

I'm not sure how to proceed either, nor am I sure what 100% correct would mean for a compound. I thought going over the lefts of a noun chunk would be sufficient, until I found several examples where it was not. I have seen a whole lot of combinations of compound dependency arrangements. I guess the full compound would be the longest "compound dependency path" in the parse tree. I'm a bit on the fence about this, though, as I wonder whether a longer compound noun is necessarily better than a partial one. If the compound gets too long, it becomes very specific and very rare over the corpus, so I wonder how much downstream value such a compound would serve. Maybe @honnibal can advise on the parsing correctness?
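One way to read that "longest compound dependency path" idea: from any token whose dep_ is "compound", follow .head until the arc is no longer a compound; the token you land on heads the full chain. A minimal sketch with stand-in objects (SimpleNamespace here is test scaffolding, not spaCy, and the cascading parse of "pool table stick" is assumed rather than produced by a model):

```python
from types import SimpleNamespace

def compound_chain_head(token):
    """Follow compound arcs upward to the noun heading the chain."""
    while token.dep_ == "compound":
        token = token.head
    return token

# Stand-in tokens mimicking a cascading parse of "pool table stick".
stick = SimpleNamespace(i=2, dep_="ROOT")
stick.head = stick  # spaCy roots are their own head
table = SimpleNamespace(i=1, dep_="compound", head=stick)
pool = SimpleNamespace(i=0, dep_="compound", head=table)

print(compound_chain_head(pool).i)  # 2, i.e. "stick"
```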

ghost commented 5 years ago

This is the non-recursive solution I ended up settling on; it handles all the parsing variations of compound nouns I came across. It's not as slick as your itertools solution, but performance should be pretty good, since no generators have to be expanded into lists.

def get_span_for_compound_noun(noun):
    """
    Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        for token in lefts:
            if token.dep_ == "compound":
                # Descend one level down the compound cascade.
                start = token.i
                lefts = token.lefts
                break
        else:
            # No compound modifier at this level; the chain ends here.
            break
    return (start, end)
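To sanity-check the loop above without loading a model, the cascading parse from the "microsoft browser program window shadow" example can be rebuilt with stand-in objects (SimpleNamespace is test scaffolding, not spaCy; lists stand in for the .lefts generators):

```python
from types import SimpleNamespace

def get_span_for_compound_noun(noun):
    """Return inclusive (start, end) document indexes of a compound noun."""
    end = noun.i
    start = end
    lefts = noun.lefts
    while True:
        for token in lefts:
            if token.dep_ == "compound":
                # Descend one level down the compound cascade.
                start = token.i
                lefts = token.lefts
                break
        else:
            break  # no compound modifier at this level
    return (start, end)

# Stand-in tokens mirroring the cascading parse shown above.
the = SimpleNamespace(i=0, dep_="det", lefts=[])
microsoft = SimpleNamespace(i=1, dep_="amod", lefts=[])
browser = SimpleNamespace(i=2, dep_="compound", lefts=[microsoft])
program = SimpleNamespace(i=3, dep_="compound", lefts=[browser])
window = SimpleNamespace(i=4, dep_="compound", lefts=[program])
shadow = SimpleNamespace(i=5, dep_="nsubj", lefts=[the, window])

print(get_span_for_compound_noun(shadow))  # (2, 5)
```

The "amod" arc on "microsoft" correctly stops the walk, so the span covers "browser program window shadow" but excludes the determiner and the adjectival modifier.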

bdewilde commented 5 years ago

I'm still a bit torn about this, but if it works, I'm happy to accept the empirical evidence. :) Could you provide me with a handful of example sentences and their expected outputs? I'll add a test to make sure everything works as expected. Thanks for keeping on this!

ghost commented 5 years ago

Will make some time for it. Thanks.