explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Noun chunking inconsistency #2451

Closed chozelinek closed 5 years ago

chozelinek commented 6 years ago

The problem

I've realised that doc.noun_chunks sometimes yields a noun chunk that is embedded in a longer one. So far I have only seen this behaviour in a few examples involving relative clauses with "which".

Take the sentence "Including equity share of refineries in which the Group has a stake."

"the Group" and "in which the Group has a stake" are marked as noun chunks. But this does not happen normally. I put below a few examples so you can reproduce and study this.

How to reproduce the behaviour

import spacy
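# Requires the medium English model; install it first with:
#   python -m spacy download en_core_web_md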
nlp = spacy.load('en_core_web_md')

text0 = "American company listed on NASDAQ in which the Group holds a 23.51% interest as of December 31, 2016."
text1 = "Including equity share of refineries in which the Group has a stake."
text2 = "Prices for oil and natural gas may fluctuate widely due to many\nfactors over which TOTAL has no control."
text3 = "This\nscope, which is different from the “operated domain” mentioned\nabove, includes all the assets in which the Group has a financial\ninterest or rights to production.\n "
text4 = "GHG emissions are also published on an equity interest basis, i.e.,\nby consolidating the Group share of the emissions of all assets in\nwhich the Group has a financial interest or rights to production.\n "
text5 = "From this profit, minus prior losses, if any, the following items are\ndeducted in the order indicated:\n 1) 5% to constitute the legal reserve fund, until said fund reaches\n10% of the share capital;\n 2) the amounts set by the Shareholders’ Meeting to fund reserves\nfor which it determines the allocation or use; and\n 3) the amounts that the Shareholders’ Meeting decides to retain.\n "

texts = [text0, text1, text2, text3, text4, text5]

for i, t in enumerate(texts):
    print('# Noun chunks in text {}:'.format(i))
    doc = nlp(t)
    for np in doc.noun_chunks:
        print(np)
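To surface the overlaps directly rather than eyeballing the printed chunks, here is a quick sketch (assuming the same model and parses as above) that compares the start and end token offsets of each pair of chunks, using text1 as input:

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Including equity share of refineries in which the Group has a stake.")

chunks = list(doc.noun_chunks)
for i, a in enumerate(chunks):
    for b in chunks[i + 1:]:
        # Span.start and Span.end are token offsets; two spans overlap
        # if each one starts before the other ends
        if a.start < b.end and b.start < a.end:
            print('Overlapping chunks: {!r} / {!r}'.format(a.text, b.text))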


ines commented 5 years ago

The noun chunks depend on the part-of-speech tags and dependency parse, so this issue likely comes down to incorrect predictions made by the tagger or parser.
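If you want to check whether the tagger or parser is at fault for a given sentence, a minimal sketch is to print the per-token predictions that the noun-chunk iterator consumes (using text1 as an example):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Including equity share of refineries in which the Group has a stake.")

# The English noun-chunk iterator walks the dependency parse, so the
# tag, dependency label and head of each token show where a chunk
# boundary comes from
for token in doc:
    print('{:<12} {:<6} {:<10} {}'.format(
        token.text, token.tag_, token.dep_, token.head.text))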

I'm merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.