explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Noun chunking inconsistency #2451

Closed chozelinek closed 5 years ago

chozelinek commented 6 years ago

The problem

I've realised that doc.noun_chunks sometimes yields a noun chunk that is embedded in a longer one. So far I have only seen this behaviour in a few examples involving relative clauses with "which".

Take the sentence "Including equity share of refineries in which the Group has a stake."

"the Group" and "in which the Group has a stake" are marked as noun chunks. But this does not happen normally. I put below a few examples so you can reproduce and study this.

How to reproduce the behaviour

import spacy
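# Requires the medium English model; install it first with:
#   python -m spacy download en_core_web_md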
nlp = spacy.load('en_core_web_md')

text0 = "American company listed on NASDAQ in which the Group holds a 23.51% interest as of December 31, 2016."
text1 = "Including equity share of refineries in which the Group has a stake."
text2 = "Prices for oil and natural gas may fluctuate widely due to many\nfactors over which TOTAL has no control."
text3 = "This\nscope, which is different from the “operated domain” mentioned\nabove, includes all the assets in which the Group has a financial\ninterest or rights to production.\n "
text4 = "GHG emissions are also published on an equity interest basis, i.e.,\nby consolidating the Group share of the emissions of all assets in\nwhich the Group has a financial interest or rights to production.\n "
text5 = "From this profit, minus prior losses, if any, the following items are\ndeducted in the order indicated:\n 1) 5% to constitute the legal reserve fund, until said fund reaches\n10% of the share capital;\n 2) the amounts set by the Shareholders’ Meeting to fund reserves\nfor which it determines the allocation or use; and\n 3) the amounts that the Shareholders’ Meeting decides to retain.\n "

texts = [text0, text1, text2, text3, text4, text5]

for i, t in enumerate(texts):
    print('# Noun chunks in text {}:'.format(i))
    doc = nlp(t)
    for np in doc.noun_chunks:
        print(np)
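To surface the overlaps directly rather than eyeballing the printed chunks, here is a quick sketch (assuming the same model and parses as above) that compares the start and end token offsets of each pair of chunks, using text1 as input:

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Including equity share of refineries in which the Group has a stake.")

chunks = list(doc.noun_chunks)
for i, a in enumerate(chunks):
    for b in chunks[i + 1:]:
        # Span.start and Span.end are token offsets; two spans overlap
        # if each one starts before the other ends
        if a.start < b.end and b.start < a.end:
            print('Overlapping chunks: {!r} / {!r}'.format(a.text, b.text))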


ines commented 5 years ago

The noun chunks depend on the part-of-speech tags and dependency parse, so this issue likely comes down to incorrect predictions made by the tagger or parser.
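If you want to check whether the tagger or parser is at fault for a given sentence, a minimal sketch is to print the per-token predictions that the noun-chunk iterator consumes (using text1 as an example):

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("Including equity share of refineries in which the Group has a stake.")

# The English noun-chunk iterator walks the dependency parse, so the
# tag, dependency label and head of each token show where a chunk
# boundary comes from
for token in doc:
    print('{:<12} {:<6} {:<10} {}'.format(
        token.text, token.tag_, token.dep_, token.head.text))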

I'm merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.