dmmiller612 / bert-extractive-summarizer

Easy to use extractive text summarization with BERT
MIT License

Neuralcoref doesn't seem to actually be working #33

Closed aced125 closed 3 years ago

aced125 commented 4 years ago

Hey authors,

Great repo so far. One issue: when I run the body from the example (on the Chrysler Building sale) through the neuralcoref code in the repo, it doesn't actually work...

For example, here is what happens when I run the body through neuralcoref and check for coreference clusters:

import neuralcoref
from spacy.lang.en import English

body  =  """
The Chrysler building was sold for ... [COPY AND PASTE EXACT HERE]
"""

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

neuralcoref.add_to_pipe(nlp, greedyness=0.45)
# <spacy.lang.en.English at 0x7fce5d7d4110>

doc = nlp(body)
doc._.has_coref
# False

This is the code currently used in modelprocessors.py.
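One possible explanation, which is an assumption and not something confirmed in this thread, is that neuralcoref relies on annotations from upstream components (tagger, parser, NER) that a blank `English()` pipeline with only a sentencizer does not have. A minimal sketch of a check for that, using plain lists in place of `nlp.pipe_names` so it runs without spaCy installed; `ASSUMED_REQUIRED` and `missing_components` are hypothetical names for illustration:

```python
# Components neuralcoref is ASSUMED to need upstream. This list is a guess
# based on the observation that a full model works and a blank one does not.
ASSUMED_REQUIRED = ("tagger", "parser", "ner")


def missing_components(pipe_names, required=ASSUMED_REQUIRED):
    """Return the required component names that are absent from pipe_names."""
    present = set(pipe_names)
    return [name for name in required if name not in present]


# Pipe names as spaCy 2.x reports them via nlp.pipe_names:
blank_pipeline = ["sentencizer"]             # English() plus a sentencizer
full_pipeline = ["tagger", "parser", "ner"]  # spacy.load('en_core_web_sm')

print(missing_components(blank_pipeline))
print(missing_components(full_pipeline))
```

With a real pipeline, the same check would be `missing_components(nlp.pipe_names)`, run before calling `neuralcoref.add_to_pipe`.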

However, if we try this instead, it works:

import spacy
import neuralcoref
# Download the model once beforehand (prefix with ! inside a notebook):
#   python -m spacy download en_core_web_sm

body  =  """
The Chrysler building was sold for ... [COPY AND PASTE EXACT HERE]
"""

nlp = spacy.load('en_core_web_sm')
# Use the default dependency parser for sentence tokenization

neuralcoref.add_to_pipe(nlp, greedyness=0.45)

doc = nlp(body)
doc._.has_coref
#True
doc._.coref_resolved

"\nThe Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of \nThe Chrysler Building, the famous art deco New York skyscraper previous sales price.\nThe deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal.\nMubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008.\nReal estate firm Tishman Speyer had owned the other 10%.\nThe buyer is RFR Holding, a New York real estate company.\nOfficials with Tishman and RFR did not immediately respond to a request for comments.\nIt's unclear when the deal will close.\nthe building sold fairly quickly after being publicly placed on the market only two months ago.\nThe sale was handled by CBRE Group.\nThe incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.\nThe rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028.\nMeantime, rents in the building are not rising nearly that fast.\nWhile the building is an iconic landmark in the New York skyline, the building is competing against newer office towers with large floor-to-ceiling windows and all the modern amenities.\nStill the building is among the best known in the city, even to people who have never been to New York.\nIt is famous for its triangle-shaped, vaulted windows worked into the stylized crown, along with its distinctive eagle gargoyles near the top.\nIt has been featured prominently in many films, including Men in Black 3, Spider-Man, Armageddon, Two Weeks Notice and Independence Day.\nThe previous sale took place just before the 2008 financial meltdown led to a plunge in real estate prices.\nStill there have been a number of high profile skyscrapers purchased for top dollar in recent years, including the Waldorf Astoria hotel, which Chinese firm Anbang Insurance purchased in 2016 for nearly 
$2 billion, and the Willis Tower in Chicago, which was formerly known as Sears Tower, once the world's tallest.\nBlackstone Group (BX) bought Blackstone Group (BX) for $1.3 billion 2015.\nthe building was the headquarters of the American automaker until 1953, but the building was named for and owned by Chrysler chief Walter Chrysler, not the company itself.\nWalter Chrysler had set out to build the tallest building in the world, a competition at that time with another Manhattan skyscraper under construction at 40 Wall Street at the south end of Manhattan. Walter Chrysler kept secret the plans for the spire that would grace the top of the building, building it inside the structure and out of view of the public until 40 Wall Street was complete.\nOnce the competitor could rise no higher, the spire of the building was raised into view, giving the spire of the Chrysler building the title.\n"

Clearly, there are a lot of issues here (e.g., "Blackstone Group (BX) bought Blackstone Group (BX) for $1.3 billion 2015", where the buyer has been substituted for the thing it bought).

So it is almost better that this repo is working without neuralcoref.

However, neuralcoref gets 65 F1 on OntoNotes, whereas over the last 3 years the state of the art has progressed to BERT or SpanBERT (~80 F1). So maybe we should use those instead?

https://github.com/mandarjoshi90/coref

dmmiller612 commented 4 years ago

Neuralcoref has been nothing but a headache since it was added. I think a better strategy that I want to move to is to bring your own coreference model that works with spacy. I really want to get rid of the spacy/coreference dependency because it has been causing people issues with installation.
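A rough sketch of what the "bring your own coreference model" idea could look like, not the repo's actual API: the preprocessing step accepts any callable that maps raw text to coref-resolved text and does nothing when none is supplied. `CorefResolver`, `preprocess`, and `toy_resolver` are hypothetical names used only for illustration.

```python
from typing import Callable, Optional

# Hypothetical interface: any callable that maps raw text to resolved text.
CorefResolver = Callable[[str], str]


def preprocess(body: str, resolver: Optional[CorefResolver] = None) -> str:
    """Apply the caller-supplied resolver if given, else return the text as-is."""
    if resolver is None:
        return body
    return resolver(body)


def toy_resolver(text: str) -> str:
    # Stand-in for a real coreference model; only rewrites a standalone "It".
    return text.replace(" It ", " The building ")


print(preprocess("The tower is tall. It is old."))
print(preprocess("The tower is tall. It is old.", toy_resolver))
```

Someone who still wants neuralcoref could pass something like `lambda text: nlp(text)._.coref_resolved`, keeping spaCy and neuralcoref out of the package's required dependencies.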