dealing with BYLINE issues

danielsgriffin commented 8 years ago

Rather than hacking into BYLINE solutions. I'm proposing that we catch the first sentence and disallow it from being used in a question, but without deleting the sentence itself. This will allow us to maintain the use of the entities from the first sentence of that article.

Catching it with a span index.

danielsgriffin commented 8 years ago

I think this works. I need to have test examples before and after (in my initial test the byline was already skipped.)

In mcq_generator.py for sent in self._top_sents: if sent.start < 2: continue

danielsgriffin commented 8 years ago

Ex. 1: (filename: Lawyer for 2 Americans Detained in UAE Says US Can Do More.txt) before solution: Issue in question: DUBAI, ___________ — A defense attorney for two Libyan-Americans imprisoned in the United Arab Emirates said Monday the U.S. government can do more to ensure they receive greater medical attention and are released.

**Issue in answer choices:*** v.mcq_g: DEBUG: The defendants acknowledge raising money for humanitarian supplies for the Libyan National Transitional Council with documented approval from the ___________ government. v.mcq_g: DEBUG: GPE v.mcq_g: DEBUG: ent_text: UAE v.mcq_g: DEBUG: other_choices: ['the United States', 'UAE', 'Abu Dhabi', 'DUBAI']

after solution: Issue remains in answer choices. Note: Solution was moved from mcq_generator.py to sentence_summarizer.py in order to improve the selection of sentences available for question development. (The latter sends the top five sentences. If one of those five was the first sentence it would be dropped if the solution was in the former. Keeping the solution in the latter ensures five sentences are still sent to the former for question generation.) Solution adjusted to (this will allow sentences where space succeeds in identifying and separating the byline to continue through to a question, see Test1 below):

            # Remove risk of sentence with byline forming questions.
            if sent.start == 0:
                continue

Test1: Ran on 399 New York Times articles. Results before solution*:

NO BYLINE = 226
spaCy CAUGHT BYLINE = 62 (Ex. first sentence as tokenized by spaCy: MANILA —)
spaCy FAILED TO CATCH BYLINE = 111 (Ex. first sentence as tokenized by spaCy: WHITESBURG, Ky. — Moonshine packs a punch in this corner of Appalachia, where making hooch is steeped in local lore.)

These results are inexact b/c followed rough rules. Some instances of the common journalistic practice of starting articles with an all cap word was also treated as a byline by my rules. Note that Spacy does well with bylines when it is a not including a city or country in regular caps.

Next: Identify and deal with entities in all_caps (only looking at those due to byline issues). Solution:

    def build_entities_dictionary(self):
        ...
            if entity.start == 0 and entity.text_with_ws.strip().isupper():
                continue

This will also remove some things highlighted in https://github.com/TheGadflyProject/TheGadflyProject/issues/38 (but only where they are people; I found one instance in my 399 article of that: 'HILLARY RODHAM CLINTON').

Rather than changing to proper case I am simply not allowing those entities to enter the entities dictionaries and be considered possible other answer choices. I think this is fine. It is likely that those entities appear in proper case elsewhere in the article and we will implement future solutions to create additional cities if the options are not enough.

after solution (back to testing the single named file above): DUBAI is successfully removed from possible answer choices.

danielsgriffin commented 8 years ago

Pushed to MCQ.

danielsgriffin commented 8 years ago

Instituted much simpler fix in q_generator_base.py: (In the function tfidf before sending sents through TF_IDFSummarizer) sents = [sent for sent in sents][1:] # Issue #37

danielsgriffin commented 8 years ago

This becomes a much tricker problem if we want to prioritize early sents for their importance (and so look at using the first sentence in questions).

vijayv commented 8 years ago

@danielsgriffin I addressed this in a fix on the gadfly-api, we should no longer be running into this problem.

vijayv commented 8 years ago

We should no longer exclude the first sentence of the article. Thanks for recording.

This was fixed with Regex using the following function:

  272 def clean_text(article):
  273     article = re.sub("“", '"', article)
  274     article = re.sub("”", '"', article)
  275     article = re.sub("’", "'", article)
  276     article = re.sub("Advertisement ", "", article)
  277     article = re.sub("Continue reading the main story", "", article)
+ 278     # Should replace "Photo RIO DE JANEIRO — BRAZIL" with "Photo - Brazil"
+ 279     article = re.sub("([A-Z]|\s)+\—", "", article)
  280     article = re.sub("Photo ", "", article)
  281     return re.sub("[\n*]", " ", article)

danielsgriffin commented 8 years ago

Does this get the below?

KABUL, Afghanistan —

http://www.nytimes.com/2016/04/12/world/middleeast/suicide-bomber-kills-at-least-12-afghan-army-recruits-on-bus.html

vijayv commented 8 years ago

@danielsgriffin The original rule did not... modified rule does:

([A-Za-z,]|\s)+\—

danielsgriffin commented 8 years ago

slight modification suggestion: return re.sub("[\s*]", " ", article)

vijayv commented 8 years ago

Done.

On Tue, May 3, 2016 at 6:44 PM danielsgriffin notifications@github.com wrote:

slight modification suggestion: return re.sub("[\s*]", " ", article)

— You are receiving this because you modified the open/close state. Reply to this email directly or view it on GitHub https://github.com/TheGadflyProject/TheGadflyProject/issues/37#issuecomment-216718244

TheGadflyProject / TheGadflyProject

dealing with BYLINE issues #37

These results are inexact b/c followed rough rules. Some instances of the common journalistic practice of starting articles with an all cap word was also treated as a byline by my rules. Note that Spacy does well with bylines when it is a not including a city or country in regular caps.