ayota / ddl_nlp

Repo for DDL research lab project.

Bugfixes to wikipedia ingest to make it more robust #45

Closed lauralorenz closed 8 years ago

lauralorenz commented 8 years ago

While running the larger corpus we noticed some errors occurring during wikipedia ingestion. This issue is to address those two errors.

To reproduce, use the data/eval_words files as your search term input, though you may be able to reproduce with just the "bad" search term, which was Activase. You'll see this error:

/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 85, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 40, in fetch_corpus
    wiki_search.get_wikipedia_pages(term, data_dir, results)
  File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
    save_wiki_text(search_term, local_file_path)
  File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
    page = wpg(wiki_search_term)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "active" may refer to: 
Active (album)
Active Records
Active (ship)
Active (1764 ship)
Active (1850)
Active (1877)
Active (sternwheeler)
HMS Active
USCS Active (1852)
USCGC Active
USRC Active
USS Active
Active (whaler)
Active Enterprises
Sky Active
Active (pharmacology)
Active, Alabama
ACTIVE
Locomotion No 1
fraternities and sororities
Active lifestyle
Activation
Activity (disambiguation)
Passive (disambiguation)
All pages beginning with "Active"

As far as we can tell, this happens because the search for the term Activase returns a results list that includes the page title Active, but when we later try to retrieve that page with the wikipedia.page method against the title Active, the wikipedia API raises a DisambiguationError. In these cases we want to drop the search result, since we cannot determine programmatically how to disambiguate it. This could occur at either the first search here (i.e. against Activase) or the second round of search here (i.e. against Active), so we will need to catch it in both places.
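A minimal sketch of the catch we have in mind (the helper name `fetch_page_or_skip` is hypothetical, and the exception class here is a stand-in for wikipedia.exceptions.DisambiguationError so the pattern is runnable without the library):

```python
class DisambiguationError(Exception):
    """Stand-in for wikipedia.exceptions.DisambiguationError."""
    pass


def fetch_page_or_skip(term, fetch_page):
    """Return the page for term, or None when the title is ambiguous.

    fetch_page stands in for wikipedia.page; in the real ingest code
    we would catch wikipedia.exceptions.DisambiguationError instead.
    """
    try:
        return fetch_page(term)
    except DisambiguationError:
        # Drop the result: we cannot disambiguate it programmatically.
        return None
```

The same try/except would wrap both places: the first search (against Activase) and the second retrieval (against Active).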

To reproduce, use the following search terms, though you may be able to reproduce with just the "bad" search term, which was Malignant tumor of lung.

Renal failure
Kidney failure
Abortion
Miscarriage
Heart
Myocardium
Stroke
Delusion
Schizophrenia
Calcification
Stenosis
Tumor metastasis
Adenocarcinoma
Congestive heart failure
Pulmonary edema
Pulmonary fibrosis
Malignant tumor of lung
Diarrhea
Stomach cramps

This will trigger this error:

Traceback (most recent call last):
  File "fun_3000/get_corpus.py", line 85, in <module>
    fetch_corpus(search_terms, directory, results)
  File "fun_3000/get_corpus.py", line 40, in fetch_corpus
    wiki_search.get_wikipedia_pages(term, data_dir, results)
  File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
    save_wiki_text(search_term, local_file_path)
  File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
    page = wpg(wiki_search_term)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 345, in __load
    raise PageError(self.title)
wikipedia.exceptions.PageError: Page id "malignant tumors of luna" does not match any pages. Try another id!

This one is a little more confusing, since the original term was actually malignant tumors of lung: at some point the library tries to instantiate a WikipediaPage against the title malignant tumors of luna and fails to find the page. It is further confused by what I think is a bug in the wikipedia driver's PageError path: the error message complains about a page id even though we searched by title. This implies that the exception's first positional argument is always interpreted as a page_id unless page_id is explicitly passed as None.
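To make the suspected driver bug concrete, here is a simplified stand-in mirroring what the PageError signature appears to do (this is an illustration, not the actual wikipedia package code):

```python
class PageError(Exception):
    """Simplified stand-in for wikipedia.exceptions.PageError."""

    def __init__(self, pageid=None, *args):
        if pageid:
            # Any first positional argument lands here, even a title,
            # which is why the message above complains about a page id.
            self.message = 'Page id "%s" does not match any pages.' % pageid
        else:
            self.message = 'Page "%s" does not match any pages.' % args[0]
        super(PageError, self).__init__(self.message)
```

Under this reading, raise PageError(self.title) produces a page-id message even though we searched by title; the title would have to be passed as PageError(None, title) to take the other branch.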

For this one we might also want to investigate the auto_suggest flag, as that may be how we're getting the weird respelling of our original search term.
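The suspected mechanism can be sketched like this (`resolve_title` and `suggest` are hypothetical names standing in for the library's internal suggestion step):

```python
def resolve_title(term, suggest, auto_suggest=True):
    """Mimic wikipedia.page's title resolution.

    With auto_suggest on (the library's default), the query can be
    silently swapped for a fuzzy suggestion, e.g.
    "Malignant tumor of lung" -> "malignant tumors of luna".
    """
    if auto_suggest:
        suggestion = suggest(term)
        if suggestion:
            return suggestion
    return term
```

If this is the cause, calling wikipedia.page(term, auto_suggest=False) should keep our original spelling, at the cost of losing helpful suggestions for genuine typos.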