While running the larger corpus we noticed two errors occurring during Wikipedia ingestion. This issue is to address those two errors.
Disambiguation errors
To reproduce, use the data/eval_words files as your search term input, though the single "bad" search term, Activase, may be enough on its own. You'll see this error:
/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
Traceback (most recent call last):
File "fun_3000/get_corpus.py", line 85, in <module>
fetch_corpus(search_terms, directory, results)
File "fun_3000/get_corpus.py", line 40, in fetch_corpus
wiki_search.get_wikipedia_pages(term, data_dir, results)
File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
save_wiki_text(search_term, local_file_path)
File "/Users/llorenz/Development/ddl/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
page = wpg(wiki_search_term)
File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
return WikipediaPage(title, redirect=redirect, preload=preload)
File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
self.__load(redirect=redirect, preload=preload)
File "/Users/llorenz/Envs/ddl_nlp/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "active" may refer to:
Active (album)
Active Records
Active (ship)
Active (1764 ship)
Active (1850)
Active (1877)
Active (sternwheeler)
HMS Active
USCS Active (1852)
USCGC Active
USRC Active
USS Active
Active (whaler)
Active Enterprises
Sky Active
Active (pharmacology)
Active, Alabama
ACTIVE
Locomotion No 1
fraternities and sororities
Active lifestyle
Activation
Activity (disambiguation)
Passive (disambiguation)
All pages beginning with "Active"
As far as we can tell, the search term Activase returns a results list that includes the page title Active, but when we later try to retrieve that page by calling the wikipedia.page method with the title Active, the wikipedia library raises a DisambiguationError. In these cases we want to drop the search result, since we cannot determine programmatically how to disambiguate it. This can happen at either the first search (i.e. against Activase) or the second round of search (i.e. against Active), so we need to catch the error in both places.
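The "drop the result" behaviour could be sketched roughly as below. This is only an illustration, not the existing ingestion code: `fetch_pages_skipping` and its parameters are hypothetical names, and the fetch function and exception types are passed in so that the same helper can wrap both the first and the second round of lookups.

```python
def fetch_pages_skipping(titles, fetch_page, skip_errors):
    """Fetch each title, dropping any title whose fetch raises one of skip_errors.

    titles      -- iterable of page titles (e.g. a search results list)
    fetch_page  -- callable taking a title and returning a page object
    skip_errors -- tuple of exception classes that mean "drop this result"

    Returns (pages, skipped), where skipped holds (title, error name) pairs
    so that dropped results can still be logged.
    """
    pages, skipped = [], []
    for title in titles:
        try:
            pages.append(fetch_page(title))
        except skip_errors as err:
            # We cannot disambiguate programmatically, so record and move on.
            skipped.append((title, err.__class__.__name__))
    return pages, skipped
```

With the wikipedia package this might be called as `fetch_pages_skipping(wikipedia.search("Activase"), wikipedia.page, (wikipedia.exceptions.DisambiguationError,))`, assuming the results list is what we iterate over in wikipedia_ingest.py.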
Page ID not found
To reproduce, use the following search terms, though the single "bad" search term, Malignant tumor of lung, may be enough on its own. This will trigger this error:
Traceback (most recent call last):
File "fun_3000/get_corpus.py", line 85, in <module>
fetch_corpus(search_terms, directory, results)
File "fun_3000/get_corpus.py", line 40, in fetch_corpus
wiki_search.get_wikipedia_pages(term, data_dir, results)
File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 52, in get_wikipedia_pages
save_wiki_text(search_term, local_file_path)
File "/Users/donaldvetal/Projects/ddl_nlp/fun_3000/ingestion/wikipedia_ingest.py", line 17, in save_wiki_text
page = wpg(wiki_search_term)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
return WikipediaPage(title, redirect=redirect, preload=preload)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
self.__load(redirect=redirect, preload=preload)
File "/Users/donaldvetal/anaconda/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 345, in __load
raise PageError(self.title)
wikipedia.exceptions.PageError: Page id "malignant tumors of luna" does not match any pages. Try another id!
This one is more confusing, since the original term was actually malignant tumors of lung. At some point the code tries to instantiate a WikipediaPage against the title malignant tumors of luna and fails to find the page. The error message adds to the confusion: it complains about a page id even though we searched by title. I think that is a bug in the wikipedia driver's PageError path, since it implies that the first positional argument will always be interpreted as page_id unless page_id is explicitly passed as None.
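To illustrate the suspected cause of the misleading message, here is a small stand-in exception. It is a hypothetical class written for this issue, not the wikipedia package's actual source; it only assumes a constructor whose first positional parameter is the page id.

```python
class StandInPageError(Exception):
    """Hypothetical stand-in for illustration; not the library's real class."""

    def __init__(self, pageid=None, *args):
        # Anything passed first by position lands in pageid, even a title.
        if pageid:
            self.pageid = pageid
        else:
            self.title = args[0]

    def __str__(self):
        if hasattr(self, "title"):
            return 'Title "{0}" does not match any pages.'.format(self.title)
        return 'Page id "{0}" does not match any pages.'.format(self.pageid)


# Raising with the title as the only positional argument reports a page id.
by_position = str(StandInPageError("malignant tumors of luna"))
# Passing None first lets the title through to the title branch.
by_none_first = str(StandInPageError(None, "malignant tumors of luna"))
```

If the real PageError behaves like this stand-in, then raising it with the title as the sole positional argument would produce the "Page id ..." wording seen in the traceback.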
For this one we might also want to look into the auto_suggest flag, as that may be how we're getting the weird respelling of our original search term.
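One way to test the auto_suggest theory is to look the term up with suggestions turned off and treat a missing page as "no result". The helper below is a sketch under that assumption; `fetch_exact` is a hypothetical name, and the fetch function and error types are injected so it can be tried without network access.

```python
def fetch_exact(term, fetch_page, not_found_errors):
    """Look up term exactly as given, returning None when no page matches.

    fetch_page       -- callable accepting (title, auto_suggest=...)
    not_found_errors -- tuple of exception classes meaning "no such page"
    """
    try:
        # auto_suggest=False asks for the title as typed, with no respelling.
        return fetch_page(term, auto_suggest=False)
    except not_found_errors:
        return None
```

With the wikipedia package this might be called as `fetch_exact("malignant tumors of lung", wikipedia.page, (wikipedia.exceptions.PageError,))`. Whether this actually avoids the respelling to "luna" still needs to be confirmed against the live API.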