dssg / givinggraph

An API tool to help understand the relationships between non-profits, for-profits, and the causes they support.
https://github.com/dssg/givinggraph/wiki/API
MIT License

Our Yahoo search function doesn't deal with long URLs correctly #1

Closed: JohnHBrock closed this issue 11 years ago

JohnHBrock commented 11 years ago

In yahoo.search.get_search_results(...), we're using BeautifulSoup and a regex to parse the titles of search results, which lets us avoid executing JavaScript.

Unfortunately, Yahoo truncates the URLs when displaying them in the titles of search results, so you end up with stuff like:

  • secure.peta.org/site/Advocacy?cmd=display&page=...
  • www.foxnews.com/us/2013/07/29/cnn-new-crossfire-producer...

We'll need to either switch back to executing JavaScript using something like PhantomJS or maybe do some clever manual parsing of the JavaScript.

This is currently not a high priority because we're only using it to find the URLs of Twitter accounts, which are short.

atqamar commented 11 years ago

I have no such issues when scraping Yahoo search.

    import urllib2
    from bs4 import BeautifulSoup  # assumption: bs4; the original snippet did not show its imports

    url = 'http://search.yahoo.com/search?p=' + query  # "query" is what you're searching for
    soup = BeautifulSoup(urllib2.urlopen(url))
    results = soup.findAll('a', {'class': 'yschttl spt'})  # anchors for the top 10 Yahoo search results
    links = [i['href'] for i in results]  # list of search links in full
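To illustrate why reading the href attribute sidesteps the truncation, here is a minimal sketch. It is not the project's code: the HTML sample and the example.com URL are made up for illustration, the 'yschttl spt' class is taken from the snippet above and may not match what Yahoo serves, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies (Python 3).

```python
from html.parser import HTMLParser

# Hypothetical result markup (an assumption, not real Yahoo output):
# the visible link text is shortened with "...", the href is complete.
SAMPLE_HTML = (
    '<a class="yschttl spt" '
    'href="http://www.example.com/a/very/long/path/to/some/article.html">'
    'www.example.com/a/very/long/path...</a>'
)


class ResultLinkParser(HTMLParser):
    """Collect (href, visible text) pairs for anchors with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently being read
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Exact match on the class attribute string, as in the snippet above.
        if tag == "a" and attrs.get("class") == self.css_class:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html, css_class="yschttl spt"):
    """Return a list of (href, visible text) tuples for matching anchors."""
    parser = ResultLinkParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.links


for href, text in extract_links(SAMPLE_HTML):
    print("displayed:", text)
    print("href:     ", href)
```

If the real markup behaves like this sample, parsing the displayed title yields the shortened string, while the href carries the full URL, so neither PhantomJS nor JavaScript parsing would be needed for this case.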

On Wed, Jul 31, 2013 at 11:10 AM, JohnHBrock notifications@github.com wrote:

In yahoo.search.get_search_results(...), we're using BeautifulSoup and a regex to parse the titles of search results, which lets us avoid executing JavaScript.

Unfortunately, Yahoo truncates the URLs when displaying them in the titles of search results, so you end up with stuff like:

  • secure.peta.org/site/Advocacy?cmd=display&page=...
  • www.foxnews.com/us/2013/07/29/cnn-new-crossfire-producer...

We'll need to either switch back to executing JavaScript using something like PhantomJS or maybe do some clever manual parsing of the JavaScript.

This is currently not a high priority because we're only using it to find the URLs of Twitter accounts, which are short.
