Horoscope plugin's HTML scraping no longer matches the page we scrape

nasonfish commented 8 years ago

We scrape HTML from a site, and that site changed its markup so that the fields we relied on no longer match the classes we use in horoscope.py. As a result, we can't find a sign and return an error no matter what.

edwardslabs commented 8 years ago

I think we should switch away from HTML scraping if possible. It looks like there are a few APIs available. This was the top Google hit, and it seems like it could work: https://github.com/tapasweni-pathak/Horoscope-API

dmptrluke commented 8 years ago

That's a web frontend for https://testpypi.python.org/pypi/horoscope, which rips from Ganeshaspeaks... sooo, not really better.

edwardslabs commented 8 years ago

If there is no free API, maybe horoscope gets dropped, since maintaining an HTML scraping plugin can be pretty burdensome.
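
If the scraper stays, part of that burden is noticing when the page layout changes. A minimal sketch of a layout check, using only the standard library: it reports which expected CSS classes are missing from a page. The helper names (`ClassCollector`, `missing_classes`) are hypothetical, and the class names in the usage note are the ones used by the updated plugin posted in this thread; whether the site still uses them is an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(html, expected):
    """Return the expected class names that do not occur in the HTML."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return sorted(set(expected) - collector.classes)
```

Running `missing_classes(page_html, ["f40", "block-horoscope-text"])` against a freshly fetched page would return an empty list while the layout still matches, and the names of the vanished classes once it does not, which makes for a clearer error than "could not get the horoscope".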

nasonfish commented 8 years ago

For what it's worth, here's an updated version of the plugin. Where to go from here is debatable, though: should we keep supporting this site or not?

# Plugin by Infinity - <https://github.com/infinitylabs/UguuBot>

import requests
from bs4 import BeautifulSoup

from cloudbot import hook
from cloudbot.util import formatting

@hook.on_start()
def init(db):
    db.execute("create table if not exists horoscope(nick primary key, sign)")
    db.commit()

@hook.command(autohelp=False)
def horoscope(text, db, bot, notice, nick):
    """<sign> - get your horoscope"""

    headers = {'User-Agent': bot.user_agent}

    # check if the user asked us not to save their details
    dontsave = text.endswith(" dontsave")
    if dontsave:
        sign = text[:-9].strip().lower()
    else:
        # normalise here too, so the URL and the saved sign are consistent
        sign = text.strip().lower()

    db.execute("create table if not exists horoscope(nick primary key, sign)")

    if not sign:
        sign = db.execute("select sign from horoscope where "
                          "nick=lower(:nick)", {'nick': nick}).fetchone()
        if not sign:
            notice("horoscope <sign> -- Get your horoscope")
            return
        sign = sign[0]

    url = "http://my.horoscope.com/astrology/free-daily-horoscope-{}.html".format(sign)

    try:
        request = requests.get(url, headers=headers)
        request.raise_for_status()
    except (requests.exceptions.HTTPError, requests.exceptions.ConnectionError) as e:
        return "Could not get horoscope: {}.".format(e)

    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(request.text, "html.parser")

    title = soup.find_all('h1', {'class': 'f40'})
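    if not title:
        return "Could not get the horoscope for {}.".format(sign)

    title = title[0].text.strip()
    # find() returns None when the class is missing, so check before using it
    horoscope_block = soup.find('div', {'class': 'block-horoscope-text'})
    if horoscope_block is None:
        return "Could not get the horoscope for {}.".format(sign)
    horoscope_text = horoscope_block.text.strip()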
    result = "\x02{}\x02 {}".format(title, horoscope_text)
    result = formatting.strip_html(result)

    if text and not dontsave:
        db.execute("insert or replace into horoscope(nick, sign) values (:nick, :sign)",
                   {'nick': nick.lower(), 'sign': sign})
        db.commit()

    return result
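
One thing that might help separate "the user typed a bad sign" from "the scraper is broken again" is validating the sign before any request is made. The following is only a sketch: `parse_sign_argument` and `is_valid_sign` are hypothetical helpers, not part of CloudBot, and it assumes the site's URLs use the twelve plain English sign names.

```python
# Assumption: the site's URL slugs are the twelve lowercase English sign names.
ZODIAC_SIGNS = frozenset([
    "aries", "taurus", "gemini", "cancer", "leo", "virgo",
    "libra", "scorpio", "sagittarius", "capricorn", "aquarius", "pisces",
])


def parse_sign_argument(text):
    """Split the command text into (sign, dontsave).

    Hypothetical helper: lowercases the input and strips a trailing
    "dontsave" flag if one is present.
    """
    words = text.strip().lower().split()
    dontsave = bool(words) and words[-1] == "dontsave"
    if dontsave:
        words = words[:-1]
    return " ".join(words), dontsave


def is_valid_sign(sign):
    """Return True if sign is one of the twelve known zodiac signs."""
    return sign in ZODIAC_SIGNS
```

With a check like this in front of the request, an unknown sign can be rejected with a usage message, and a failure to find the title or text for a known sign points fairly reliably at the site having changed its HTML.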

CloudBotIRC / CloudBot

Horoscope plugin's HTML scraping no longer matches the page we scrape #199