calthoff / tstp

This is an old repository for the exercises in "The Self-Taught Programmer." Please see /selftaught.

Chapter 20, Challenge 1 (Web Scraper) #8

Open jonathan-j-stone opened 7 years ago

jonathan-j-stone commented 7 years ago

The solution linked in the book (http://tinyurl.com/gkv6fuh) returns URLs when run, just as the practice example did. The same URLs, in fact.

I thought it was supposed to return headlines.

There's also so much unexplained new material in this chapter that I didn't feel I had any hope of solving the challenge. Even if the solution worked, I wouldn't understand the code.
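For reference, the challenge asks for headlines, not URLs. A minimal sketch of pulling headline text out of anchor tags with BeautifulSoup, run here against a static HTML snippet (the `./articles/` href pattern is an assumption borrowed from the old Google News markup, which has since changed):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<a href="./articles/1">First headline</a>
<a href="./articles/2">Second headline</a>
<a href="/settings">Settings</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only anchors that link to articles, and take their text as the headline.
headlines = [a.get_text(strip=True)
             for a in soup.find_all("a")
             if "articles" in (a.get("href") or "")]
print(headlines)  # ['First headline', 'Second headline']
```

The same filter-by-href idea works on a live page; only the href pattern would need updating for the current markup.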

totodo713 commented 5 years ago

Me too! I think the Google News front end changed from server-rendered HTML to JavaScript. Here is my URL-collecting code. (Sorry if I've gotten something wrong.)

import os
import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        sp = BeautifulSoup(html, "html.parser")

        # Collect the relative links to individual articles.
        articles = set()
        for tag in sp.find_all("a"):
            article = tag.get("href")
            if article is None:
                continue
            if "articles" in article:
                articles.add(article)  # a set already ignores duplicates

        urls = set()
        for article in articles:
            # The hrefs are relative ("./articles/..."), so drop the
            # leading "./" before joining them onto the site root.
            r = urllib.request.urlopen(self.site + article[2:])
            html = r.read()
            sp = BeautifulSoup(html, "html.parser")
            title = sp.find("title").text.replace("Google News - ", "")

            if len(title) > 0:
                for tag in sp.find_all("a"):
                    url = tag.get("href")
                    if url is None:
                        continue
                    if "html" in url:
                        urls.add(url)

        os.makedirs("./out", exist_ok=True)  # the write below fails if ./out is missing
        with open("./out/news.txt", "w", encoding="utf8") as f:
            for url in urls:
                f.write(f"{url}\n")

if __name__ == '__main__':
    news = "https://news.google.com/"
    Scraper(news).scrape()
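Since the original challenge asks for headlines, the scraper could write out the `title` text it already computes per article instead of URLs. A small, offline-testable sketch of that extraction step (the `"Google News - "` prefix handling mirrors the code above; `headline_from_article_html` is a hypothetical helper, not part of the book's solution):

```python
from bs4 import BeautifulSoup

def headline_from_article_html(html):
    """Extract a headline the way the scraper above does:
    take the <title> text and strip the site prefix."""
    sp = BeautifulSoup(html, "html.parser")
    title = sp.find("title")
    if title is None:
        return ""
    return title.text.replace("Google News - ", "").strip()

sample = "<html><head><title>Google News - Big story today</title></head></html>"
print(headline_from_article_html(sample))  # Big story today
```

Factoring the parse into a function like this also makes the logic testable without hitting the network, which helps when the live markup keeps changing.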