calthoff / self_taught

This repository contains the exercises for "The Self-Taught Programmer: The Definitive Guide to Programming Professionally."
http://theselftaughtprogrammer.io

Webscraping Code Outputted Nothing to Shell #21

Open wesleyhedrick opened 6 years ago

wesleyhedrick commented 6 years ago

I am really excited about the potential of webscraping! But when I ran the webscraping code from chapter 20, it output nothing to the interactive shell. When I pressed run, the interactive shell window came to the foreground, but there was nothing in it.

In case I had mistyped the code, I decided to copy and paste right from tinyurl. Still nothing.

Please help.

calthoff commented 6 years ago

Hello! I rechecked the web scraper code and it is working just fine on my end. Please post a question (and make sure to include your actual code) in the Self-Taught Programmers Facebook group: https://www.facebook.com/groups/selftaughtprogrammers/.

jsteve427 commented 5 years ago

I'm having the same problem. The code runs, but it doesn't print anything, despite copy/pasting the code from here. I even posted a question to the FB group and the responses unfortunately didn't help. I first tried running the code on Win. 10 then Linux Mint 18 (currently don't have access to a Mac) to see if that would change anything, but it didn't.

John-m555 commented 5 years ago

I have the same problem, so I added one line to print "url", as shown below.

    for tag in sp.find_all("a"):
        url = tag.get("href")
        if url is None:
            continue
        # Added: print every link. This must come after the None check,
        # because "\n" + None raises a TypeError.
        print("\n" + url)
        if "html" in url:
            print("\n" + url)

With this addition, you will get a list of every "url". Copy the output into a text editor and try to find a URL that actually appears on https://news.google.com. You will see that none of the news article links on the page appear in the list of "url".

It means none of the "href" values collected in "url" match the actual URLs on the Google News home page. Did we get the right "sp" from BeautifulSoup?
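For anyone who wants to try this kind of check without hitting the network, here is a minimal sketch. It assumes a made-up sample page (`SAMPLE_HTML`) and uses the standard library's `html.parser` instead of BeautifulSoup, so it is not the book's code; `LinkCollector` is a hypothetical helper name. It collects every "href" and then shows which ones would pass the `"html" in url` test.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not from the book)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:  # skip anchors that have no href at all
            self.links.append(href)


# Made-up sample standing in for a downloaded home page (assumption).
SAMPLE_HTML = """
<a href="./articles/abc123?hl=en-US">Story one</a>
<a name="top">No href here</a>
<a href="https://example.com/news/story.html">Story two</a>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)

all_links = collector.links
# Same substring test the scraper applies before printing.
html_links = [url for url in all_links if "html" in url]

print("all links:", all_links)
print("links containing 'html':", html_links)
```

If the second list comes back empty for a real page, the scraper has nothing to print, which would match the blank shell described in this thread.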

John-m555 commented 5 years ago

Guys,
Now I found what's wrong. Our program is not wrong; the HTML of https://news.google.com must have changed after the book was written. As I stated above, none of the links in the current HTML contain "html", so the program never prints a URL.

Then try changing the URL as shown below. You will get a complete list of the links on the Yahoo home page, the same as in the book. (This works only as of 28-Jul, 2018.)

    news = "https://www.yahoo.com/"
    Scraper(news).scrape()

sui74 commented 5 years ago

    news = "https://www.yahoo.com/"
    Scraper(news).scrape()

The code worked correctly, thank you.

EvanKardos42 commented 5 years ago

This is a simple fix, and I think it's good practice for your problem-solving skills, since it just requires you to think about what the algorithm is doing and what it is looking at.

SPOILERS below:

So here is the thing about the web scraper: it looks for links that contain "html", but Google now uses a link to a link (it's weird). For example, copying the first link on the site (1/24/2019) gives you this:

"https://news.google.com/articles/CAIiEFK2EuxWltO3z6pUctxA2HwqFwgEKg8IACoHCAowjuuKAzCWrzww9oEY?hl=en-US&gl=US&ceid=US%3Aen"

That is not the link to the actual article itself. This is: "https://www.nytimes.com/2019/01/24/us/politics/senate-vote-fails-shutdown.html"

It's a link to a link, and I don't know why Google does this.
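To make the difference concrete, here is a small sketch that applies the scraper's substring test to the two URLs quoted above. The function name `passes_book_filter` is made up for illustration; the only thing taken from the thread is the `"html" in url` check.

```python
# The two URLs quoted in this comment.
google_link = (
    "https://news.google.com/articles/"
    "CAIiEFK2EuxWltO3z6pUctxA2HwqFwgEKg8IACoHCAowjuuKAzCWrzww9oEY"
    "?hl=en-US&gl=US&ceid=US%3Aen"
)
article_link = (
    "https://www.nytimes.com/2019/01/24/us/politics/"
    "senate-vote-fails-shutdown.html"
)


def passes_book_filter(url):
    """Return True if the scraper's substring test would print this URL."""
    return "html" in url


google_passes = passes_book_filter(google_link)
article_passes = passes_book_filter(article_link)

print("Google redirect link passes:", google_passes)
print("Direct article link passes:", article_passes)
```

The Google link fails the test and the direct article link passes it, which is why the scraper prints nothing for Google News but still works on pages that link straight to ".html" articles.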