ClimateMisinformation / Scrapers

Web scrapers

Please assess if this text-only site is suitable for scraping #1

Closed ricjhill closed 3 years ago

ricjhill commented 3 years ago

The news sources we are using have complex front ends. There used to be text-only versions (BBC, Guardian), which are easier to scrape, but they seem to have been deprecated. Are these replacements useful?

https://guardian.gyford.com/

dudeperf3ct commented 3 years ago

@ricjhill Nice catch! I was able to scrape the Guardian easily, around 26k posts. BBC is proving to be a challenge.

alexn11 commented 3 years ago

Yeah, for BBC I had to use Selenium to navigate through the search pages and extract a list of links from them; then I could scrape the links as normal. Even with that, an error eventually stopped the program from going through all the search pages. The simpler the page, the better.

ricjhill commented 3 years ago

What's the error? I can dig into it if you like.

alexn11 commented 3 years ago

There are two types of errors: sometimes the content of the page is "stale" (I don't know why); the other is when the program cannot find the "next page" element to click on, despite being at a page number below the stated total number of pages. These happen in BBC-non-climate.py at lines 121 and 130 respectively. Despite that I got 1400+ links, so it's not a huge issue. Feel free to look into it if you want (it could probably get several thousand more articles).

ricjhill commented 3 years ago

> sometimes the content of the page is "stale"

Could be cached content from a content delivery network. Do you get an HTTP error? Which process raises the error?

> the program cannot find the "next page" element to click on

I don't know what that could be.

alexn11 commented 3 years ago

> could be cached content from a content delivery network , Do you get a HTTP error? Which process raises the error?

Idk how to check that. After the "get", I find an "ol" list and loop over the "li" elements; it's only when I get the href attribute of an "a" tag in the element that I get a StaleElementReferenceException.

dudeperf3ct commented 3 years ago

the program cannot find the "next page" element to click on

If you visit the BBC climate homepage, there is pagination of 50 pages. The logic for the scraper would be to get the URLs for these next pages. But the URL present in the pagination under the div class qa-pagination-right is /false/page/2, for which no page exists. That's why it's difficult to scrape the BBC climate page. If there were a way to find the specific URL of the next page, maybe we could scrape it. But after clicking the next page, the same URL as the first page appears.

alexn11 commented 3 years ago

> If you visit the BBC climate homepage, there is pagination of 50 pages. The logic for the scraper would be to get the URLs for these next pages. But the URL present in the pagination under the div class qa-pagination-right is /false/page/2, for which no page exists. That's why it's difficult to scrape the BBC climate page. If there were a way to find the specific URL of the next page, maybe we could scrape it. But after clicking the next page, the same URL as the first page appears.

That's why I use Selenium for that: find the 'qa-pagination-next-page' element, then just click(); the URL is determined dynamically and is probably session-dependent.

ricjhill commented 3 years ago

Closing this. It appears you have a workaround for the scraping difficulties.

ricjhill commented 3 years ago

https://www.dailymail.co.uk/textbased/channel-1/index.html