
Read All About It

Newspaper Text Data Scraper


πŸ—ž Library for scraping news data from major newspapers in the US & UK.

Gather complete text data, including headlines and full article text, for Natural Language Processing projects.

URLs for individual articles are scraped from each news site, then the newspaper3k library is used to download and parse the article content.
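For reference, the newspaper3k step looks roughly like this (the URL below is a placeholder; in practice the scraper supplies the URLs it has collected):

```python
from newspaper import Article

# Placeholder URL for illustration only.
url = "https://www.theguardian.com/world/2020/jan/01/example-article"

article = Article(url)
article.download()
article.parse()

print(article.title)  # headline
print(article.text)   # full article body
```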

This project was originally intended to track the relationship between the sentiment of Covid-19 related news articles and stock market returns. For that purpose, sentiment.py loads the Harvard IV/Lasswell Psychosocial dictionary, which may be used to create a basic Document-Term matrix. The intention was to extend this with feature-based sentiment analysis (pre-reading content is included in the repo for interested parties), but the project has now been abandoned for the foreseeable future.
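As an illustration of that intended workflow only (the dictionary terms below are placeholders, not the actual sentiment.py implementation or the real Harvard IV/Lasswell word lists), a basic Document-Term matrix restricted to dictionary terms could be built like this:

```python
# Sketch of a Document-Term matrix limited to a sentiment dictionary's vocabulary.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    "Three men were arrested after the incident...",
    "A new world record was set in the city marathon...",
]
dictionary_terms = ["arrested", "record", "incident"]  # placeholder vocabulary

vectorizer = CountVectorizer(vocabulary=dictionary_terms, lowercase=True)
dtm = pd.DataFrame(
    vectorizer.fit_transform(articles).toarray(),
    columns=vectorizer.get_feature_names_out(),
)
print(dtm)
```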

Table of Contents

  1. Newspapers Supported
  2. Installation
  3. Usage
  4. Development

Newspapers Supported

πŸ“°πŸ‡¬πŸ‡§ Guardian

πŸ“°πŸ‡ΊπŸ‡Έ NYPost

πŸ“°πŸ‡ΊπŸ‡Έ WallStreetJournal

πŸ“°πŸ‡ΊπŸ‡Έ LATimes

πŸ“°πŸ‡ΊπŸ‡Έ NYTimes

πŸ“°πŸ‡ΊπŸ‡Έ MotherJones

πŸ“°πŸ‡ΊπŸ‡Έ PBS [Unfinished]

πŸ“°πŸ‡¬πŸ‡§πŸ‡ΊπŸ‡Έ DailyMail

The original intention was to gather data from a much larger group of newspapers, both geographically and politically, but the underlying research project was deprioritised and then abandoned. If you would like another newspaper source supported, please open an issue and I will add it.

I make no promise of maintenance if existing newspaper sites change.

Installation


```
git clone https://github.com/SJDunkelman/newspaper_scraper.git
cd newspaper_scraper
```
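Dependencies are not listed here; as a rough guide (an assumption, so check the repository for an actual requirements file), the scraper needs at least newspaper3k and pandas, with pyarrow or fastparquet optional for Parquet output:

```
pip install newspaper3k pandas
pip install pyarrow   # optional, enables Parquet output
```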

Usage


Due to the sequential layout of news sites, articles must be found and downloaded from today backwards to a date of your choosing.

In main.py, change LAST_DATE to the earliest date you would like to scrape back to.
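For example (assuming LAST_DATE is a datetime.date; check main.py for the exact format it expects):

```python
# In main.py — scrape from today back to 1 March 2020.
from datetime import date

LAST_DATE = date(2020, 3, 1)  # assumption: main.py expects a datetime.date
```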

Then run in Terminal:

```
python main.py
```

Output

If the fastparquet or pyarrow library is installed, the dataframe is saved as a Parquet binary file, reducing file size. Otherwise, the final dataframe is saved as a CSV file and looks like this (when full_text=True is set for the Scraper):

| Title | Source | Text | Date |
| --- | --- | --- | --- |
| Man arrested after... | Guardian | Three men were arrested... | 1/1/2020 |
| New record set in... | Guardian | A new world record was... | 1/1/2020 |
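To pick the output back up for downstream NLP work, something along these lines should do (the output filenames are assumptions; check main.py for the actual path):

```python
import pandas as pd

# "articles.parquet" / "articles.csv" are illustrative names only.
try:
    df = pd.read_parquet("articles.parquet")  # written when pyarrow/fastparquet is available
except FileNotFoundError:
    df = pd.read_csv("articles.csv")

print(df[["Title", "Source", "Date"]].head())
```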

Development


The Newspaper base class in newspapers.py makes it quick and easy to add a new newspaper source, and at the time of creation (2020) the completed examples covered the general layouts news sites typically use. On most news sites you either browse all articles chronologically by altering a URL pattern, or you simulate endless scrolling on a load-on-scroll dynamic page; a sketch of the first case is shown below.
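As a rough sketch only (the real base class lives in newspapers.py and the method names here are assumptions, not its actual interface), a URL-pattern source could look like this:

```python
# Hypothetical subclass for a site browsable via a dated archive URL pattern.
# Mirror the real Newspaper base class in newspapers.py when adding a source.
from datetime import date, timedelta

class ExampleTimes:  # would inherit from Newspaper in newspapers.py
    archive_url = "https://www.example-times.com/archive/{year}/{month:02d}/{day:02d}"

    def article_urls_for(self, day: date) -> list[str]:
        """Return the article URLs listed on the archive page for one day."""
        page_url = self.archive_url.format(year=day.year, month=day.month, day=day.day)
        # ...fetch page_url and extract the <a> hrefs that point at articles...
        return []

    def scrape_back_to(self, last_date: date) -> list[str]:
        """Walk backwards day by day from today until last_date."""
        urls, day = [], date.today()
        while day >= last_date:
            urls.extend(self.article_urls_for(day))
            day -= timedelta(days=1)
        return urls
```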