Library for scraping news data from major newspapers in the US & UK.
Gather complete text data, including headlines and full article text, for Natural Language Processing projects.
URLs for individual articles are scraped from the news site, then the library newspaper3k is used to extract the article content.
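As a rough illustration of the second step, the sketch below downloads and parses a single article with newspaper3k's Article class (download() and parse() are part of that library). The function name fetch_article and the article_cls parameter are assumptions made for this example and are not taken from this repository.

```python
def fetch_article(url, article_cls=None):
    """Download and parse one article, returning its title and body text."""
    if article_cls is None:
        # Imported lazily so the sketch loads even without newspaper3k installed.
        from newspaper import Article
        article_cls = Article
    article = article_cls(url)
    article.download()  # fetch the raw HTML for the URL
    article.parse()     # extract the title, body text and other fields
    return {"title": article.title, "text": article.text}
```

Passing a different article_cls is only a convenience for trying the function without network access.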
This project was originally intended to track the relationship between the sentiment of Covid-19 related news articles and stock market returns. For that reason, sentiment.py loads the Harvard IV/Lasswell Psychosocial dictionary, which may be used to create a basic Document-Term matrix. The intention was to expand this with feature-based sentiment analysis (pre-reading content is included in the repo for interested parties), but the project has now been abandoned for the foreseeable future.
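For readers unfamiliar with the term, a Document-Term matrix has one row per document and one column per dictionary word, with each cell holding a word count. The sketch below is a minimal standard-library version; the function name and the three-word toy vocabulary are hypothetical stand-ins and do not reflect the contents of sentiment.py or the real dictionary.

```python
import re
from collections import Counter

def document_term_matrix(documents, vocabulary):
    """Count how often each vocabulary word occurs in each document."""
    vocab = sorted({word.lower() for word in vocabulary})
    rows = []
    for doc in documents:
        # Lowercase and split into simple alphabetic tokens.
        counts = Counter(re.findall(r"[a-z']+", doc.lower()))
        rows.append([counts[word] for word in vocab])
    return vocab, rows

# Toy word list for illustration only; not the real sentiment dictionary.
toy_vocabulary = ["fear", "hope", "loss"]
toy_docs = ["Fear and loss dominate markets", "Hope returns as fear fades"]
vocab, matrix = document_term_matrix(toy_docs, toy_vocabulary)
# vocab  -> ['fear', 'hope', 'loss']
# matrix -> [[1, 0, 1], [1, 1, 0]]
```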
📰🇬🇧 Guardian
📰🇺🇸 NYPost
📰🇺🇸 WallStreetJournal
📰🇺🇸 LATimes
📰🇺🇸 NYTimes
📰🇺🇸 MotherJones
📰🇺🇸 PBS [Unfinished]
📰🇬🇧🇺🇸 DailyMail
The original intention was to gather data from a much larger group of sources, both geographically and politically, but the underlying research project was abandoned in favour of other priorities. If you would like a newspaper source added, please open an issue and I will add it.
I make no promise of maintenance if existing newspaper sites change.
git clone https://github.com/SJDunkelman/newspaper_scraper.git
cd newspaper_scraper
Due to the sequential nature of news site layouts, the articles must be found and downloaded from today back to a date of your choosing.
In main.py, change LAST_DATE to the date you would like to scrape backwards to.
Then run in a terminal:
python main.py
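The backwards walk from today to LAST_DATE can be pictured with the short sketch below. It assumes LAST_DATE is a datetime.date, which may not match the type used in main.py, and the helper name dates_back_to is hypothetical.

```python
from datetime import date, timedelta

def dates_back_to(last_date, today=None):
    """Yield each day from today back to last_date (inclusive), newest first."""
    current = today or date.today()
    while current >= last_date:
        yield current
        current -= timedelta(days=1)

LAST_DATE = date(2020, 1, 1)  # illustrative value only
days = list(dates_back_to(LAST_DATE, today=date(2020, 1, 3)))
# days -> [date(2020, 1, 3), date(2020, 1, 2), date(2020, 1, 1)]
```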
If the fastparquet or pyarrow library is installed, the dataframe is saved as a Parquet binary file, reducing size. If not, the final dataframe is saved as a CSV file. With full_text=True for Scraper, it looks like:
Title | Source | Text | Date |
---|---|---|---|
Man arrested after... | Guardian | Three men were arrested... | 1/1/2020 |
New record set in... | Guardian | A new world record was... | 1/1/2020 |
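One way to implement the Parquet-or-CSV fallback described above is to check whether either engine can be imported before choosing the writer. This is a sketch under assumptions, not the code in this repository: the function names are hypothetical, and df is expected to be a pandas DataFrame (to_parquet and to_csv are standard pandas methods).

```python
from importlib.util import find_spec

def parquet_available():
    """Return True if a Parquet engine (pyarrow or fastparquet) is installed."""
    return any(find_spec(name) is not None for name in ("pyarrow", "fastparquet"))

def save_frame(df, stem, use_parquet=None):
    """Save df as <stem>.parquet when possible, otherwise as <stem>.csv."""
    if use_parquet is None:
        use_parquet = parquet_available()
    if use_parquet:
        path = f"{stem}.parquet"
        df.to_parquet(path)
    else:
        path = f"{stem}.csv"
        df.to_csv(path, index=False)  # omit the row index from the CSV
    return path
```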
The base class Newspaper in newspapers.py makes it quick and easy to add a new newspaper source, and at the time of creation (2020) the completed examples covered all the general layouts news sites typically use. On each news site you typically either browse all articles chronologically by altering a URL pattern, or you simulate endless scrolling on a load-on-scroll dynamic web page.
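To give a feel for the first layout (browsing chronologically by altering a URL pattern), here is a minimal sketch of a base class with one subclass. Everything in it is hypothetical: the class names, method names, and the news.example.com URL are invented for illustration and are not the actual Newspaper interface in newspapers.py. The endless-scrolling layout would additionally need a browser driver and is left out.

```python
import re
from abc import ABC, abstractmethod
from datetime import date

class NewsSource(ABC):
    """Hypothetical base class: a subclass only says where a day's listing lives."""

    @abstractmethod
    def listing_url(self, day):
        """Return the URL of the page listing articles for the given day."""

    def extract_links(self, html):
        """Pull absolute links out of listing-page HTML (deliberately naive)."""
        return re.findall(r'href="(https?://[^"]+)"', html)

class ExampleDaily(NewsSource):
    """Made-up site whose archive is addressed by year/month/day in the URL."""
    pattern = "https://news.example.com/archive/{y}/{m:02d}/{d:02d}"

    def listing_url(self, day):
        return self.pattern.format(y=day.year, m=day.month, d=day.day)

source = ExampleDaily()
url = source.listing_url(date(2020, 1, 5))
# url -> 'https://news.example.com/archive/2020/01/05'
links = source.extract_links('<a href="https://news.example.com/a1">Story</a>')
# links -> ['https://news.example.com/a1']
```

Keeping the site-specific part down to a single method is what makes adding a source cheap in a design like this; the shared link extraction and download logic stays in the base class.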