andreantonacci / everynoise_scraper

Two webscrapers to collect data from everynoise.com

Scraper for everynoise.com

Overview

This repository contains two webscrapers to collect data from everynoise.com.

(1) New releases: A list of (weekly) album and single releases to Spotify, by country.

The data is scraped from everynoise.com/new_releases_by_genre.cgi.

(2) Worldbrowser: A list of "promoted"/"featured" playlists on Spotify, by playlist category, hour-of-the-day (if available), and country.


The scraper collects this data from everynoise.com/worldbrowser.cgi, but the data originates from the Spotify Web API, which powers the browse interface of the Spotify platform.

Screenshot of the playlist browse feature on Spotify

Collecting the raw data

First, please install...

Then, you can run the data collections:

Documentation of output

The two webscrapers write the output of the data collections to JSON files.

(1) New releases

The data is written to newline-delimited JSON files (one JSON object per line), named everynoise_newreleases_YYYYMMDD.json, where YYYYMMDD is the datestamp of the day the scraper was run. Each file lists the weekly releases to the Spotify platform by country. Each release is characterized by an albumId/albumName and an associated artistName/artistId. The trackId in the data below identifies a preview snippet of the album that users can click to listen to part of the release. Singles are released as single-track albums.

JSON file structure

{
  "countryCode": "EC", # two-letter country code
  "trackId": "spotify:track:2rRhbOTbTwAUq45qdllfST", # Spotify track ID of a preview track of the album release
  "artistId": "spotify:artist:07YUOmWljBTXwIseAUd9TW", # Spotify artist ID of the album release
  "rank": "EC rank: 10", # Rank (probably popularity rank; exact definition is pending)
  "artistName": "Sebastián Yatra", # Artist name associated with album release
  "albumId": "spotify:album:2B4n5Uy0rYJ1btdqtUsrw8", # Spotify album ID
  "albumName": "Un Año (En Vivo)", # Album name
  "scrapeUnix": 1570447279, # Unix time stamp when the data was scraped
  "scrapeDate": "20191007", # Datestamp when the data was scraped
  "everynoiseDate": "20191004" # Date when track/album was released to Spotify
}
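
As a rough illustration of how such a record could be consumed, the following Python sketch parses one line of a new-releases file into typed values. It assumes that the "#" annotations above are documentation only and do not appear in the actual files, and that the rank field always follows the "<country> rank: <number>" pattern of the sample. The function names parse_release and read_releases are hypothetical and not part of this repository.

```python
import json
from datetime import datetime, timezone


def parse_release(line: str) -> dict:
    """Parse one line of a new-releases file into a flat, typed record."""
    rec = json.loads(line)
    # "EC rank: 10" -> "10"; assumes the pattern seen in the sample record.
    rank_text = rec["rank"].rsplit(":", 1)[-1].strip()
    return {
        "country": rec["countryCode"],
        # Keep only the last part of the URI, e.g. "spotify:album:<id>" -> "<id>".
        "album_id": rec["albumId"].rsplit(":", 1)[-1],
        "artist_id": rec["artistId"].rsplit(":", 1)[-1],
        "track_id": rec["trackId"].rsplit(":", 1)[-1],
        "album_name": rec["albumName"],
        "artist_name": rec["artistName"],
        # None if the rank text is not a plain number.
        "rank": int(rank_text) if rank_text.isdigit() else None,
        "released": datetime.strptime(rec["everynoiseDate"], "%Y%m%d").date(),
        "scraped_at": datetime.fromtimestamp(rec["scrapeUnix"], tz=timezone.utc),
    }


def read_releases(path: str):
    """Yield parsed records from a newline-delimited JSON file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                yield parse_release(line)
```

Stripping the "spotify:album:" prefix is optional; the bare ID is what the Spotify Web API endpoints expect, while the full URI is convenient for opening items in the Spotify client.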

(2) Worldbrowser

The data is written to newline-delimited JSON files (one JSON object per line), named everynoise_worldbrowser_YYYYMMDD__HHMM.json, where YYYYMMDD is the datestamp and HHMM the hour-minute timestamp of when the scraper was run.
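
When many output files accumulate, the scrape time can be recovered from the file name alone. The sketch below is one possible way to do this in Python; it assumes that HHMM is a 24-hour clock value and that file names follow the two patterns described in this document exactly. The function name parse_filename is hypothetical.

```python
import re
from datetime import datetime

# Matches both documented patterns:
#   everynoise_newreleases_YYYYMMDD.json
#   everynoise_worldbrowser_YYYYMMDD__HHMM.json
_PATTERN = re.compile(
    r"everynoise_(newreleases|worldbrowser)_(\d{8})(?:__(\d{4}))?\.json$"
)


def parse_filename(name: str):
    """Return (scraper kind, scrape datetime) for an output file name."""
    match = _PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name}")
    kind, day, hhmm = match.groups()
    # New-releases files carry no time component; midnight is used as a placeholder.
    stamp = datetime.strptime(day + (hhmm or "0000"), "%Y%m%d%H%M")
    return kind, stamp
```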

JSON file structure

{
  "sectionName": "featured",
  "countryName": "Global",
  "countryCode": "3",
  "playlistIdArray": [
    "spotify:playlist:37i9dQZF1DX3rxVfibe1L0",
    "spotify:playlist:37i9dQZF1DXcBWIGoYBM5M",
    "spotify:playlist:37i9dQZF1DX1s9knjP51Oa",
    "spotify:playlist:37i9dQZF1DX0XUsuxWHRQd",
    "spotify:playlist:37i9dQZF1DX4pUKG1kS0Ac",
    "spotify:playlist:37i9dQZF1DWSXBu5naYCM9",
    "spotify:playlist:37i9dQZF1DWXRqgorJj26U",
    "spotify:playlist:37i9dQZF1DX7ZUug1ANKRP",
    "spotify:playlist:37i9dQZF1DWWQRwui0ExPn",
    "spotify:playlist:37i9dQZF1DWYmmr74INQlb",
    "spotify:playlist:37i9dQZF1DX2Nc3B70tvx0",
    "spotify:playlist:37i9dQZF1DWVViFqIfGGV7"
  ],
  "scrapeUnix": 1572350843,
  "scrapeDate": "20191029",
  "everyNoiseHour": "08:07am",
  "everyNoiseHourReference": "-23"
}
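
For analysis, it is often easier to work with one row per playlist rather than one row per section. The following Python sketch flattens a worldbrowser record accordingly. It assumes that the order of playlistIdArray reflects the order in which playlists were displayed, which this document does not confirm; the function name playlist_rows and the position field are hypothetical.

```python
import json


def playlist_rows(line: str):
    """Yield one flat row per playlist from a single worldbrowser record."""
    rec = json.loads(line)
    for position, uri in enumerate(rec["playlistIdArray"], start=1):
        yield {
            "sectionName": rec["sectionName"],
            "countryName": rec["countryName"],
            "scrapeDate": rec["scrapeDate"],
            "everyNoiseHour": rec["everyNoiseHour"],
            # 1-based position in the array (assumed to be display order).
            "position": position,
            # "spotify:playlist:<id>" -> "<id>"
            "playlistId": uri.rsplit(":", 1)[-1],
        }
```

Each line of a worldbrowser file can be passed to playlist_rows in turn, and the resulting dictionaries collected into a list or written to a CSV file.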