A short exercise in Web scraping.
Run 00-init.r after installing its package dependencies.
Run 01-data.r as many times as needed to collect all raw data and process it.
Run 02-plots.r to generate summary plots.
Note: the small issues raised by data collection might get fixed.
ala_data.csv
Contains information on all A List Apart articles published between 1999 (inception) and 2016:
date -- date of publication, in yyyy-mm-dd format
url -- URI
tags -- topic(s), semicolon-separated (see ala_tags.csv for a description)
au_url -- URI(s) of the author(s), semicolon-separated
au_id -- name(s) of the author(s), semicolon-separated
title -- title
description -- short article description, from the <meta> tag
A single article is missing (/article/xhtml), and A List Apart blog posts published between 2013 and 2015 are downloaded but excluded from the ala_data.csv dataset.
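Because tags, au_url and au_id hold several values per cell, a typical first step is to split them. The following is a minimal base R sketch, not part of the project's scripts; the two sample rows are invented for illustration and only mimic the documented column layout.

```r
# Hypothetical sample rows mimicking the documented columns (not real data).
sample_csv <- 'date,url,tags,au_url,au_id,title,description
2001-01-05,/article/example-one,css;layout,/author/jdoe,Jane Doe,Example One,A made-up description
2002-03-10,/article/example-two,css,/author/jdoe;/author/rroe,Jane Doe;Richard Roe,Example Two,Another made-up description'

# With the real file: d <- read.csv("ala_data.csv", stringsAsFactors = FALSE)
d <- read.csv(text = sample_csv, stringsAsFactors = FALSE)
d$date <- as.Date(d$date)  # yyyy-mm-dd parses with the default format

# Split the semicolon-separated tags into one character vector per article
tag_list <- strsplit(d$tags, ";", fixed = TRUE)

# Long format: one row per (article, tag) pair
long <- data.frame(
  url = rep(d$url, lengths(tag_list)),
  tag = unlist(tag_list),
  stringsAsFactors = FALSE
)

# Number of articles per tag
tag_counts <- table(long$tag)
```

The same strsplit() pattern should apply to au_url and au_id, assuming the two columns list authors in the same order.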
ala_refs.csv
Contains the edge list of article cross-citations:
i -- url of the citing article (source)
j -- url of the cited article (target)
n -- number of times the citation occurs in the source
Note that a few (7) articles do not show up as sources in the edge list because of HTML parsing errors. The problem is explained in detail in this Stack Overflow post.
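As an illustration of how an i/j/n edge list of this shape can be summarised, here is a small base R sketch. The four edges are made up for the example, and the variable names are assumptions rather than anything used in the project's scripts.

```r
# Hypothetical edge list with the documented columns i, j, n (not real data).
# With the real file: refs <- read.csv("ala_refs.csv", stringsAsFactors = FALSE)
refs <- data.frame(
  i = c("/article/a", "/article/a", "/article/b", "/article/c"),
  j = c("/article/b", "/article/c", "/article/c", "/article/b"),
  n = c(1, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Number of distinct articles citing each target (in-degree)
in_degree <- table(refs$j)

# Total citations received, counting repeats (weighted in-degree)
cites_received <- tapply(refs$n, refs$j, sum)

# Articles that cite others but are never cited themselves
never_cited <- setdiff(refs$i, refs$j)
```

A data frame with source, target and weight columns in this order can also be passed to a graph package, for example igraph's graph_from_data_frame(), if network measures beyond simple counts are needed.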
ala_tags.csv
Contains the general and specific topics of the articles:
parent -- general topic
tag -- specific (child) topic
Since general topics are also used to categorise articles, the parent and tag columns (parent and child) are sometimes identical.
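One possible way to use this table is as a lookup from specific to general topics. The sketch below uses base R and three invented rows; the topic names are placeholders, not values taken from ala_tags.csv.

```r
# Hypothetical tag table with the documented columns (not real data).
# With the real file: tags <- read.csv("ala_tags.csv", stringsAsFactors = FALSE)
tags <- data.frame(
  parent = c("code", "code", "design"),
  tag    = c("code", "css", "typography"),
  stringsAsFactors = FALSE
)

# Rows where the general topic is itself used as an article tag
is_general <- tags$parent == tags$tag

# Named lookup vector: specific topic -> general topic
parent_of <- setNames(tags$parent, tags$tag)

# Map one article's semicolon-separated tags to their general topics
article_tags <- strsplit("css;typography;code", ";", fixed = TRUE)[[1]]
general <- unique(unname(parent_of[article_tags]))
```

Keeping the rows where parent and tag are identical means the lookup also resolves articles tagged directly with a general topic.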
All A List Apart articles are Copyright © 1998–2017 A List Apart & Authors.
Please do not redistribute the raw data for this project.