SocratesClub / cc2017

Nanjing University "Computational Communication" course, Spring 2017
http://computational-communication.com/
MIT License

Commonly Used Dataset Websites for Complex Networks #2

Open chengjun opened 7 years ago

chengjun commented 7 years ago

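Author: Hdevin
Source: http://www.jianshu.com/p/9313bc75c94b

2016.09.21 08:36

Much research on complex networks depends on datasets. Below are some dataset websites I have collected from the Internet in the course of my research, listed here for the convenience of fellow researchers.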

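1. http://vladowiki.fmf.uni-lj.si/doku.php?id=pajek:data:urls:index

A collection of dataset websites. It lists many dataset sites and is very comprehensive. Many of the listed sites describe their data in detail and offer direct downloads, so you can pick out a few to bookmark.

2. http://snap.stanford.edu/data/

This is Stanford University's large network dataset site, which most readers are probably already familiar with.

3. http://konect.uni-koblenz.de/

This is my personal favorite and the site I use most often. It hosts more than a hundred datasets with very detailed categories and descriptions, and it also provides a visualization and some basic statistics for each dataset. All data can be downloaded directly.

4. http://networkrepository.com/index.php

Another site I particularly like and use often. Like the one above, its data is categorized in great detail, and you should be able to find whatever data you need.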

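5. http://gdm.fudan.edu.cn/GDMWiki/Wiki.jsp?page=Network%20DataSet

A network dataset site built by Fudan University, with some commonly used datasets and links to related resource sites.

6. https://toreopsahl.com/datasets/

This site has a dozen or so datasets, including commonly used social, transportation, and collaboration networks.

7. http://netwiki.amath.unc.edu/SharedData/SharedData

This site lists some datasets as well as the personal data pages of several leading complex-network researchers. It is worth a look.

8. http://vlado.fmf.uni-lj.si/pub/networks/data/

These are the datasets provided by the Pajek site. The data are classics: many datasets used in early complex-network research come from here.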

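9. http://socialcomputing.asu.edu/pages/datasets

Social network datasets, including data from commonly used online community sites in China and abroad, at medium to large scale. Colleagues working in social computing probably use them the most.

10. http://www.sociopatterns.org/datasets/

Another set of social network datasets. These lean more toward physical-world networks, such as contact networks, university friendship networks, and disease transmission networks.

11. http://www3.nd.edu/~networks/resources.htm

CCNR, the data site of the renowned Prof. Barabási. Besides the datasets, the site has a lot of other material worth learning from.

12. http://www-personal.umich.edu/~mejn/netdata/

The personal data page of the renowned Prof. Newman. Its datasets, especially the scientist collaboration networks, are very widely used.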

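These are the complex-network data sites I use most. My knowledge is limited, so the descriptions above may not be detailed enough, and many institutions are not listed; please bear with me. If you use any of these datasets, be sure to cite their authors: collecting data and making it public for everyone to use is no small effort. If you repost this article, please also credit the source ^_^.

If you have additions, you can contact me: hdevin@outlook.com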

chengjun commented 7 years ago

title: "List of Internet Behavior Datasets" date: 2013-12-15 20:24 comments: true categories:

Publicly Accessible Datasets of Human Online Behavior

Edited by Lingfei Wu & Chengjun Wang Sep.15, 2011

Refer to google docs

Newly added datasets are listed below:


2012


ICWSM 2011 Spinn3r Dataset http://www.icwsm.org/data/

This dataset, provided by Spinn3r.com, is a continuation of the 2009 Spinn3r Dataset. It consists of over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset’s time period).

1. Reddit voting data

http://www.reddit.com/r/redditdev/comments/lowwf/attempt_2_want_to_help_reddit_build_a_recommender/

2. Tencent Weibo mining

http://www.kddcup2012.org/c/kddcup2012-track1/data

3. Rating analysis and online forum mining of Apple

http://sifaka.cs.uiuc.edu/~wang296/Data/index.html

4. Web research collections of blogs and government websites

http://ir.dcs.gla.ac.uk/test_collections/

5. Query data of Microsoft Research

QRU-1: A Public Dataset for Promoting Query Representation and Understanding Research.

Data Set: http://research.microsoft.com/en-us/downloads/d6e8c8f2-721f-4222-81fa-4251b6c33752/

Paper: http://research.microsoft.com/en-us/people/hangli/qru-1.pdf

A new public dataset for promoting query representation and understanding research, referred to as QRU-1, was recently released by Microsoft Research. The QRU-1 dataset contains reformulations of Web TREC topics that are automatically generated using a large-scale proprietary web search log, without compromising user privacy. In this paper, we describe the content of this dataset and the process of its creation. We also discuss the potential uses of the dataset, including a detailed description of a query reformulation experiment.


2011 & Before


Project 1. Stanford Large Network Dataset Collection

http://snap.stanford.edu/data/index.html

● Social networks: online social networks, edges represent interactions between people

● Communication networks: email communication networks with edges representing communication

● Citation networks: nodes represent papers, edges represent citations

● Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)

● Web graphs: nodes represent webpages and edges are hyperlinks

● Amazon networks: nodes represent products and edges link commonly co-purchased products

● Internet networks: nodes represent computers and edges represent communication

● Road networks: nodes represent intersections and edges represent the roads connecting them

● Autonomous systems: graphs of the internet

● Signed networks: networks with positive and negative edges (friend/foe, trust/distrust)

● Wikipedia networks and metadata: talk, editing and voting data from Wikipedia

● Twitter and Memetracker: Memetracker phrases, links and 467 million tweets
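Most graphs in a collection like this are distributed as plain-text edge lists. The sketch below shows one way to load such a file and count its nodes and edges. It assumes one whitespace-separated `source target` pair per line and `#`-prefixed comment lines, which should be checked against the actual download; the function names are illustrative and the sample data is made up.

```python
import io
from collections import defaultdict


def read_edge_list(lines, directed=True):
    """Build an adjacency map from whitespace-separated "src dst" lines."""
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' header comments (assumed convention).
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        adjacency[src].add(dst)
        if directed:
            adjacency[dst]  # touch dst so nodes without out-edges are counted
        else:
            adjacency[dst].add(src)
    return adjacency


def count_nodes_and_edges(adjacency, directed=True):
    """Return (nodes, edges); the undirected count assumes no self-loops."""
    endpoints = sum(len(neighbors) for neighbors in adjacency.values())
    edges = endpoints if directed else endpoints // 2
    return len(adjacency), edges


# Tiny made-up sample in the assumed layout, not real data.
sample = io.StringIO("# FromNodeId\tToNodeId\n0\t1\n0\t2\n1\t2\n2\t0\n")
graph = read_edge_list(sample, directed=True)
nodes, edges = count_nodes_and_edges(graph, directed=True)
print(nodes, edges)
```

For a real file, passing an open file handle (or `gzip.open(path, "rt")` for a compressed download) in place of `sample` should work the same way, and the resulting counts can be compared with the Nodes and Edges columns of the tables.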

Social networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| soc-Epinions1 | Directed | 75,879 | 508,837 | Who-trusts-whom network of Epinions.com |
| soc-LiveJournal1 | Directed | 4,847,571 | 68,993,773 | LiveJournal online social network |
| soc-Slashdot0811 | Directed | 77,360 | 905,468 | Slashdot social network from November 2008 |
| soc-Slashdot0922 | Directed | 82,168 | 948,464 | Slashdot social network from February 2009 |
| wiki-Vote | Directed | 7,115 | 103,689 | Wikipedia who-votes-on-whom network |

Communication networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| email-EuAll | Directed | 265,214 | 420,045 | Email network from a EU research institution |
| email-Enron | Undirected | 36,692 | 367,662 | Email communication network from Enron |
| wiki-Talk | Directed | 2,394,385 | 5,021,410 | Wikipedia talk (communication) network |

Citation networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| cit-HepPh | Directed, Temporal, Labeled | 34,546 | 421,578 | Arxiv High Energy Physics paper citation network |
| cit-HepTh | Directed, Temporal, Labeled | 27,770 | 352,807 | Arxiv High Energy Physics paper citation network |
| cit-Patents | Directed, Temporal, Labeled | 3,774,768 | 16,518,948 | Citation network among US Patents |

Collaboration networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| ca-AstroPh | Undirected | 18,772 | 396,160 | Collaboration network of Arxiv Astro Physics |
| ca-CondMat | Undirected | 23,133 | 186,936 | Collaboration network of Arxiv Condensed Matter |
| ca-GrQc | Undirected | 5,242 | 28,980 | Collaboration network of Arxiv General Relativity |
| ca-HepPh | Undirected | 12,008 | 237,010 | Collaboration network of Arxiv High Energy Physics |
| ca-HepTh | Undirected | 9,877 | 51,971 | Collaboration network of Arxiv High Energy Physics Theory |

Web graphs

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| web-BerkStan | Directed | 685,230 | 7,600,595 | Web graph of Berkeley and Stanford |
| web-Google | Directed | 875,713 | 5,105,039 | Web graph from Google |
| web-NotreDame | Directed | 325,729 | 1,497,134 | Web graph of Notre Dame |
| web-Stanford | Directed | 281,903 | 2,312,497 | Web graph of Stanford.edu |

Product co-purchasing networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| amazon0302 | Directed | 262,111 | 1,234,877 | Amazon product co-purchasing network from March 2 2003 |
| amazon0312 | Directed | 400,727 | 3,200,440 | Amazon product co-purchasing network from March 12 2003 |
| amazon0505 | Directed | 410,236 | 3,356,824 | Amazon product co-purchasing network from May 5 2003 |
| amazon0601 | Directed | 403,394 | 3,387,388 | Amazon product co-purchasing network from June 1 2003 |
| amazon-meta | Metadata | 548,552 | 1,788,725 | Amazon product metadata: product info and all reviews on around 548,552 products |

Internet peer-to-peer networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| p2p-Gnutella04 | Directed | 10,876 | 39,994 | Gnutella peer to peer network from August 4 2002 |
| p2p-Gnutella05 | Directed | 8,846 | 31,839 | Gnutella peer to peer network from August 5 2002 |
| p2p-Gnutella06 | Directed | 8,717 | 31,525 | Gnutella peer to peer network from August 6 2002 |
| p2p-Gnutella08 | Directed | 6,301 | 20,777 | Gnutella peer to peer network from August 8 2002 |
| p2p-Gnutella09 | Directed | 8,114 | 26,013 | Gnutella peer to peer network from August 9 2002 |
| p2p-Gnutella24 | Directed | 26,518 | 65,369 | Gnutella peer to peer network from August 24 2002 |
| p2p-Gnutella25 | Directed | 22,687 | 54,705 | Gnutella peer to peer network from August 25 2002 |
| p2p-Gnutella30 | Directed | 36,682 | 88,328 | Gnutella peer to peer network from August 30 2002 |
| p2p-Gnutella31 | Directed | 62,586 | 147,892 | Gnutella peer to peer network from August 31 2002 |

Road networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| roadNet-CA | Undirected | 1,965,206 | 5,533,214 | Road network of California |
| roadNet-PA | Undirected | 1,088,092 | 3,083,796 | Road network of Pennsylvania |
| roadNet-TX | Undirected | 1,379,917 | 3,843,320 | Road network of Texas |

Autonomous systems graphs

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| as-733 (733 graphs) | Undirected | 103-6,474 | 243-13,233 | 733 daily instances (graphs) from November 8 1997 to January 2 2000 |
| as-Skitter | Undirected | 1,696,415 | 11,095,298 | Internet topology graph, from traceroutes run daily in 2005 |
| as-Caida (122 graphs) | Directed | 8,020-26,475 | 36,406-106,762 | The CAIDA AS Relationships Datasets, from January 2004 to November 2007 |
| Oregon-1 (9 graphs) | Undirected | 10,670-11,174 | 22,002-23,409 | AS peering information inferred from Oregon route-views between March 31 and May 26 2001 |
| Oregon-2 (9 graphs) | Undirected | 10,900-11,461 | 31,180-32,730 | AS peering information inferred from Oregon route-views between March 31 and May 26 2001 |

Signed networks

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| soc-sign-epinions | Directed | 131,828 | 841,372 | Epinions signed social network |
| wiki-Elec | Directed, Bipartite | ~7,000 | ~100,000 | Wikipedia adminship election data |
| soc-sign-Slashdot081106 | Directed | 77,357 | 516,575 | Slashdot Zoo signed social network from November 6 2008 |
| soc-sign-Slashdot090216 | Directed | 81,871 | 545,671 | Slashdot Zoo signed social network from February 16 2009 |
| soc-sign-Slashdot090221 | Directed | 82,144 | 549,202 | Slashdot Zoo signed social network from February 21 2009 |

Wikipedia networks and metadata

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| wiki-Vote | Directed | 7,115 | 103,689 | Wikipedia who-votes-on-whom network |
| wiki-Talk | Directed | 2,394,385 | 5,021,410 | Wikipedia talk (communication) network |
| wiki-Elec | Bipartite | ~7,000 | ~100,000 | Wikipedia adminship election data |
| wiki-meta | Edits | 2.3M users, 3.5M pages | 250M edits | Complete Wikipedia edit history (who edited what page) |

Memetracker and Twitter

| Name | Type | Nodes | Edges | Description |
| --- | --- | --- | --- | --- |
| twitter7 | Tweets | 17,069,982 users | 476,553,560 tweets | A collection of 476 million tweets collected between June-Dec 2009 |
| memetracker9 | Memes | 96 million | 418 million links | Memetracker phrases and hyperlinks between 96 million blog posts from Aug 2008 to Apr 2009 |
| ksc-time-series | Time Series | 2,000 | 418 million links | Time series of volume of 1,000 most popular Memetracker phrases and 1,000 most popular Twitter hashtags |

Project 2. MemeTracker

http://www.memetracker.org/data.html

MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs. We track the quotes and phrases that appear most frequently over time across this entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly. Overall we track more than 17 million different phrases; about 54% of the total phrase/quote mentions appear on blogs and 46% in news media.

Dataset size: 220M (clustered), 13.3 G (Zipped raw data over 9 months)

Example of a record in the file: lines below map to the fields above. First line is record A, followed by B and 3 C records. Then another B and 2 C records.

2 8 we’re not commenting on that story i’m afraid 2131865

3 3 we’re not commenting on that 489007

2008-08-18 14:23:05 1 M http://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html

2008-11-26 01:27:13 1 B http://sfweekly.com/2008-11-26/news/buy-line

2008-11-27 18:55:30 1 B http://aconstantineblacklist.blogspot.com/2008/11/re-researcher-matt-janovic.html

5 2 we’re not commenting on that story 2131864

2008-12-08 14:50:18 3 B http://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee

2008-12-08 19:35:31 2 B http://jplaystation.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee
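The nesting of A, B, and C records can be made concrete with a small parser. This is a minimal sketch under two assumptions that are not stated in the text above: that fields are tab-separated, and that A, B, and C records start with zero, one, and two leading tabs respectively. The field names (`size`, `total_freq`, and so on) are illustrative labels rather than names taken from the dataset documentation.

```python
from collections import namedtuple

Cluster = namedtuple("Cluster", "size total_freq root cluster_id phrases")
Phrase = namedtuple("Phrase", "freq n_urls text phrase_id mentions")
Mention = namedtuple("Mention", "timestamp freq url_type url")


def parse_clusters(lines):
    """Group tab-indented A/B/C records into nested Cluster objects."""
    clusters = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # Assumed rule: the number of leading tabs gives the record level.
        depth = len(line) - len(line.lstrip("\t"))
        fields = line.strip().split("\t")
        if depth == 0:    # A record: one phrase cluster
            size, total, root, cid = fields
            clusters.append(Cluster(int(size), int(total), root, cid, []))
        elif depth == 1:  # B record: one phrase variant in the current cluster
            freq, n_urls, text, pid = fields
            clusters[-1].phrases.append(
                Phrase(int(freq), int(n_urls), text, pid, []))
        else:             # C record: one dated mention of the current phrase
            stamp, freq, url_type, url = fields
            clusters[-1].phrases[-1].mentions.append(
                Mention(stamp, int(freq), url_type, url))
    return clusters


# The example record from above, re-indented with the assumed tabs.
sample = [
    "2\t8\twe're not commenting on that story i'm afraid\t2131865",
    "\t3\t3\twe're not commenting on that\t489007",
    "\t\t2008-08-18 14:23:05\t1\tM\thttp://business.theage.com.au/business/bb-chief-set-to-walk-plank-20080818-3xp7.html",
    "\t\t2008-11-26 01:27:13\t1\tB\thttp://sfweekly.com/2008-11-26/news/buy-line",
    "\t\t2008-11-27 18:55:30\t1\tB\thttp://aconstantineblacklist.blogspot.com/2008/11/re-researcher-matt-janovic.html",
    "\t5\t2\twe're not commenting on that story\t2131864",
    "\t\t2008-12-08 14:50:18\t3\tB\thttp://videogaming247.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee",
    "\t\t2008-12-08 19:35:31\t2\tB\thttp://jplaystation.com/2008/12/08/home-in-10-days-were-not-commenting-on-that-story-says-scee",
]
clusters = parse_clusters(sample)
```

Read this way, the two B records carry frequencies 3 and 5, which add up to the total of 8 on the A record.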

Project 3. PINTS – Experiments Data Sets

http://www.uni-koblenz-landau.de/koblenz/fb4/AGStaab/Research/DataSets/PINTSExperimentsDataSets/index_html

| Dataset | Users | Tags | Resources | Tag assignments | Download |
| --- | --- | --- | --- | --- | --- |
| Flickr | 319,686 | 1,607,879 | 28,153,045 | 112,900,000 | flickr_UsrResTag.7z (518 MB), packed with 7zip |
| Delicious | 532,924 | 2,481,698 | 17,262,480 | 140,126,586 | delicious_UsrResTag.7z (848 MB), packed with 7zip |

Project 4. K. Lerman’s datasets

http://www.isi.edu/integration/people/lerman/downloads.html

Digg 2009

This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the id of the voter and the time stamp of the vote. In addition, data about friendship links of voters was collected from Digg.

Flickr personal taxonomies

This anonymized data set contains personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) containing collections and their constituent sets (aka photo-albums) and collections.

Wrapper maintenance

Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.

Project 5. Facebook datasets

http://odysseas.calit2.uci.edu/doku.php/public:online_social_networks#available_datasets

Facebook social graph

The following datasets were collected in April of 2009 through data scraping from Facebook:

  1. MHRW – A sample of 957K unique users obtained Facebook-wide by 28 independent Metropolis-Hastings random walks

  2. UNI – A sample of 984K unique users that represents the “ground truth” i.e., a truly uniform sample of Facebook userIDs, selected by a rejection sampling procedure from the system’s 32-bit ID space.

UserIDs are consistent across files. UserIDs and networkIDs are anonymized. The mapping for networkIDs is available upon request.
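To illustrate why a Metropolis-Hastings random walk (MHRW) is used for sampling, here is a minimal sketch of the textbook version on a toy undirected graph. It is not the crawler used to collect the dataset: the star graph, the function name, and the step count are all assumptions made for illustration. A simple random walk visits nodes in proportion to their degree, whereas accepting a proposed move from `u` to `v` with probability `min(1, deg(u) / deg(v))` makes the walk's stationary distribution uniform over nodes.

```python
import random


def metropolis_hastings_walk(neighbors, start, steps, rng=None):
    """Return the list of nodes visited by an MHRW of `steps` moves."""
    rng = rng or random.Random()
    current = start
    visited = [current]
    for _ in range(steps):
        candidate = rng.choice(neighbors[current])
        # Accept with probability min(1, deg(current) / deg(candidate)).
        accept = len(neighbors[current]) / len(neighbors[candidate])
        if rng.random() < accept:
            current = candidate
        visited.append(current)  # a rejected move repeats the current node
    return visited


# Toy star graph: hub 0 connected to leaves 1..4 (hypothetical data).
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
walk = metropolis_hastings_walk(star, start=0, steps=20000,
                                rng=random.Random(7))
hub_share = walk.count(0) / len(walk)
print(round(hub_share, 3))
```

On this star graph a simple random walk would spend about half of its time at the hub, while the MHRW share should approach 1/5, matching a uniform sample over the five nodes.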

Facebook applications

Dataset I contains the number of active users and total application installations daily for every Facebook application between 08/29/2007 and 02/14/2008. The data was retrieved from the Adonomics website, which had been collecting aggregate application statistics, Daily Active Users (DAU) and Application Installs, by scraping the Facebook application directory.

Dataset I comprises 16,812 files (one file for each application present in the Facebook application directory until 02/14/2008).

Dataset II is collected in February 2008 and contains a list of installed applications for 297K Facebook users.

Facebook weighted random walks

The following datasets were collected in October of 2010 through data scraping from Facebook:

  1. RW – A sample of 1M unique users obtained Facebook-wide by 25 independent simple Random Walks

  2. Hybrid – A sample of 1M unique users obtained Facebook-wide by 25 independent Stratified Weighted Random Walks (S-WRW) with hybrid conflict resolution. The measurement objective in the Hybrid sample is Facebook users with college network membership.

For each dataset, we release two files. The first file contains, for each sampled userID, (i) the weight of the sampled user, (ii) the number of vfriends, i.e., the friends visitable during the social graph exploration (friends for which “View Friends”=1), (iii) the total number of friends, and (iv) the list of networkIDs of which the user is a member.

Project 6. Network data released by Eric D. Kolaczyk

http://math.bu.edu/people/kolaczyk/datasets.html

A list of the datasets used in the book Statistical Analysis of Network Data is provided. For each available dataset, the author has combined the data file(s) with a README file in a compressed ZIP file. The README file gives a description of the data, a brief characterization of the context in which they arise, and relevant information on their source.

Three of these datasets concern human online behavior and are worth paying attention to.

AIDS blogs: Network of citations among blogs related to AIDS, patients, and their support networks, collected by Gopal, over a three-day period in August 2005.

Packet delay: Packet delay data from Coates et al. resulting from an Internet packet probing experiment designed for conducting network topology inference.

Router-level Internet: A network representation of a portion of the router-level Internet, based on topology discovery measurements collected between April 21 and May 8, 2003 by the skitter measurement system at CAIDA.

Project 7. The Internet Radar Datasets

http://www-rp.lip6.fr/~latapy/Radar/

The authors design and implement an ego-centered measurement tool and perform radar-like measurements, i.e., repeated measurements of the internet topology. They conduct long-term (several weeks) and high-speed (one round every few minutes) measurements of this kind from more than one hundred monitors, and provide the resulting data, including the time-varying traffic and topology datasets.

Project 8. The eDonkey Datasets

http://www-rp.lip6.fr/~latapy/P2P_data/

The authors present a capture of the queries managed by an eDonkey server during almost 10 weeks, leading to the observation of almost 9 billion messages involving almost 90 million users and more than 275 million distinct files.

The data is split into one directory per week, one subdirectory per day, and one sub-subdirectory per hour. Each of these sub-subdirectories contains three files:

FileSearch.xml.gz contains file searches, i.e. queries based on keywords and metadata sent by clients, and answers (lists of FileId, filenames and metadata) from the server;

SourceSearch.xml.gz contains source searches, i.e. queries sent by clients to find providers for a given FileId, and answers (lists of providers) from the server;

Main.xml.gz contains basically all other queries, sent to the server to know its load (number of clients and files), to get a textual description of the server, or to get the list of other servers the server knows, and corresponding answers from the server. Notice that the server may also send this kind of queries to other servers, which we then store in this file.
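The week/day/hour layout described above can be traversed without knowing how the directories are named. The sketch below walks a local copy of the data and yields every directory that holds at least one of the three files. The root path, the function names, and the assumption that the compressed files are UTF-8 text readable line by line are all illustrative.

```python
import gzip
import os

LOG_NAMES = ("FileSearch.xml.gz", "SourceSearch.xml.gz", "Main.xml.gz")


def iter_hourly_logs(root):
    """Yield (directory, {file name: path}) for each hourly directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit weeks, days, and hours in name order
        present = {name: os.path.join(dirpath, name)
                   for name in LOG_NAMES if name in filenames}
        if present:
            yield dirpath, present


def count_lines(path):
    """Count lines in one gzip-compressed log without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def summarize(root):
    """Map each hourly directory to the line count of each log it contains."""
    return {directory: {name: count_lines(path) for name, path in logs.items()}
            for directory, logs in iter_hourly_logs(root)}
```

Calling `summarize("path/to/edonkey")` on a downloaded copy (the path is hypothetical) would give a quick check that no hour is missing one of its three files.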

Project 9. The books and music on Amazon

http://www.lambiotte.be/data.html

The author collected time-varying ranking records for 111 music and book items on Amazon, along with user data from Audioscrobbler, a music library of more than 30,000 people.

Project 10. The University of Florida Sparse Matrix Collection

http://www.cise.ufl.edu/research/sparse/matrices/groups.html

The University of Florida Sparse Matrix Collection is a large and actively growing set of sparse matrices that arise in real applications. The Collection is widely used by the numerical linear algebra community for the development and performance evaluation of sparse matrix algorithms. Its 172 matrices cover a wide spectrum of domains, including those arising from problems with underlying 2D or 3D geometry (such as structural engineering, computational fluid dynamics, model reduction, electromagnetics, semiconductor devices, thermodynamics, materials, acoustics, computer graphics/vision, robotics/kinematics, and other discretizations) and those that typically do not have such geometry (optimization, circuit simulation, economic and financial modeling, theoretical and quantum chemistry, chemical process simulation, mathematics and statistics, power networks, and other networks and graphs).

Project 11. Facebook-like social Networks

http://toreopsahl.com/datasets/#online_social_network

The Facebook-like Social Network originates from an online community for students at the University of California, Irvine. The dataset includes the users that sent or received at least one message (1,899). A total of 59,835 online messages were sent over 20,296 directed ties among these users. This network has also been described in Patterns and Dynamics of Users’ Behaviour and Interaction: Network Analysis of an Online Community and used in a number of articles, including Prominence and control: The weighted rich-club effect and Clustering in weighted networks. Although this dataset contains many nodal attributes (e.g., gender, age, and course attended), these are not made available, as it would be possible to reverse engineer the anonymisation procedure of users.

Weighted longitudinal one-mode network (weighted by number of characters): tnet-format

Binary longitudinal one-mode network: tnet-format

Weighted static one-mode network (weighted by number of characters): tnet-format; UCINET-format

Weighted static one-mode network (weighted by number of messages): tnet-format; UCINET-format

Project 12. The CAIDA Anonymized 2008-2011 Internet Traces Dataset

http://www.caida.org/data/passive/passive_2011_dataset.xml

This dataset contains anonymized passive traffic traces from CAIDA’s equinix-chicago and equinix-sanjose monitors on OC192 Internet backbone links. This data is useful for research on the characteristics of Internet traffic, including application breakdown, security events, geographic and topological distribution, and flow volume and duration.

The first traffic trace available is a 1 hour traffic trace collected during the DITL 2008 measurement event. This trace contains anonymized packet headers in pcap format on a single direction of the bidirectional OC192 link at equinix-chicago from approximately 2008-03-19 19:00 to 20:00 UTC. The hardware monitoring the other direction of the link was not functioning properly at the time of the traffic capture, so only data for a single direction was captured.

For the equinix-chicago monitor, the first monthly bidirectional traffic trace was taken on April 30 2008, and added to the Anonymized 2008 Internet Trace dataset in June 2008. This 1 hour trace resulted in 83 GB of compressed pcap files. The first monthly bidirectional traffic trace from the equinix-sanjose monitor was taken on July 17 2008.

Traffic traces in this dataset are anonymized using CryptoPAn prefix-preserving anonymization. All traces in this dataset are anonymized with the same key. In addition, the payload has been removed from all packets.

Information on the Anonymized 2009-2011 Internet Traces Datasets can also be found.

Project 13. Wikipedia Data from Amazon

http://aws.amazon.com/search?searchPath=datasets&searchQuery=wikipedia&x=0&y=0

Wikipedia Traffic Statistics V2

Contains 16 months of hourly pageview statistics for all articles in Wikipedia

Wikipedia Extraction (WEX)

A processed dump of the English language Wikipedia

Wikipedia Page Traffic Statistics

Contains 7 months of hourly pageview statistics for all articles in Wikipedia

Wikipedia Page Traffic Statistics V3

This dataset contains a 150 GB sample of the data used to power trendingtopics.org . It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).

DBpedia 3.5.1

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web

Wikipedia XML Data

A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.

Project 14. CiteULike data

http://static.citeulike.org/data/current.bz2

The CiteULike database is potentially useful for researchers in various fields. Physicists and computer scientists have expressed an interest in trying to analyse the structure of the data, and frequently ask for datasets to be made available.

Who-posted-what data

The latest data snapshot can always be downloaded at http://static.citeulike.org/data/current.bz2

Older datasets are available on a daily basis and can be found at URLs of the form http://static.citeulike.org/data/2007-05-30.bz2

Data is available from 2007-05-30 onwards.

The file constitutes an anonymous dump of who posted what and when the posting took place. There is no data in this file which is not already available publicly through the web site, so there are no privacy implications for making it available. The advantage is that it’s available in one file rather than having to spider the entire site to get at the information (please don’t do that!).

The file is a simple unix (“\n” line endings) text file with pipe (“|”) delimiters. The columns are:

  1. The CiteULike article id which was posted

  2. An obfuscated representation of the username (a salted MD5 hash of the true username). Again, it is possible to piece back together what the true username is by scraping the site, but I’d rather you didn’t do that. The reason I’ve gone to the trouble of obfuscation is primarily a slightly paranoid anti-spam measure

  3. The date and time the article was posted to the site

  4. The tag the user used to post it

NB If a user posts an article with n tags, then this will result in n rows in the file
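The four pipe-delimited columns described above map directly onto a small reader. This is a sketch rather than an official loader: the function names are illustrative, the sample rows are invented, and the timestamp is kept as a plain string because its exact format is not stated here.

```python
import csv
import io
from collections import Counter


def read_postings(handle):
    """Yield (article_id, user_hash, posted_at, tag) from a pipe-delimited dump."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 4:  # skip malformed rows rather than guess at them
            yield tuple(row)


def tags_per_posting(rows):
    """Count rows per (article, user) pair; n tags on a posting give n rows."""
    return Counter((article_id, user_hash)
                   for article_id, user_hash, _, _ in rows)


# Invented rows in the documented column order, not real CiteULike data.
sample = io.StringIO(
    "42|0a1b2c|2007-05-30 12:00:00|networks\n"
    "42|0a1b2c|2007-05-30 12:00:00|graphs\n"
    "43|9f8e7d|2007-05-31 08:15:00|tagging\n"
)
counts = tags_per_posting(read_postings(sample))
print(counts[("42", "0a1b2c")])
```

For the downloaded snapshot, `bz2.open(path, "rt")` from the standard library can supply the file handle in place of `sample`.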

Article linkout data

Mapping CiteULike article_ids to resources on the web can be done with the linkout table. The current snapshot is available at http://static.citeulike.org/data/linkouts.bz2

Older datasets are available on a daily basis and can be found at URLs of the form http://static.citeulike.org/data/linkouts-2008-02-02.bz2

Data is available from 2008-02-02 onwards.

To understand the data in this file, you should refer to “The linkout formatter” section of the plugin developer’s guide.

This file contains a number of spam links. Although CiteULike filters spam postings, traces of the spam still remain in this table. In time this spam content will eventually be removed.

The file is a simple unix (“\n” line endings) text file with pipe (“|”) delimiters. Literal pipes within the fields are represented escaped (“|”). The columns are:

  1. Article Id

  2. Linkout type

  3. ikey_1

  4. ckey_1

  5. ikey_2

  6. ckey_2

NB If an article has n linkouts, then this will result in n rows in the file.

Group membership data

The latest data snapshot can always be downloaded at http://static.citeulike.org/data/groups.bz2

Older datasets are available on a daily basis and can be found at URLs of the form http://static.citeulike.org/data/groups-2008-11-14.bz2

Data is available from 2008-11-14 onwards.

The file constitutes an anonymous dump of who is a member of each group. There is no data in this file which is not already available publicly through the web site, so there are no privacy implications for making it available. The advantage is that it’s available in one file rather than having to spider the entire site to get at the information (please don’t do that!).

The file is a simple unix (“\n” line endings) text file with pipe (“|”) delimiters. The columns are:

  1. An obfuscated group identifier.

  2. An obfuscated representation of the username (a salted MD5 hash of the true username, in the same way as is done for the article posting data).

NB If a group has n members, then this will result in n rows in the file.

Project 15. The Swarmagent Dataset

http://www.swarmagents.com/thesis/detail.asp?id=350

The dataset contains the IDs of users on the brainstorm forum of the Swarmagent Club, the contents of the threads they initiate or reply to, and the date and time of each post. The dataset covers the period from 2003 to Sep. 10, 2010. A readme.txt summarizing the dataset is also provided.

The raw data is in XML format. A CSV version of the dataset can be downloaded from (thanks to Sijie Liu’s efforts) http://www.bjt.name/wp-content/uploads/2010/12/sw_data.rar

Project 16. The Hyperreal User Browsing Dataset

http://www.cs.washington.edu/research/adaptive/download.html

The dataset contains user logs from the Music Machines web site at Hyperreal. They have been anonymized (stripped of all information about users except for their succession of accesses to the site).

Each file contains all accesses to Music Machines for a single day. The accesses are organized into paths. Each path is the series of URLs requested from a particular machine. Note that we do not distinguish among multiple users coming from the same source. We have, however, disabled caching of pages at the site so that every page must be requested, even when revisited.

A typical path will appear as below. The first line contains the originating machine (converted to unique numbers for the sake of anonymity). Each succeeding line corresponds to one URL requested from that machine. Each request contains the originating machine (O), the time of the request (T), the URL requested (U), and the referring URL (R). Fields are separated by “||”.

—O:0000002560—

O:0000002560 || T:1997/09/12-22:43:00 || U:/ || R:http://www.hyperreal.org/

O:0000002560 || T:1997/09/12-22:50:27 || U:/categories/software/ || R:http://www.hyperreal.org/music/machines/

O:0000002560 || T:1997/09/12-22:50:38 || U:/categories/software/Windows/ || R:http://www.hyperreal.org/music/machines/categories/software/

O:0000002560 || T:1997/09/12-22:50:47 || U:/categories/software/Windows/V909V03.TXT || R:http://www.hyperreal.org/music/machines/categories/software/Windows/

O:0000002560 || T:1997/09/12-22:51:06 || U:/categories/software/Windows/ || R:http://www.hyperreal.org/music/machines/categories/software/

O:0000002560 || T:1997/09/12-22:51:18 || U:/categories/software/Windows/ravemusc.txt || R:http://www.hyperreal.org/music/machines/categories/software/Windows/

Files are named m.YYMMDD.paths, where YYMMDD represents the date. The authors trained on one month of data and tested on another ten days. The logs from September and October 1997 are presented in one ZIP file per month.
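The request format described above (fields O, T, U, and R separated by “||”) is simple to parse. The sketch below turns request lines into dictionaries and computes the time between two consecutive requests. The function names are illustrative, and the separator is assumed to appear with a space on each side, as in the sample lines.

```python
from datetime import datetime


def parse_request(line):
    """Split one 'O:.. || T:.. || U:.. || R:..' request line into a dict."""
    record = {}
    for part in line.strip().split(" || "):
        key, _, value = part.partition(":")  # split on the first colon only
        record[key] = value
    record["T"] = datetime.strptime(record["T"], "%Y/%m/%d-%H:%M:%S")
    return record


def parse_path(lines):
    """Parse the request lines of one path, skipping headers and blank lines."""
    return [parse_request(line) for line in lines if line.startswith("O:")]


# The first two requests from the sample path above.
sample = [
    "O:0000002560 || T:1997/09/12-22:43:00 || U:/ || R:http://www.hyperreal.org/",
    "O:0000002560 || T:1997/09/12-22:50:27 || U:/categories/software/ || "
    "R:http://www.hyperreal.org/music/machines/",
]
path = parse_path(sample)
dwell_seconds = (path[1]["T"] - path[0]["T"]).total_seconds()
print(path[1]["U"], dwell_seconds)
```

Because caching was disabled at the site, consecutive requests in a path give a reasonable approximation of the time spent on each page.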

ir.uiowa.edu/polisci_nump/index.2.html

Twitter 2010 data set http://www.isi.edu/integration/people/lerman/load.html?src=http://www.isi.edu/~lerman/downloads/twitter/twitter2010.html