amplab / training

Training materials for Strata, AMP Camp, etc

formatting wiki pagerank #160

Open havnar opened 10 years ago

havnar commented 10 years ago

I couldn't find any links to the Wikipedia dataset used in the tutorial, so I downloaded the dumps from Wikimedia. When I run the PageRank example I get weird page titles, so midway through the code I checked which titles were being used. Is this normal?

(Also: where can I find the proper dataset used at the AMPLab?)

scala> vertices.take(50)

res14: Array[(org.apache.spark.graphx.VertexId, String)] = Array((0,""), (0,""), (0,""), (1728454431,* Toby, Marlene. ''A.A. Milne, Author of Winnie-the-Pooh''. Chicago: Childrens Press, 1995. ISBN), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (103299066,|honorific-prefix), (117890311,|name), (191986873,|honorific-suffix), (-644639137,|image), (1667503647,|order1), (-188590855,|office1), (1077807430,|term_start1), (-866339411,|term_end1), (-1122312309,|monarch1), (1499463684,|governor-general1), (-1568689980,|predecessor1), (-217292473,|successor1), (1685055658,|birth_date), (708509579,|birth_place), (368907285,|death...
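A note on reading the output above: the short entries are consistent with the vertex ID being the JVM `String.hashCode` of the "title" text (`""` gives 0, `"|"` gives 124, `"|name"` gives 117890311). That suggests every line of the raw dump, including infobox fields and citations, is being treated as a page title. The sketch below is only an illustration of that check. `TitleHash`, `idFor` and `looksLikeMarkup` are hypothetical names, not functions from the tutorial, and the markup heuristic is an assumption.

```scala
// Hypothetical helper, not taken from the tutorial code.
object TitleHash {
  // Derive a vertex ID from a title with String.hashCode, widened to Long.
  def idFor(title: String): Long = title.hashCode.toLong

  // Heuristic (an assumption): a "title" that is empty or starts with
  // wiki markup is probably a line from an article body, not a real title.
  def looksLikeMarkup(title: String): Boolean = {
    val t = title.trim
    t.isEmpty || t.startsWith("|") || t.startsWith("*") || t.startsWith("''")
  }
}

// IDs computed here can be compared with those printed by vertices.take(50).
val samples = Seq("", "|", "|name")
samples.foreach { s =>
  println(s"'$s' -> id=${TitleHash.idFor(s)}, markup=${TitleHash.looksLikeMarkup(s)}")
}
```

If most sampled titles trip a check like `looksLikeMarkup`, the input is probably not in the format the parsing step expects.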

Final output:

printing the top 10 ranked pages:

''7.08: 0.15

color:oligocene bar:NAM21 from:: 0.15

|url = http://books.google.ca/books?id=aQ84ViBNkYwC&lpg=PR1&dq=Michael%20Jordan&pg=PR1#v=onepage&q&f=true|publisher=Greenwood Press |isbn=: 0.15

*Twelve Foot Change: 0.15

''(0.02/.08): 0.15

In mammals and birds, sleep is divided into two broad types: [[rapid eye movement sleep|rapid eye movement]](REM sleep) and [[non-rapid eye movement sleep|non-rapid eye movement]](NREM or non-REM sleep). Each type has a distinct set of physiological and neurological features associated with it. REM sleep is associated with the capability of dreaming.<ref name="National">{National Institute of Neurological Disorders and Stroke. (21 May 2007). Brain basics:: 0.15

*2035: 0.15

commands: 0.15

39411: 0.15

QJT 2½: 0.15

printing the most important page within the subgraph of Wikipedia that mentions Berkeley in the title:

By contrast, [[John von Neumann|von Neumann]] recommended against floating point for the 1951 [[IAS machine]], arguing that fixed point arithmetic was preferable.<ref>{{cite web|url=http://www.cs.berkeley.edu/~wkahan/SIAMjvnl.pdf|title=The: 0.15

Zuse also proposed, but did not complete, carefully rounded floating–point arithmetic that would have included ±∞ and NaNs, anticipating features of IEEE Standard floating–point by four decades.<ref name=kahansiam>{{cite web|url=http://www.cs.berkeley.edu/~wkahan/SIAMjvnl.pdf|title=The: 0.15

* [[Mary Elizabeth Barry|Berry, Mary Elizabeth]]. (2006). ''Japan in Print: Information and Nation in the Early Modern Period.'' Berkeley: University of California Press.: 0.15

* Glahn, Richard Von. (1996). ''Fountain of Fortune: Money and Monetary Policy in China, 1000-1700.'' Berkeley: University of California Press.: 0.15
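Every score in both listings is exactly 0.15. One possible explanation, assuming the run uses the common PageRank update with a reset probability of 0.15 (an assumption about the tutorial's settings, not something confirmed in this thread): a vertex that receives no rank from any in-link ends up with the reset probability alone. A minimal sketch of that arithmetic, with `RankSketch` and `update` as hypothetical names:

```scala
// Hypothetical illustration of one PageRank update step; not GraphX code.
object RankSketch {
  // Assumed reset (teleport) probability.
  val resetProb: Double = 0.15

  // incoming: rank contributions sent along in-links to this vertex.
  def update(incoming: Seq[Double]): Double =
    resetProb + (1.0 - resetProb) * incoming.sum
}

// With no in-link contributions the result is just the reset probability.
println(RankSketch.update(Seq.empty))
```

Under that assumption, a uniform 0.15 would mean none of these vertices received any rank over an in-link, which fits the titles above not matching any link targets.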