RoheLab / aPPR

Approximate Personalized Page Rank
https://rohelab.github.io/aPPR/

abstract_graph() for wikipedia hyperlinks. #12

Open karlrohe opened 4 years ago

karlrohe commented 4 years ago

Would be really cool to sample the Wikipedia hyperlink graph.

Wikipedia requests "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia." https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

So, would it be ok if we limited the number of pages downloaded? I don't know what a good number is. Is 50k too high?

Alternatively, that link above describes how one can download the data in bulk.

alexpghayes commented 4 years ago

aPPR probably isn't fast enough for this to be a big concern, but either way, Wikimedia has an open API that we could just use to pull data (including SPARQL, a SQL-like query language for knowledge graphs!!). See https://github.com/bearloga/WikidataQueryServiceR for some details.
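For a taste of what that looks like, here is a minimal sketch using WikidataQueryServiceR; the specific query (programming languages and their inception dates) is only an illustration and is not the hyperlink question below:

library(WikidataQueryServiceR)

# run an example SPARQL query against the Wikidata Query Service
langs <- query_wikidata('
  SELECT ?langLabel ?inception WHERE {
    ?lang wdt:P31 wd:Q9143 .                  # instance of: programming language
    OPTIONAL { ?lang wdt:P571 ?inception . }  # inception date, if recorded
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 10
')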

@bearloga is there any easy sample code we could riff off of to (locally, not globally) find all pages linked from a given Wikipedia page?

bearloga commented 4 years ago

@alexpghayes thanks for the shoutout and ping! :D

There are a few possible options, assuming you mean Wikipedia pages linked to from a given Wikipedia page (as opposed to external links in References sections). One worth knowing about is the Wikipedia Clickstream dataset, which records monthly counts of reader clicks from one article to another.

But yes, please don't just download a bunch of Wikipedia articles with a crawler. The Wikimedia Foundation is a non-profit organization with strict privacy & security policies, so we maintain our own data centers and do not rely on external CDNs like Cloudflare to distribute the burden of hosting and serving free knowledge.

Hope that helps!

Edit: pyWikiMM seems interesting/promising

karlrohe commented 4 years ago

Whoa. I thought clickstream was page views (node counts)... but that is another data set.

Clickstream is actual clicks (edge counts). That is amazing. Last month's English file was less than 500 MB. Totally doable, and actually more/better/interesting-er than plain hyperlinks.
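For scale, here is a rough sketch of pulling one month into R; the month and filename are just an example of the dump naming pattern on dumps.wikimedia.org:

library(readr)

# read one month of the English clickstream dump (gzipped TSV)
clicks <- read_tsv(
  "https://dumps.wikimedia.org/other/clickstream/2020-03/clickstream-enwiki-2020-03.tsv.gz",
  col_names = c("prev", "curr", "type", "n"),  # referrer, target, link type, click count
  col_types = "ccci"
)

# keep only internal article-to-article transitions
edges <- clicks[clicks$type == "link", c("prev", "curr", "n")]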

However, for simplicity, what about WikipediR::page_links?

bearloga commented 4 years ago

Oh! Yeah, totally! WikipediR::page_links would be great. Internally it calls the MediaWiki API, which is much better than web-scraping.

A few recommendations:

library(WikipediR)

# request article-namespace (namespace 0) links for two pages in one call;
# multiple titles are separated with "|"
linx <- page_links(
  "en", "wikipedia",
  page = "Aaron Halfaker|Hadley Wickham",
  namespaces = 0
)

linx$query$pages will be a list with 2 elements, one for each article. As an example, the result can be made into a tibble with:

library(purrr)

# one row per (source article, linked article) pair
map_dfr(
  linx$query$pages,
  function(page) {
    tibble::tibble(
      source = page$title,
      target = map_chr(page$links, ~ .x$title)
    )
  }
)
source           target
Aaron Halfaker   ACM Digital Library
Aaron Halfaker   Arnnon Geshuri
Aaron Halfaker   Artificial intelligence
...              ...
Hadley Wickham   Tidy data
Hadley Wickham   Tidyverse
Hadley Wickham   University of Auckland

I don't think it accepts more than 50 titles at a time, though. Also, depending on the character length of the titles, concatenating too many may hit the URI length limit; I think <2000 characters is the rule of thumb.
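If you end up with more than 50 titles, here is a rough sketch of batching the requests; `titles` is a hypothetical character vector of article names, not something from aPPR or WikipediR:

library(WikipediR)
library(purrr)

# split the titles into batches of at most 50 (the per-request limit above)
batches <- split(titles, ceiling(seq_along(titles) / 50))

all_links <- map_dfr(batches, function(batch) {
  res <- page_links(
    "en", "wikipedia",
    page = paste(batch, collapse = "|"),  # titles joined with "|"
    namespaces = 0
  )
  # flatten each response into (source, target) rows
  map_dfr(res$query$pages, function(page) {
    tibble::tibble(
      source = page$title,
      target = map_chr(page$links, ~ .x$title)
    )
  })
})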

karlrohe commented 4 years ago

This is super helpful. Thank you @bearloga !

alexpghayes commented 4 years ago

Thanks @bearloga!! Karl, as a side note, all of the aPPR internals request serially.

karlrohe commented 4 years ago

I assumed that we requested serially. That's good.