RoheLab / aPPR

Approximate Personalized Page Rank
https://rohelab.github.io/aPPR/

abstract_graph() for wikipedia hyperlinks. #12

Open karlrohe opened 4 years ago

karlrohe commented 4 years ago

Would be really cool to sample the Wikipedia hyperlink graph.

Wikipedia requests "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia." https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

So, would it be ok if we limited the number of pages downloaded? I don't know what a good number is. Is 50k too high?

Alternatively, that link above describes how one can download the data in bulk.

alexpghayes commented 4 years ago

aPPR probably isn't fast enough for this to be a big concern, but either way, Wikimedia has an open API that we could just use to pull data (including SPARQL, a SQL-like query language for knowledge graphs!!). See https://github.com/bearloga/WikidataQueryServiceR for some details.
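For a taste of what that looks like, here is a minimal sketch using WikidataQueryServiceR; the specific query (programming languages and their inception dates) is only an illustration and is not the hyperlink question below:

library(WikidataQueryServiceR)

# run an example SPARQL query against the Wikidata Query Service
langs <- query_wikidata('
  SELECT ?langLabel ?inception WHERE {
    ?lang wdt:P31 wd:Q9143 .                  # instance of: programming language
    OPTIONAL { ?lang wdt:P571 ?inception . }  # inception date, if recorded
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 10
')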

@bearloga is there any easy sample code we could riff off of to (locally, not globally) find all pages linked from a given Wikipedia page?

bearloga commented 4 years ago

@alexpghayes thanks for the shoutout and ping! :D

There are a few possible options, assuming you mean Wikipedia pages linked to from a given Wikipedia page (as opposed to external links in References sections). One worth knowing about is the Wikipedia Clickstream dataset, which records monthly counts of reader clicks from one article to another.

But yes, please don't just download a bunch of Wikipedia articles with a crawler. The Wikimedia Foundation is a non-profit organization with strict privacy & security policies, so we maintain our own data centers and do not rely on external CDNs like Cloudflare to distribute the burden of hosting and serving free knowledge.

Hope that helps!

Edit: pyWikiMM seems interesting/promising

karlrohe commented 4 years ago

Whoa. I thought clickstream was page views (node counts)... but that is another data set.

Clickstream is actual clicks (edge counts). That is amazing. Last month's English file was less than 500 MB. Totally doable, and actually more/better/interesting-er than plain hyperlinks.
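For scale, here is a rough sketch of pulling one month into R; the month and filename are just an example of the dump naming pattern on dumps.wikimedia.org:

library(readr)

# read one month of the English clickstream dump (gzipped TSV)
clicks <- read_tsv(
  "https://dumps.wikimedia.org/other/clickstream/2020-03/clickstream-enwiki-2020-03.tsv.gz",
  col_names = c("prev", "curr", "type", "n"),  # referrer, target, link type, click count
  col_types = "ccci"
)

# keep only internal article-to-article transitions
edges <- clicks[clicks$type == "link", c("prev", "curr", "n")]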

However, for simplicity, what about WikipediR::page_links?

bearloga commented 4 years ago

Oh! Yeah, totally! WikipediR::page_links would be great. Internally it calls the MediaWiki API, which is much better than web-scraping.

A few recommendations:

library(WikipediR)

# request article-namespace (namespace 0) links for two pages in one call;
# multiple titles are separated with "|"
linx <- page_links(
  "en", "wikipedia",
  page = "Aaron Halfaker|Hadley Wickham",
  namespaces = 0
)

linx$query$pages will be a list with 2 elements, one for each article. As an example, the result can be made into a tibble with:

library(purrr)

# one row per (source article, linked article) pair
map_dfr(
  linx$query$pages,
  function(page) {
    tibble::tibble(
      source = page$title,
      target = map_chr(page$links, ~ .x$title)
    )
  }
)
source           target
Aaron Halfaker   ACM Digital Library
Aaron Halfaker   Arnnon Geshuri
Aaron Halfaker   Artificial intelligence
...              ...
Hadley Wickham   Tidy data
Hadley Wickham   Tidyverse
Hadley Wickham   University of Auckland

I don't think it accepts more than 50 titles at a time, though. Also, depending on the character length of the titles, concatenating too many may hit the URI length limit; I think <2000 characters is the rule of thumb.
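If you end up with more than 50 titles, here is a rough sketch of batching the requests; `titles` is a hypothetical character vector of article names, not something from aPPR or WikipediR:

library(WikipediR)
library(purrr)

# split the titles into batches of at most 50 (the per-request limit above)
batches <- split(titles, ceiling(seq_along(titles) / 50))

all_links <- map_dfr(batches, function(batch) {
  res <- page_links(
    "en", "wikipedia",
    page = paste(batch, collapse = "|"),  # titles joined with "|"
    namespaces = 0
  )
  # flatten each response into (source, target) rows
  map_dfr(res$query$pages, function(page) {
    tibble::tibble(
      source = page$title,
      target = map_chr(page$links, ~ .x$title)
    )
  })
})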

karlrohe commented 4 years ago

This is super helpful. Thank you @bearloga !

alexpghayes commented 4 years ago

Thanks @bearloga!! Karl, as a side note, all of the aPPR internals request serially.

karlrohe commented 4 years ago

I assumed that we requested serially. That's good.