karlrohe opened 4 years ago
aPPR probably isn't fast enough to worry about this too much, but either way, Wikimedia has an open API that we could just use to pull data (including SPARQL, a SQL-like query language for knowledge graphs!!). See https://github.com/bearloga/WikidataQueryServiceR for some details.
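For anyone curious what that looks like, here's a minimal sketch using `WikidataQueryServiceR` (the query text is illustrative, not something from this thread; it just pulls a handful of items from the Wikidata Query Service):

```r
library(WikidataQueryServiceR)

# Ask the Wikidata Query Service (a SPARQL endpoint) for five items that are
# instances of human (wdt:P31 = "instance of", wd:Q5 = "human"), with English
# labels. query_wikidata() returns the results as a data frame.
res <- query_wikidata('
  SELECT ?person ?personLabel WHERE {
    ?person wdt:P31 wd:Q5 .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 5
')
```

Note this requires network access to query.wikidata.org.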
@bearloga is there any easy sample code we could riff off of to (locally, not globally) find all pages linked from a given wikipedia page?
@alexpghayes thanks for the shoutout and ping! :D
Here are some possible options, assuming you mean Wikipedia pages linked to from a given Wikipedia page (as opposed to external links in References sections):
But yes, please don't just download a bunch of Wikipedia articles with a crawler. The Wikimedia Foundation is a non-profit organization with strict privacy & security policies, so we maintain our own data centers and do not rely on external CDNs like Cloudflare to distribute the burden of hosting and serving free knowledge.
Hope that helps!
Edit: pyWikiMM seems interesting/promising
whoa. I thought clickstream was page views (node counts)... but that is another data set.
clickstream is actual clicks (edge counts). That is amazing. Last month was less than 500 MB for English. totally do-able and actually more/better/interesting-er than simply hyperlinks.
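For reference, the monthly clickstream dumps (https://dumps.wikimedia.org/other/clickstream/) are TSVs with four columns: referring page, target page, link type, and count. A sketch of loading one as a weighted edge list — the inline string below just mimics a few rows of that format; in practice you'd point `read_tsv()` at a downloaded dump file:

```r
library(readr)

# Two fake rows in the clickstream format (prev, curr, type, n).
# A real file would be something like "clickstream-enwiki-YYYY-MM.tsv.gz".
raw <- "Hadley_Wickham\tTidyverse\tlink\t120\nother-search\tHadley_Wickham\texternal\t300\n"

edges <- read_tsv(
  I(raw),
  col_names = c("prev", "curr", "type", "n"),
  col_types = "ccci"
)

# Keep only true article-to-article hyperlink transitions as weighted edges;
# other type values (e.g. "external") are traffic from outside Wikipedia.
links <- edges[edges$type == "link", ]
```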
However, for simplicity, what about `WikipediR::page_links`?
Oh! Yeah, totally! `WikipediR::page_links` would be great. Internally it calls the MediaWiki API, which is much better than web-scraping.
A few recommendations:

- Use `namespaces = 0` to limit links to those within the (Article) namespace
- Request multiple titles in a single call, separated by `|`, to limit the number of individual API requests (per etiquette guidelines), for example:

```r
library(WikipediR)

linx <- page_links(
  "en", "wikipedia",
  page = "Aaron Halfaker|Hadley Wickham",
  namespaces = 0
)
```
`linx$query$pages` will be a list with 2 elements, one for each article. As an example, the result can be made into a tibble with:
```r
library(purrr)

map_dfr(
  linx$query$pages,
  function(page) {
    tibble::tibble(source = page$title, target = map_chr(page$links, ~ .x$title))
  }
)
```
| source | target |
|---|---|
| Aaron Halfaker | ACM Digital Library |
| Aaron Halfaker | Arnnon Geshuri |
| Aaron Halfaker | Artificial intelligence |
| ... | ... |
| Hadley Wickham | Tidy data |
| Hadley Wickham | Tidyverse |
| Hadley Wickham | University of Auckland |
I don't think it accepts more than 50 titles at a time, though. Also, depending on the character length of the titles, concatenating too many may hit the URI length limit. I think <2000 characters is the rule of thumb.
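One way to stay under that cap is to batch the titles client-side before calling `page_links`. A sketch, assuming a limit of 50 titles per request (`chunk_titles` is a hypothetical helper, not part of WikipediR):

```r
# Split a vector of page titles into pipe-joined batches of at most `size`
# titles, so each batch fits into a single page_links() call.
chunk_titles <- function(titles, size = 50) {
  groups <- split(titles, ceiling(seq_along(titles) / size))
  vapply(groups, paste, character(1), collapse = "|")
}

# e.g. 120 titles get packed into 3 batches of 50, 50, and 20
batches <- chunk_titles(paste0("Page_", 1:120))
```

Each element of `batches` could then be passed as the `page` argument in its own request, ideally with a short pause between calls per the etiquette guidelines.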
This is super helpful. Thank you @bearloga !
Thanks @bearloga!! Karl, as a side note, all of the aPPR internals request serially.
I assumed that we requested serially. That's good.
Would be really cool to sample the wikipedia hyperlink graph.
Wikipedia requests "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia." https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
So, would it be ok if we limited the number of pages downloaded? I don't know what a good number is. Is 50k too high?
Alternatively, that link above describes how one can download the data in bulk.