EOL / tramea

A lightweight server for denormalized EOL data

New Wikipedia Connectors #327

Open jhammock opened 8 years ago

jhammock commented 8 years ago

(Using GitHub rather than Jira to facilitate communication.)

Let us start with an investigation, but if it turns out to be as promising as it sounds, this is probably a good time to update our Wikipedia and Wikimedia Commons connectors to leverage Wikidata taxon identifiers. This was @hyanwong's suggestion, and you may want to reuse some of his work along the way. He has been busy with Wikidata lately.

Excerpts from the Gitter discussion are below.

@hyanwong 14:00 @jhammock I'm extracting all known taxa from WD (1,492,469 taxa at the latest count) and using that to (...) c) find Wikipedia entries for all taxa without having to go through taxonomic name resolution. With (c) I can also look at page sizes and page visits for all taxa, and hence calculate popularity measures. Not sure if any of this is of interest/use to you. Let me know if so.

I think getting Wikipedia page names for harvest (along with Wikimedia image categories/galleries) via Wikidata is far more robust (and potentially faster) nowadays than looking for taxonomic information on the pages themselves.

Going via Wikidata means you should automatically get the non-English-language pages for taxa. Happy to chat with Eli if he needs some pointers. Wikidata has pointers to taxon pages for all languages (they are called 'sitelinks'); e.g., look at the bottom of https://www.wikidata.org/wiki/Q737838
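A rough Python sketch of the per-item workflow described above: it pulls the sitelinks for Q737838 through the standard `wbgetentities` API, then sums a month of page views from the Wikimedia pageviews REST API as a crude popularity proxy. The date window, the use of the English sitelink, and the script itself are illustrative assumptions, not EOL's harvester code.

```python
import requests
from urllib.parse import quote

WD_API = "https://www.wikidata.org/w/api.php"
PV_API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
          "en.wikipedia/all-access/all-agents/{article}/daily/{start}/{end}")

def sitelinks(qid):
    """Fetch the per-wiki page titles ('sitelinks') for one Wikidata item."""
    resp = requests.get(WD_API, params={"action": "wbgetentities", "ids": qid,
                                        "props": "sitelinks", "format": "json"})
    resp.raise_for_status()
    return resp.json()["entities"][qid]["sitelinks"]

links = sitelinks("Q737838")
print(sorted(links)[:5])  # per-language wikis, e.g. 'dewiki', 'frwiki', ...

en_title = links.get("enwiki", {}).get("title")
if en_title:
    # One month of page views as a crude popularity proxy (window is arbitrary).
    url = PV_API.format(article=quote(en_title.replace(" ", "_"), safe=""),
                        start="20170101", end="20170131")
    data = requests.get(url, headers={"User-Agent": "wikidata-taxa-demo/0.1"}).json()
    print(en_title, sum(day["views"] for day in data.get("items", [])))
```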

hyanwong commented 8 years ago

My suggestion is that instead of parsing the Wikimedia/Wikipedia wikitext looking for Taxonavigation templates, EOL should go to the Wikidata JSON dump ('latest' at https://dumps.wikimedia.org/wikidatawiki/entities/) and scan each line for entities whose property P31 ("instance of") is set to Q16521 ("taxon"). These entities have 'sitelinks' fields which point to the relevant Wikimedia Commons gallery and the various language Wikipedia pages (e.g. see https://www.wikidata.org/wiki/Q36611). In the statements section, many of these items will also have a link to the Wikimedia Commons 'category', which is usually a more comprehensive source of images.
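A minimal sketch of that dump scan, assuming the `latest-all.json.gz` file from the URL above and the standard one-entity-per-line layout of the Wikidata JSON dumps; the P373 helper covers the Commons 'category' statement mentioned above. Again, an illustration rather than EOL's harvesting code.

```python
import gzip
import json

DUMP_PATH = "latest-all.json.gz"  # from https://dumps.wikimedia.org/wikidatawiki/entities/

def is_taxon(entity):
    """True if any P31 ("instance of") claim points at Q16521 ("taxon")."""
    for claim in entity.get("claims", {}).get("P31", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and value.get("id") == "Q16521":
            return True
    return False

def commons_category(entity):
    """The Commons 'category' name, stored as a plain-string P373 claim."""
    for claim in entity.get("claims", {}).get("P373", []):
        snak = claim["mainsnak"]
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None

with gzip.open(DUMP_PATH, "rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.strip().rstrip(",")   # one entity per line, comma-separated
        if line in ("[", "]", ""):        # skip the array brackets around the dump
            continue
        entity = json.loads(line)
        if not is_taxon(entity):
            continue
        sitelinks = entity.get("sitelinks", {})
        print(entity["id"],
              sitelinks.get("enwiki", {}).get("title"),       # English Wikipedia page
              sitelinks.get("commonswiki", {}).get("title"),  # Commons gallery
              commons_category(entity))                       # Commons category (P373)
```

Streaming the dump line by line this way avoids loading the multi-gigabyte JSON array into memory.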

eliagbayani commented 8 years ago

Thanks @hyanwong, input much appreciated. I will probably have a couple more questions when I get to this task.

hyanwong commented 8 years ago

@eliagbayani sure. This is quite a large task, though: it requires rewriting the Wikimedia and Wikipedia harvesting routines (I rewrote the Wikimedia one a few years ago, so am probably the most familiar with it). Not sure if the idea is to start porting these to Ruby where possible. Maybe ask @JRice?

KatjaSchulz commented 7 years ago

@eliagbayani We're excited about the prospect of harvesting Wikipedia in all available languages. It would probably be best to establish a separate resource for each language, but have all the resources united under the Wikipedia content partner. This would allow us to have different harvesting schedules for different languages, and we would get an idea of the amount of content available for different taxa in different languages.

eliagbayani commented 7 years ago

Some stats from the Wikidata JSON dump of March 6, 2017: