Integrate paper metadata handling with Wikidata

This post is inspired by the BibTeX from Wikidata functionality described in https://larsgw.blogspot.de/2016/09/citationjs-on-command-line.html .

Some thoughts on how to integrate ContentMine's paper metadata handling with Wikidata:

- if a ContentMine pipeline (or any reference file in BibTeX or similar format, for that matter) touches bibliographic metadata of scholarly articles, check whether Wikidata items for these articles already exist (e.g. via P932, P356, P698).
  - if yes, it might simply trigger an integrity check of these metadata, perhaps identify the main topic (P921) or do nothing for the moment
  - if no, it should start the missing items with at least some basic properties (e.g. P31:Q13442814 and the respective value for a persistent identifier). If this would leave the items incomplete with respect to Wikidata's data model for scholarly articles, the missing pieces could be handled by the mostly existing pipelines around constraint violations.
- in addition to the existing ContentMine pipelines to search by dictionaries, it might be interesting to have some functionality to search the literature (across all or selected dictionaries) by contributions from particular authors, institutions, journals, dates or some such, with which Wikidata could help
- What about running ContentMine over Wikipedia dumps to identify facts?
  - if these facts are referenced on Wikipedia to scholarly sources, ContentMine could check whether the indicated sources actually support the statement, and flag cases where that's not clear
  - if the Wikipedia statements lack scholarly references, ContentMine might be able to find some
  - as above, the metadata of the scholarly references would go to Wikidata, from where it might be pulled into the respective Wikipedia article by way of some variant of Module:Cite.
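
A minimal sketch of the first idea (look up by identifier, then either check or create), in Python. The function names `build_lookup_query` and `plan_action` and the shape of the returned dictionaries are hypothetical, not part of getpapers or of any Wikidata library; only the property and item IDs come from the list above. Sending the query to the Wikidata SPARQL endpoint and actually writing claims are deliberately left out.

```python
from typing import Dict, List

# Properties named in the post: P932 (PMCID), P356 (DOI), P698 (PubMed ID).
ID_PROPERTIES: Dict[str, str] = {"pmcid": "P932", "doi": "P356", "pmid": "P698"}


def build_lookup_query(id_type: str, value: str) -> str:
    """Return a SPARQL query that finds items carrying the given identifier.

    Normalising the value to the form Wikidata stores is left to the caller.
    """
    prop = ID_PROPERTIES[id_type]
    # Escape backslashes and double quotes so the value is a valid SPARQL literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    # LIMIT 2 is enough to tell "none", "exactly one" and "more than one" apart.
    return f'SELECT ?item WHERE {{ ?item wdt:{prop} "{escaped}" . }} LIMIT 2'


def plan_action(existing_items: List[str], id_type: str, value: str) -> Dict:
    """Decide what to do for one article, given the QIDs the lookup returned."""
    if len(existing_items) > 1:
        # Several items share one identifier: leave it to a human.
        return {"action": "review", "items": existing_items}
    if existing_items:
        # Item exists: only an integrity check (or nothing) is needed.
        return {"action": "check", "item": existing_items[0]}
    # No item yet: start one with "instance of: scholarly article" plus the identifier.
    return {
        "action": "create",
        "claims": {"P31": "Q13442814", ID_PROPERTIES[id_type]: value},
    }
```

Keeping the decision separate from the network call means the same logic could sit behind a bot, the primary-sources tool, or a dry-run report.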

These are definitely interesting ideas.

The first idea appears to me to be the most readily actionable at the moment. What sort of workflow do you envisage for this? I hope this doesn't seem like a barrage of questions; I just want to check that I've understood everything correctly.

Should we build a bot and seek approval from the community to add these items? On the call, we mainly discussed interacting via the primary-sources tool, although we mostly talked about 'facts' rather than paper metadata. Given that this data is already 'curated' by either the NCBI or CrossRef, perhaps this isn't such a problem.

Which metadata should we consider adding? If we are looking at all the new publications on a given day, should these all be added to Wikidata? My impression from WikiCite was that this kind of blanket adding, where there is neither a structural need nor real notability for a given publication, should be avoided (and such items should perhaps go into Librarybase instead?).

With Librarybase, one place I started was importing only works that were referenced on enwiki (but we could choose any or all wikis).

I think having Wikidata entries for all works cited from enwiki is reasonable, and expansion to all works cited by any Wikimedia project (at least from their content namespaces) should come as soon as possible thereafter.

Blanket addition beyond that may cause problems but still makes sense in the long run, as that would contribute to the goal of turning Wikidata into an open citation graph. So anything cited anywhere from "within scope" would eventually get a Wikidata item, and after some time, I could well imagine encouraging people to upload their BibTeX files to some tool on Wikimedia labs that would then check these files against the Wikidata corpus and add info / flag inconsistencies as needed.
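
A rough sketch of what the "check a BibTeX file against Wikidata" step could look like for a single entry, in Python. It assumes the BibTeX entry and the Wikidata record have already been parsed into plain dictionaries with the same field names; `normalise`, `compare_entry` and the report keys are hypothetical names, not an existing tool.

```python
from typing import Dict, List


def normalise(value: str) -> str:
    """Strip BibTeX braces, lower-case, and collapse whitespace before comparing."""
    cleaned = value.replace("{", "").replace("}", "")
    return " ".join(cleaned.lower().split())


def compare_entry(bib: Dict[str, str], wikidata: Dict[str, str]) -> Dict[str, List[str]]:
    """Report fields Wikidata lacks ("add info") and fields that disagree ("flag")."""
    report: Dict[str, List[str]] = {"missing_in_wikidata": [], "conflicts": []}
    for field, bib_value in bib.items():
        wd_value = wikidata.get(field)
        if wd_value is None:
            report["missing_in_wikidata"].append(field)
        elif normalise(bib_value) != normalise(wd_value):
            report["conflicts"].append(field)
    return report


# Illustrative, made-up records.
bib_entry = {"title": "{Zika} virus  in Brazil", "year": "2016",
             "doi": "10.1000/ABC", "journal": "J"}
wd_record = {"title": "Zika virus in Brazil", "year": "2015", "doi": "10.1000/abc"}
report = compare_entry(bib_entry, wd_record)
```

Here `report` lists `journal` as missing from Wikidata and `year` as a conflict, while the title and DOI match after normalisation.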

Perhaps we can start by sharing, in a standard fashion, the publications that CM has read on a given day, possibly along with things mined from them? We could then go over that feed and hopefully become more specific about the respective workflows for Wikidata/Librarybase, and demo things with Zika.
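
One possible shape for such a feed, sketched in Python as JSON Lines (one record per paper). The field names (`date_read`, `doi`, `pmid`, `pmcid`, `mined`) and the function `feed_lines` are assumptions made for illustration only; no such format has been agreed in this thread.

```python
import json
from datetime import date
from typing import Dict, Iterable, List


def feed_lines(day: date, papers: Iterable[Dict]) -> List[str]:
    """Serialise one day's papers as JSON Lines, one JSON object per paper."""
    lines = []
    for paper in papers:
        record = {
            "date_read": day.isoformat(),
            # Identifiers that could later be matched against Wikidata.
            "doi": paper.get("doi"),
            "pmid": paper.get("pmid"),
            "pmcid": paper.get("pmcid"),
            # Whatever was mined from the paper, if anything.
            "mined": paper.get("mined", []),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines


# Illustrative, made-up input.
lines = feed_lines(date(2016, 10, 10), [{"doi": "10.1000/xyz", "mined": ["Zika virus"]}])
```

A line-per-record format keeps the feed easy to append to during the day and easy to go over with standard tools.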

I think these are all in scope for the WikiFactMine project - it will depend on details.

On Mon, Oct 10, 2016 at 4:46 AM, Daniel Mietchen notifications@github.com wrote:

This post is inspired by the BibTeX from Wikidata functionality described in https://larsgw.blogspot.de/2016/09/citationjs-on-command-line.html .

Yes - Lars has done a super job.

Some thoughts on how to integrate ContentMine's paper metadata handling with Wikidata:

if a ContentMine pipeline (or any reference file in BibTeX or similar format, for that matter) touches bibliographic metadata of scholarly articles, check whether Wikidata items for these articles already exist (e.g. via P932, P356, P698).

if yes, it might simply trigger an integrity check of these metadata, perhaps identify the main topic (P921) or do nothing for the moment

if no, it should start the missing items with at least some basic properties (e.g. P31:Q13442814 and the respective value for a persistent identifier). If this would leave the items incomplete with respect to Wikidata's data model for scholarly articles, the missing pieces could be handled by the mostly existing pipelines around constraint violations.

I think this is a great place to start learning about Wikidata and normalized metadata. Because the scholarly literature is not pr

in addition to the existing ContentMine pipelines to search by dictionaries, it might be interesting to have some functionality to search the literature (across all or selected dictionaries) by contributions from particular authors, institutions, journals, dates or some such, with which Wikidata could help

Yes - bibliography is starting to emerge as critical and I think we can and should address it. We can't do all of it, but it needs to be integrated into the facts.

What about running ContentMine over Wikipedia dumps to identify facts?

if these facts are referenced on Wikipedia to scholarly sources, ContentMine could check whether the indicated sources actually support the statement, and flag cases where that's not clear

if the Wikipedia statements lack scholarly references, ContentMine might be able to find some

as above, the metadata of the scholarly references would go to Wikidata, from where it might be pulled into the respective Wikipedia article by way of some variant of Module:Cite.

I'll bounce this around with Magnus

Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069
