Build a generic RDFa importer for Zotero

mbjones commented 10 years ago

Organizational Page: Zotero
Category: Coding
Title: Build a generic RDFa importer for Zotero
Proposed by: Dave Vieglais
Participants:
Summary: RDFa is a format for embedding rich metadata into HTML pages. Zotero is a free, extensible, open source online citation manager that uses importers to scrape or otherwise extract citation information from web pages. In this session, a new Zotero importer will be developed that parses RDFa in web pages to enable import into a user's citation database. Specifications for web developers will also be developed, providing guidelines for representing citation information in web pages using RDFa.
Technologies: JavaScript, RDFa (http://rdfa.info/).

lewismc commented 10 years ago

Hi @mbjones I am the project chair of Apache Any23 [0]. Any23 is a library, a web service and a command line tool that extracts structured data in RDF format from a variety of Web documents. It is one of the primary technologies used on the RDFa Test Suite: http://rdfa.info/test-suite/

I also develop Apache Tika [1] and a host of other technologies under the Apache brands so I am planning on attending this workshop and also participating in as many sessions as I can.

What steps do we need to take to make the proposed session more concrete?

Thanks Lewis

[0] http://any23.apache.org
[1] http://tika.apache.org

mbjones commented 10 years ago

Hi @lewismc Thanks for the offer of help. This session could be more concrete if you worked with @vdave on a set of specific planned products that might come out of the session and better described what would need to be done during the session.

datadavev commented 10 years ago

Zotero uses "translators" to scrape or otherwise extract content from web pages (https://github.com/zotero/translators). These are JavaScript modules that Zotero runs when visiting a particular site, falling back to more general translators when a site-specific one is not available. The majority of these translators are site specific. One exception is the COinS translator, which works with all sites. COinS is somewhat limiting in the metadata that can be expressed, and it is also not a particularly friendly format. There is also an RDF translator, though this applies to RDF/XML documents rather than HTML with RDFa markup.
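
To make the COinS limitation concrete, here is a minimal sketch of what a COinS record carries: a span with class `Z3988` whose `title` attribute holds an OpenURL ContextObject as URL-encoded key/value pairs. The sample markup and the helper names `findCoinsTitles` and `parseContextObject` are made up for illustration; this is not the code of the actual COinS translator, which works on the page DOM rather than on a string.

```javascript
// Sketch only: decode the key/value payload carried by a COinS span.
// The sample markup below is invented for illustration.
const sampleHtml =
  '<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;' +
  'rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;' +
  'rft.atitle=Embedding+citations+in+HTML&amp;' +
  'rft.jtitle=Example+Journal&amp;rft.date=2014&amp;' +
  'rft.au=Doe%2C+Jane"></span>';

// Find the title attribute of every span whose class is exactly Z3988.
// This regex assumes class comes before title, which holds for the sample;
// a real translator would query the DOM instead.
function findCoinsTitles(html) {
  const pattern = /<span[^>]*class="Z3988"[^>]*title="([^"]*)"/g;
  const titles = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    // Undo the HTML escaping of the ampersands that separate the pairs.
    titles.push(match[1].replace(/&amp;/g, "&"));
  }
  return titles;
}

// Turn one ContextObject string into a plain object.
// Keys that occur more than once (e.g. several rft.au) become arrays.
function parseContextObject(kev) {
  const fields = {};
  for (const [key, value] of new URLSearchParams(kev)) {
    if (Object.prototype.hasOwnProperty.call(fields, key)) {
      fields[key] = [].concat(fields[key], value);
    } else {
      fields[key] = value;
    }
  }
  return fields;
}

const coinsRecords = findCoinsTitles(sampleHtml).map(parseContextObject);
console.log(coinsRecords);
```

Everything lives in one flat, URL-encoded attribute, which is why richer structures (typed creators, nested containers, identifiers with their own vocabularies) are awkward to express this way.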

My suggestion for this session is to determine the feasibility of, and perhaps even develop a prototype translator for Zotero that operates similarly to the COinS translator in that it extracts citation metadata from HTML, but instead detects and extracts content expressed in RDFa rather than COinS tags.

To achieve this, it will be necessary to develop an understanding of the Zotero translator API, outline some recommendations for embedding citation information using RDFa, have a few test case documents that include RDFa markup for citations (ideally following guidelines from http://bibliontology.com/), and of course, develop the translator most likely by working from the COinS translator and also using elements from other translators such as the RDF translator or any others that happen to leverage microformats of some type.
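
As a rough feasibility sketch of the extraction step, the following pulls `property` values out of an HTML fragment and maps them onto a Zotero-style item object. The sample page, the choice of `dcterms:title`, `dcterms:creator` and `dcterms:date`, and the functions `extractProperties` and `toItem` are all assumptions for illustration. The regex is a stand-in for a real RDFa processor (it ignores `vocab`, `resource`, nesting and prefix resolution), and the naive name splitting is only good enough for the sample.

```javascript
// Sketch only: a regex-based stand-in for a real RDFa processor.
// Sample page fragment, invented for illustration (Bibliontology + DC Terms).
const samplePage = `
<div prefix="bibo: http://purl.org/ontology/bibo/ dcterms: http://purl.org/dc/terms/"
     typeof="bibo:Article">
  <span property="dcterms:title">Embedding citations in HTML</span>
  <span property="dcterms:creator">Jane Doe</span>
  <span property="dcterms:creator">John Roe</span>
  <meta property="dcterms:date" content="2014-08-07">
</div>`;

// Collect [property, value] pairs from elements that carry a property
// attribute. In RDFa, a content attribute takes precedence over the
// element's text, so it is checked first.
function extractProperties(html) {
  const element = /<\w+([^>]*\bproperty="([^"]+)"[^>]*)>([^<]*)/g;
  const pairs = [];
  let match;
  while ((match = element.exec(html)) !== null) {
    const contentAttr = /\bcontent="([^"]*)"/.exec(match[1]);
    const value = contentAttr ? contentAttr[1] : match[3].trim();
    pairs.push([match[2], value]);
  }
  return pairs;
}

// Map the extracted pairs onto an object shaped like a Zotero item.
// The item type is hard-coded here; a real translator would derive it
// from the typeof attribute.
function toItem(pairs) {
  const item = { itemType: "journalArticle", creators: [] };
  for (const [property, value] of pairs) {
    if (property === "dcterms:title") {
      item.title = value;
    } else if (property === "dcterms:date") {
      item.date = value;
    } else if (property === "dcterms:creator") {
      // Naive split: last word is the family name.
      const parts = value.split(" ");
      item.creators.push({
        firstName: parts.slice(0, -1).join(" "),
        lastName: parts[parts.length - 1],
        creatorType: "author",
      });
    }
  }
  return item;
}

const rdfaItem = toItem(extractProperties(samplePage));
console.log(JSON.stringify(rdfaItem, null, 2));
```

In an actual translator, logic along these lines would sit behind the `detectWeb` and `doWeb` entry points and operate on the page's DOM, as the COinS translator does.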

williamgunn commented 10 years ago

We spoke a little bit about this in the session just now and I mentioned PRISM as a fairly well accepted standard, adopted by Google Scholar and Mendeley. It's Dublin Core and not RDFa, but maybe gets us to the same place? At the very least, it would be a good template.

paregorios commented 9 years ago

I hope you wouldn't mind some comments from someone who just stumbled across this thread:

Zotero supports the bibliontology vocabulary (which incorporates some terms from prism, dcterms, etc.) in its most robust RDF export format. I'd +1 the creation of a translator that understands the same terminological space as the existing RDF+Bibliontology export but expressed in HTML5+RDFa. I'd also be interested in collaborating on same.

FWIW, the small publications list on my home page is marked up with Bibliontology-aware RDFa as a possible example use case.

lewismc commented 9 years ago

@paregorios I wish that more personal webpages were marked up like yours :) I just ran it through Any23 via our live service demo (http://any23-vm.apache.org/) and got a bunch of structure back. We appear to extract all structure marked up with your biblio RDFa. Unfortunately I was not able to contribute to this session during the OSCodefest, but I got your update as I was originally interested in this thread.

mbjones commented 9 years ago

Same here, I was unable to attend the session. Maybe @vdave or @williamgunn can update us as to progress. It seems to me that using Any23 to translate various input tagging schemes into something that Zotero can handle would be really useful and broaden the biblio markup that Zotero can handle.

chrismattmann commented 9 years ago

That would be feed Zotero via Any23

chrismattmann commented 9 years ago

Would be rad to feed

lewismc commented 9 years ago

@vdave the main issues I have with your description of the Zotero drivers are:

they are site/domain specific
they are JavaScript. How can I purposefully use them in server-side applications?

@vdave essentially what you describe in your comment on Aug 7 is Any23!!!

You can find more about Apache Any23 here: http://any23.apache.org. Please come back here and comment if you guys are interested in driving this onwards. Thanks Lewis

lewismc commented 9 years ago

Hi @vdave I've scanned the Zotero repos. I am kind of overwhelmed by the sheer volume of site-specific, vocabulary-specific JavaScript you have! Who maintains all of these files? Do you track provenance? How do you invoke Zotero? I think you guys must support a whole variety of broken vocabularies. I wonder if you ever looked into Any23? I would really like to work on this with you guys. Thanks Lewis

lewismc commented 9 years ago

OK, I am really interested and utterly confused now. The documentation [0] states that all Zotero translators should be disseminated under AGPL!!!! Most of the code under the JS codebase is not licensed at all and it seems to be a free for all. I am pretty confused as to what Zotero is trying to achieve here and as to what end this entire JavaScript repository is meant to be! I would really appreciate it if someone could enlighten us as to the strategic vision for the entire Zotero JS codebase as well as the project as a whole. It looks like some files were last edited 8 years ago! Thanks Lewis

[0] https://www.zotero.org/support/dev/translators

paregorios commented 9 years ago

So, I can't speak to the licensing, but here's what I understand (from the outside looking in) to be the Zotero "translators" use cases:

Facilitate scraping by humanist scholars and students with varying levels of computational acumen of bibliographic data from websites using the browser-embedded or browser-enabled standalone client
Facilitate import of bibliographic data in various formats from file, again using the client software
Maybe the Write API is involved too? I don't know.
In light of limited central resources and wide variation in what's in the wild on the web that humanists might want to cite and capture, enable a broadly collaborative environment for "translator" development, with the idea that many contributors will be new to software development

I can't speak to the choice of JavaScript, but the level of technical hurdle for new coders coming from the humanities might have been a factor. Also, Zotero began, if I understand it right, as a browser plugin/client, so those origins could well have played a role.

paregorios commented 9 years ago

So, as to the overall direction of this thread, I think there are two use cases:

Enable the Zotero clients to scrape some species of RDFa from web pages. There is currently some support in the Zotero translator base for RDFa in meta tags only.
Enable an import-to-Zotero workflow via Apache Any23, leveraging Any23's ability to detect, extract, interpret, and normalize a wide range of metadata from web pages.

Is that right?
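
For the second use case, a minimal sketch of the hand-off might look like the following, assuming the extractor's output has already been serialized as N-Triples. The sample triples, the type and field mappings, and the function `triplesToItems` are hypothetical; the parser only handles IRIs and plain string literals (no language tags, datatypes, blank nodes or escape decoding), so it illustrates the mapping step rather than a complete importer.

```javascript
// Sketch only: group N-Triples by subject and map them to Zotero-style items.
// The triples below are invented for illustration.
const sampleTriples = `
<http://example.org/pubs#p1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/Article> .
<http://example.org/pubs#p1> <http://purl.org/dc/terms/title> "Embedding citations in HTML" .
<http://example.org/pubs#p1> <http://purl.org/dc/terms/date> "2014" .
<http://example.org/pubs#p2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/Book> .
<http://example.org/pubs#p2> <http://purl.org/dc/terms/title> "A Book About Markup" .
`;

const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

// Assumed mapping from Bibliontology classes to Zotero item types.
const TYPE_MAP = {
  "http://purl.org/ontology/bibo/Article": "journalArticle",
  "http://purl.org/ontology/bibo/Book": "book",
};

// Assumed mapping from predicates to Zotero field names.
const FIELD_MAP = {
  "http://purl.org/dc/terms/title": "title",
  "http://purl.org/dc/terms/date": "date",
};

function triplesToItems(ntriples) {
  // <subject> <predicate> (<iri> | "plain literal") .
  const line = /^<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"((?:[^"\\]|\\.)*)")\s*\.$/;
  const bySubject = new Map();
  for (const raw of ntriples.split("\n")) {
    const match = line.exec(raw.trim());
    if (!match) continue; // skip blank or unsupported lines
    const [, subject, predicate, iri, literal] = match;
    if (!bySubject.has(subject)) {
      // Fall back to a generic item type until an rdf:type is seen.
      bySubject.set(subject, { itemType: "document" });
    }
    const item = bySubject.get(subject);
    if (predicate === RDF_TYPE && TYPE_MAP[iri]) {
      item.itemType = TYPE_MAP[iri];
    } else if (FIELD_MAP[predicate] && literal !== undefined) {
      item[FIELD_MAP[predicate]] = literal;
    }
  }
  return [...bySubject.values()];
}

const importedItems = triplesToItems(sampleTriples);
console.log(importedItems);
```

The point of the sketch is that the vocabulary mapping is the only Zotero-specific part; the detection and normalization of markup would stay on the extractor side.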

adam3smith commented 9 years ago

I develop Zotero translators, though I can't speak for the project as a whole.

> Most of the code under the JS codebase is not licensed at all and it seems to be a free for all.

You can assume AGPL for all Zotero code. We're in the process of clarifying this for translators, but note that unlicensed code is, by US law, not "free for all" but subject to standard copyright, so more restricted than AGPL.

> I am pretty confused as to what Zotero is trying to achieve here and as to what end this entire JavaScript repository is meant to be!

Zotero translators are a part of working and very successful software with a user base of several hundred thousand (mostly) academics. The translators enable Zotero users to import data from a huge number of sites with very different availability of structured metadata into the Zotero reference management software. That includes sites with no structured data at all, for which we simply scrape information off the page via XPaths; sites which have structured data in a variety of formats (MARC, MODS, RIS etc.) one GET/POST request away; and sites with embedded metadata in various formats (though currently not RDFa, hence, I assume, the initial request). Given this landscape, site-specific code is the only available option. Hence the large number of individual files.

That said, not all of the translators are site specific and an RDFa translator obviously wouldn't be. As suggested by @vdave above, the COinS translator (one of the four translators run on every site) could serve as a model.

> Who maintains all of these files?

Several volunteers under the roof of the Zotero project at GMU.

> Do you track provenance?

Not very seriously, no, though the creators of individual files are listed in the JSON headers.
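
For readers who have not opened a translator file, here is a sketch of the kind of metadata header meant here, written as a JavaScript object. Every value is a placeholder for a hypothetical generic RDFa translator, not an existing one, and the field list reflects the header layout as commonly seen in the translators repository rather than an authoritative schema. The helper `describeTypes` is made up for illustration; it decodes `translatorType`, which is a bit field (1 = import, 2 = export, 4 = web, 8 = search).

```javascript
// Sketch only: the shape of a translator metadata header.
// All values are placeholders for a hypothetical generic RDFa translator.
const TRANSLATOR_TYPES = { import: 1, export: 2, web: 4, search: 8 };

const header = {
  translatorID: "00000000-0000-0000-0000-000000000000", // placeholder GUID
  label: "Generic RDFa (hypothetical)",
  creator: "Your Name", // this is where file authorship is recorded
  target: "", // assumption: an empty target lets the translator run on any page
  minVersion: "3.0",
  maxVersion: "",
  priority: 300, // placeholder value
  inRepository: false,
  translatorType: TRANSLATOR_TYPES.web,
  lastUpdated: "2014-11-25 00:00:00",
};

// Decode the translatorType bit field into readable names.
function describeTypes(meta) {
  return Object.keys(TRANSLATOR_TYPES).filter(
    (name) => (meta.translatorType & TRANSLATOR_TYPES[name]) !== 0
  );
}

console.log(describeTypes(header));
console.log(JSON.stringify(header, null, 2));
```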

> How do you invoke Zotero?

The files are actually shipped with and run as part of Zotero, and are invoked via https://github.com/zotero/zotero/blob/4.0/chrome/content/zotero/xpcom/translation/translate.js

We have recently seen the first serious attempts to use the translators outside of Zotero (https://github.com/proquest/pme), apparently with good success.

There are several reasons for JavaScript, but the simplest one is that Zotero as a whole is written in JavaScript/XULRunner.

lewismc commented 9 years ago

@adam3smith thank you for the background and explanations

lewismc commented 9 years ago

@paregorios I would be interested in working on an Any23 --> Zotero plugin/code. The only thing I am wondering is whether a generic RDF importer for Zotero already exists. @adam3smith can you comment here? Where are the current importers for Zotero? Can you point me to the code if there is a collection of them? Thanks

NCEAS / open-science-codefest

Build a generic RDFa importer for Zotero #6