Extracting metadata from COinS and Dublin Core

mfenner commented 9 years ago

How good are the metadata provided by COinS, Dublin Core, and related tags (see http://scholar.google.com/intl/en/scholar/inclusion.html#indexing)? Do we not need a specific web translator when these are used, or is there usually something missing?

aurimasv commented 9 years ago

Using Dublin Core and other similar vocabularies (particularly HighWire and PRISM) in meta tags, you can specify very rich metadata (we refer to this as Embedded Metadata, btw). You can even specify PDF URLs for automatic attachment (I don't believe that's possible with any other generic translator, but don't quote me on that). One downside of using these tags is that you cannot specify more than one item per page.

Another serious shortcoming of Embedded Metadata atm is that it is at the bottom of the food chain of all translators. That is, all other generic translators, including DOI, COinS, and unAPI, will override anything you specify in the meta tags. So if you have a DOI on the page, or embed some COinS, those are the import options you will get, not Embedded Metadata. In Firefox, you can right-click the URL bar icon and select Embedded Metadata translator even if DOI or COinS get precedence, but this is not currently possible in Chrome/Safari/Opera.

We've been planning for a long time to merge all of these generic translators into a single embedded metadata translator that would combine these different types of metadata, which would solve the above issue. Unfortunately, I don't think this is going to be happening any time soon.
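To make the meta tag approach concrete, here is a minimal Python sketch that renders one item as HighWire-style and Dublin Core meta tags, including a PDF URL. The helper name `meta_tags`, the item dictionary layout, and all sample values are assumptions made for illustration. The tag names are the commonly used `citation_*` and `DC.*` ones; exactly which of them a given translator reads is not something this sketch guarantees.

```python
from html import escape


def meta_tags(item):
    """Render one item's metadata as <meta> tags (hypothetical helper)."""
    pairs = [
        ("citation_title", item["title"]),
        ("DC.title", item["title"]),
        ("citation_publication_date", item["date"]),
        ("citation_doi", item["doi"]),
        ("DC.identifier", "doi:" + item["doi"]),
        # PDF URL, intended for automatic attachment
        ("citation_pdf_url", item["pdf_url"]),
    ]
    # Authors are repeated: one tag per person in each vocabulary
    for author in item["authors"]:
        pairs.append(("citation_author", author))
        pairs.append(("DC.creator", author))
    return "\n".join(
        '<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(value, quote=True)
        )
        for name, value in pairs
    )


# Sample values are invented for the example
example_item = {
    "title": "Example dataset: birds & bees",
    "date": "2014/05/01",
    "doi": "10.1234/example",
    "pdf_url": "https://example.org/paper.pdf",
    "authors": ["Doe, Jane", "Roe, Richard"],
}

if __name__ == "__main__":
    print(meta_tags(example_item))
```

Because the tags describe the page as a whole, this only works for a single item per page, which matches the limitation mentioned above.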

As for COinS itself, the metadata is generally pretty poor, because the specification does not have a rich vocabulary, but the COinS translator will fetch additional metadata if there is a DOI (or ISBN, I believe) specified in the COinS. So you can get pretty good results in those cases.
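As an illustration of putting a DOI into COinS, the following Python sketch builds a COinS span whose `rft_id` carries the DOI as an `info:doi/` URI. The function name `coins_span` and the sample values are hypothetical, and the choice of the journal metadata format is an assumption; a dataset might be better described with a different format.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, doi, authors=()):
    """Build a COinS <span> carrying a DOI in rft_id (hypothetical helper)."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        # Assumption: describe the item with the journal format
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        # The DOI goes in as an info URI so it can be looked up later
        ("rft_id", "info:doi/" + doi),
        ("rft.atitle", title),
    ]
    fields += [("rft.au", author) for author in authors]
    # URL-encode the key/value pairs, then HTML-escape for the attribute
    context_object = escape(urlencode(fields), quote=True)
    return '<span class="Z3988" title="{}"></span>'.format(context_object)


if __name__ == "__main__":
    # Sample values are invented for the example
    print(coins_span("Example dataset", "10.1234/example", ["Doe, Jane"]))
```

The point of the sketch is only that the DOI is present in the context object, so that a consumer able to resolve DOIs has something to fetch richer metadata with.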

I think unAPI is a pretty nice option, since it allows you to use any of the file import formats that Zotero supports (e.g. BibTeX, RIS, Zotero RDF, etc.). Unfortunately, it's a little more involved to set up. (edit: I guess I shouldn't say "any format". The list can be seen here)
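To give a sense of what "a little more involved" means, here is a rough Python sketch of the three kinds of response an unAPI endpoint is expected to give, as I understand the unAPI specification: a format list with no parameters, a format list for one identifier, and the record itself when both identifier and format are given. The page would additionally need a `<link rel="unapi-server">` element pointing at the endpoint and an `<abbr class="unapi-id">` element carrying the identifier. The names `RECORDS`, `FORMATS`, `formats_xml`, and `unapi`, the sample record, and the format name `bibtex` are all assumptions for illustration; the status codes follow my reading of the spec and should be checked against it.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical in-memory store: identifier -> {format name -> serialized record}
RECORDS = {
    "doi:10.1234/example": {
        "bibtex": "@misc{example,\n  title = {Example dataset}\n}",
    },
}
# Format name -> MIME type advertised to clients (assumed values)
FORMATS = {"bibtex": "text/plain"}


def formats_xml(identifier=None):
    """Return the <formats> document, optionally scoped to one identifier."""
    id_attr = " id={}".format(quoteattr(identifier)) if identifier else ""
    entries = "".join(
        "<format name={} type={}/>".format(quoteattr(name), quoteattr(mime))
        for name, mime in FORMATS.items()
    )
    return '<?xml version="1.0" encoding="UTF-8"?><formats{}>{}</formats>'.format(
        id_attr, entries
    )


def unapi(identifier=None, fmt=None):
    """Return (status, content_type, body) for one unAPI request."""
    if identifier is None:
        # No parameters: list every format the server offers
        return 200, "application/xml", formats_xml()
    if identifier not in RECORDS:
        return 404, "text/plain", "unknown id"
    if fmt is None:
        # Identifier only: list the formats available for that record
        return 300, "application/xml", formats_xml(identifier)
    if fmt not in RECORDS[identifier]:
        return 406, "text/plain", "unsupported format"
    # Identifier and format: return the record itself
    return 200, FORMATS[fmt], RECORDS[identifier][fmt]


if __name__ == "__main__":
    print(unapi("doi:10.1234/example", "bibtex")[2])
```

The function is deliberately framework-free; wiring it to real HTTP query parameters (`id` and `format`) is left to whatever web stack the site already uses.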

mfenner commented 9 years ago

Excellent, thanks for the detailed summary. My example use case is the Dryad digital repository, datasets with DOIs, e.g. this one: http://datadryad.org/resource/doi:10.5061/dryad.781pv

Dryad has the concept of a data package that can contain multiple datasets, and they use COinS, Dublin Core, and DataCite DOIs. So there is a lot going on with regard to what you described above. I look forward to learning how best to approach this when writing a translator, and how to improve what we get now.

One outcome of the workshop could also be to give recommendations to web site providers, e.g. don't use COinS but Embedded Metadata, or make sure you embed the DOI in COinS (it seems that Zotero grabs the handle rather than the DOI from Dryad pages).

aurimasv commented 9 years ago

"You can even specify PDF URLs for automatic attachment (I don't believe that's possible with any other generic translator, but don't quote me on that)."

I was wrong (again) on this one. See https://forums.zotero.org/discussion/40556/openurlcoins-and-automatic-pdf-download/

adam3smith / webinar-translators
