id should be optional in Embedded Citation Object Format

simonster commented 13 years ago

Currently, the Embedded Citation Object Format schema suggests that an item ID is required. Since the item ID isn't used (items are identified by URIs), we should drop this property entirely or at least make it optional.

bdarcus commented 13 years ago

On Fri, Aug 5, 2011 at 1:55 AM, simonster reply@reply.github.com wrote:

> Currently, the Embedded Citation Object Format schema suggests that an item ID is required. Since the item ID isn't used (items are identified by URIs), we should drop this property entirely or at least make it optional.

"Isn't used" by what subject? Zotero?

Not sure how I feel about leaving this too flexible, where some implementations choose one way to link citation and reference data, and others do something different?

Why can't the value of the id key be, in fact, a URI?

rmzelle commented 13 years ago

http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#citation-data-object states that an "id" is required.

simonster commented 13 years ago

In https://github.com/citation-style-language/schema/blob/master/csl-citation.json, the uris array (which is optional) contains the data that's actually necessary to link citations with references. The item ID needs to be generated before passing the data into citeproc-js, but it's not of much use to store it in the document. (My implementation for Zotero ignores it, and it looks like Mendeley even overwrites it.) Item identifiers should go in the uris array, which allows multiple URIs per reference so that a citation can be linked to references in different implementations or for different users. If more than one implementation were to use the id property to link citations to references, links would be lost when sharing a document between these implementations. If an implementation decides to use the id for something, it's doing things wrong.

bdarcus commented 13 years ago

OK, this is a different schema, the one for embedding in the document. Steve R. mostly wrote this. I was more concerned with the case for pandoc and such.

I'm still uncomfortable with this approach of tying a citation to particular service data records. I suppose so long as the docs are truly portable (e.g. they don't rely on these services being available to work), that's the main point.

Steve, why do we have this id key at all?

bdarcus commented 13 years ago

Actually, maybe the id should be required, where it refers to the id for the local (in this case, embedded) data (what citeproc-js is using for processing, for example), while the uris provide ways to get copies of that data?

simonster commented 13 years ago

The citation is tied to particular service data records so that, if you change an item in your Zotero library, it changes the corresponding citation in the document. There's no other way this is going to happen.

As an implementer, I'm not sure what the use case for the id is. I can embed one, but neither Zotero nor Mendeley use it. (Mendeley appears to set it to "ITEM-1" regardless of the item, so it's not even unique.) It seems like it just takes up space.

bdarcus commented 13 years ago

On Fri, Aug 5, 2011 at 12:07 PM, simonster reply@reply.github.com wrote:

> The citation is tied to particular service data records so that, if you change an item in your Zotero library, it changes the corresponding citation in the document. There's no other way this is going to happen.

I think whatever solution we choose should support both this requirement and the requirement that documents be portable/self-contained. I don't think they need to be mutually exclusive.

But on this one, simple case (and yes, I know a lot doesn't correspond to this):

cite = { "id": "foo" }

data = { "id": "foo", "doi": "10.x.1298983498", "title": "..." }

This allows matching the citation to the data, which solves the portability issue.

... and then separately:

accounts = [ { "service": "zotero", "username": "jdoe" }, { "service": "mendeley", "username": "jdoe" }, { "service": "google-scholar" } ]

... allows you to update the data (via the ordered list of data providers).
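
The id-based matching in this simple case can be sketched in a few lines of Python. This is only an illustration of the idea, not part of any schema: `find_reference` and the sample `references` list are hypothetical names introduced here.

```python
# Hypothetical sample data mirroring the cite/data objects sketched above.
cite = {"id": "foo"}
references = [
    {"id": "foo", "doi": "10.x.1298983498", "title": "..."},
    {"id": "bar", "title": "..."},
]


def find_reference(cite, references):
    """Return the embedded reference whose id matches the citation, or None."""
    by_id = {ref["id"]: ref for ref in references}
    return by_id.get(cite["id"])


matched = find_reference(cite, references)
```

Because the lookup only touches data stored with the document, it works without any of the listed services being reachable.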

> As an implementer, I'm not sure what the use case for the id is. I can embed one, but neither Zotero nor Mendeley use it. (Mendeley appears to set it to "ITEM-1" regardless of the item, so it's not even unique.) It seems like it just takes up space.

Here's the use case I want Zotero and Mendeley to solve:

Two users collaborate on a document: one uses Zotero, and the other uses Mendeley. Neither has accounts for the other.

The details are up to you ;-)

Hopefully Steve can jump in here, but I think it may be a few days. If you both think the id key doesn't matter, then we should remove it I guess.

simonster commented 13 years ago

The use case you mention is the use case that the schema is intended to solve. I'll wait for Steve's response on the id property.

SteveRidout commented 13 years ago

I agree with Simon, the item ID isn't required, either inside the "itemData" object or within an element of the "citationItems" array and is just there in Mendeley's implementation because we re-used the code which generates the JSON for citeproc.

Mendeley doesn't even try to read this ID so I have no problem dropping this, but be aware that the current version of Mendeley will add them in when it refreshes.

bdarcus commented 13 years ago

So can I just clarify how you guys plan to associate reference data with citation if you're not using id-based linking?

Because it would seem that embedding it would be a bad idea (not very scalable), and that's the only option I can see ATM.

SteveRidout commented 13 years ago

In addition to the "id" field which citeproc expects, the schema contains a "uris" array which is where we store any number of URIs for users of Mendeley, Zotero, or other ref manager software.

The "id" field is just an unused artefact because I re-used our code to create objects for citeproc.

bdarcus commented 13 years ago

On Mon, Aug 8, 2011 at 11:10 AM, SteveRidout reply@reply.github.com wrote:

> In addition to the "id" field which citeproc expects, the schema contains a "uris" array which is where we store any number of URIs for users of Mendeley, Zotero, or other ref manager software.

A URI is just a global ID. You could just as easily do:

{
  "id": "http://example.org/1"
}

What do you do with the data such that the documents are self-contained? I thought you were both embedding the source data as CSL JSON somewhere?

SteveRidout commented 13 years ago

Yes, we are embedding the necessary document metadata as JSON.

The reason we use the "uris" array instead of a single "id" object as in your example is that multiple users may have different URIs for the same document. The way Mendeley works at the moment is:

1. It iterates through the URIs, and as soon as it finds one that's in the user's Mendeley library, it will overwrite the embedded JSON metadata with the metadata in their Mendeley library. (In future, we should probably prompt the user to resolve any conflicts between the embedded and library metadata at this point.)
2. If no URI was found (e.g. it was created by another Mendeley or Zotero user), it offers to import the document into the user's library.
   - If the user opts to import the document, Mendeley will check for possible duplicate documents which already exist in their library, prompting the user to confirm the duplicate if the match isn't exact.
   - If a duplicate is confirmed, it will add the URI of the existing document to the "uris" array.
   - If a duplicate is not confirmed, it will be imported as a new document with a new unique URI which will be added to the "uris" array.
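
The steps above can be sketched roughly as follows. This is a reading of the description in this comment, not Mendeley's actual code: the function name, the `library` dict, the `find_duplicate` callback, the returned status strings, and the `http://example.org/library/` prefix are all assumptions made for illustration.

```python
import uuid


def resolve_citation_item(item, library, find_duplicate, user_imports=True):
    """Sketch of the lookup steps described above.

    item:           one citationItems element with "uris" and "itemData"
    library:        dict mapping URI -> metadata in the current user's library
    find_duplicate: callable(item_data, library) -> existing URI or None
                    (stands in for duplicate detection plus user confirmation)
    """
    # Step 1: the first URI found in the user's library wins, and the
    # library metadata overwrites the embedded copy.
    for uri in item.get("uris", []):
        if uri in library:
            item["itemData"] = dict(library[uri])
            return "updated"

    # Step 2: no URI matched; the user may decline the import.
    if not user_imports:
        return "unchanged"

    existing = find_duplicate(item["itemData"], library)
    if existing is not None:
        # Confirmed duplicate: record the existing document's URI.
        item.setdefault("uris", []).append(existing)
        return "linked"

    # No duplicate: import as a new document under a new unique URI.
    new_uri = "http://example.org/library/" + uuid.uuid4().hex
    library[new_uri] = dict(item["itemData"])
    item.setdefault("uris", []).append(new_uri)
    return "imported"
```

Note that the item keeps every URI it has accumulated, which is what lets a later user of a different library still find a match.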

bdarcus commented 13 years ago

Thanks. So two followups:

1. How are the URIs matched to the item? Is there a URI key in the latter as well, or do you use the id key for that?
2. For the sake of argument, let's say Andrea wants to update citeproc-hs and pandoc to support this approach for ODT output. Is it straightforward for him to do that? In this case, keep in mind, people will be using local ids and databases for in-document linking; e.g. [@smith99].

SteveRidout commented 13 years ago

1. They are matched simply by both being members of the same citationItems array element.
2. Sorry, I'm not familiar with citeproc-hs or pandoc, but anyone else wanting to use this format would be able to by ensuring they create a globally unique URI for each document. An ID which is only locally unique should be augmented to ensure it's globally unique before adding it to the "uris" array.
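
One way to augment a local ID, sketched here under the assumption that each user or installation has some URI prefix of its own, is simply to append the escaped ID to that prefix. The `to_global_uri` helper and the `http://example.org/users/jdoe/items` prefix are hypothetical; the thread does not prescribe a URI layout.

```python
from urllib.parse import quote


def to_global_uri(local_id, uri_stub):
    """Prefix a locally unique ID with a per-user stub to form a URI."""
    # Percent-encode the ID so characters such as spaces or slashes
    # cannot change the structure of the resulting URI.
    return uri_stub.rstrip("/") + "/" + quote(local_id, safe="")


uris = [to_global_uri("smith99", "http://example.org/users/jdoe/items")]
```

The result is only as unique as the prefix, so the prefix has to be something no other user would pick.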

bdarcus commented 13 years ago

Sorry if I'm being dense, but on 1, then you're embedding the data; not referencing it?

E.g. if I have 50 references to the same source in a document (say a book manuscript), will the data be repeated 50 times, or included once?

SteveRidout commented 13 years ago

It will be repeated 50 times.

It may not seem elegant but this way a user can copy and paste a field from one document to another and be sure that all the necessary metadata is present.

The alternative would be to embed the metadata at a document level, e.g. in the document properties. This would be more complicated to implement, increasing likelihood of bugs, and very tricky to ensure the links are all valid if the user copies and pastes citations between documents.

bdarcus commented 13 years ago

OK, so final question (thanks for bearing with me): should we remove that id key, or just make it optional?

And how should we adapt the (currently non-existent?) documentation accordingly so it's clear?

We'll do whatever you and Simon agree on.

SteveRidout commented 13 years ago

It's probably best to completely remove it, which would mean removing the "id" object starting on line 25 of csl-citation.json and replacing the reference on line 33 ("$ref": "csl-data.json/#/items") with a copy of the referenced object without the "id" field.
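
The second half of that change (copying the referenced item definition without its "id" field) could be scripted roughly like this. The helper name `without_id` is hypothetical, and the sketch assumes the item definition is an object schema with a "properties" map; it handles "required" both as a per-property flag (dropped along with the property) and as a list of names.

```python
import copy


def without_id(item_schema):
    """Return a deep copy of an object schema with its "id" property removed."""
    schema = copy.deepcopy(item_schema)
    # Dropping the property also drops any per-property "required" flag.
    schema.get("properties", {}).pop("id", None)
    # If the schema lists required names separately, remove "id" there too.
    required = schema.get("required")
    if isinstance(required, list):
        schema["required"] = [name for name in required if name != "id"]
    return schema
```

The original schema object is left untouched, so the same source definition can still be used where an "id" is wanted.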

Only problem is that the output of the current Mendeley generated citations would be invalid unless two of the additional-properties : "false" conditions were removed.

rmzelle commented 13 years ago

The main reason for adding the additional-properties : "false" conditions was that it made it easier for me to test the JSON schema against the input JSON objects in the test suite. I guess they can just be stripped if the schema is used in production.
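
Stripping those conditions from a loaded schema could look something like the sketch below. The function name is hypothetical, and because this thread writes the key as `additional-properties` while JSON Schema spells the keyword `additionalProperties`, the sketch assumes either spelling may occur and accepts both the boolean and the quoted string form of "false".

```python
def strip_closed_objects(node, keys=("additionalProperties", "additional-properties")):
    """Recursively drop keys that forbid extra properties.

    Only entries whose value is false (or the string "false") are removed;
    any other setting under those keys is kept as-is.
    """
    if isinstance(node, dict):
        return {
            key: strip_closed_objects(value, keys)
            for key, value in node.items()
            if not (key in keys and value in (False, "false"))
        }
    if isinstance(node, list):
        return [strip_closed_objects(value, keys) for value in node]
    return node
```

This returns a new structure rather than editing in place, so the strict version remains available for running the test suite.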

bdarcus commented 13 years ago

I'm just curious about what Andrea has to say about this; if he would use either of these schemas, and if it makes any difference to him. I pinged him for comment.

arossato commented 13 years ago

I confess I'm a bit confused since I was not aware of the fact that the Embedded Citation Object Format had been proposed, so I'm quickly reviewing the relevant documentation right now.

From the pandoc/citeproc-hs perspective, we presently do not embed a citation object into a document, since the object is created when a pandoc document is processed. Keep in mind that a pandoc/markdown document should be readable without processing as a normal text file, and a citation consists of some text representing:

- an optional prefix;
- a required ID;
- an optional locator;
- an optional suffix.

Two examples: "See @Smith2011, chap. 18, for an example; see also @Brown2011"
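
As a rough illustration of the ID being the one required part, the keys can be pulled out of the example text with a regular expression. The pattern below is a simplification assumed for this sketch (letters, digits, and underscores after '@'); it is not pandoc's actual citation grammar, and `citation_keys` is a hypothetical helper.

```python
import re

# Simplified key pattern: '@' followed by letters, digits, or underscores.
KEY_PATTERN = re.compile(r"@([A-Za-z0-9_]+)")


def citation_keys(text):
    """Return the citation IDs found in a piece of markdown text, in order."""
    return KEY_PATTERN.findall(text)


text = "See @Smith2011, chap. 18, for an example; see also @Brown2011"
keys = citation_keys(text)  # ['Smith2011', 'Brown2011']
```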

Moreover, at the present time pandoc has no way of embedding references into a document (even though there are plans we have discussed to allow such a possibility), and so the ID is what is used to locate the reference in an external database (so far, only the database formats supported by bibutils or the JSON format used in the test-suite may be used).

Now, citeproc-hs might map the ID (in markdown the text following the '@') to the URIs used in the proposed object format, and possibly use it to retrieve the reference, but the idea of requiring URIs to cite a document in a text based format like pandoc/markdown doesn't seem to be an optimal solution to me.

In other words, I do not see any major obstacles in implementing this citation object format on my side, even though it is not clear how a pandoc user may benefit from it.

But, once again, these are just my preliminary thoughts on this subject.

bdarcus commented 13 years ago

Just to clarify, the main reason for this is to allow portable Word and ODF files, so that one can create in Zotero, edit in Mendeley, etc.

Pandoc is a bit of a different case in the sense that it's really a batch format. But at some point, I could imagine its ODT output writer being tweaked to produce this, so that the documents could be edited.

fbennett commented 13 years ago

I completely missed this thread until today (probably a good thing, too, since on first reading I misunderstood the exact frame of the discussion -- Bruce's final note above turned on the lights). Just to be sure I'm on the same page with everyone else, here's my understanding.

The processor APIs (both citeproc-js and citeproc-hs) do expect to be fed a local-unique itemID for citationItems. A citationID is also needed for processing, but the processor will assign a random unique value to identify a citation if it is not supplied. It is important that the citationID be stored in and delivered out of the target document when processing edits, because it is used by the processor to track the context of individual references. (Edit: The same is not true of the itemID.)

A local-unique itemID must be supplied to the processor for each referenced item, and the supplied ID must resolve into an input object when fed to the locally defined citeproc.retrieveItems() function. But there is no need for the local ID to be stored in the document; if an array of URIs from the document can be resolved into a local ID for submission to the processor, that is sufficient.
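
A sketch of that resolution step, under stated assumptions: `prepare_citation` and the `uri_to_local_id` lookup table are hypothetical names, not part of the citeproc-js or citeproc-hs APIs, and the stored citation is assumed to be a dict with a "citationItems" list whose elements carry a "uris" array.

```python
def prepare_citation(citation, uri_to_local_id):
    """Build processor input from a citation stored in the document.

    citation:        dict with "citationItems" (each with "uris") and an
                     optional "citationID"
    uri_to_local_id: dict mapping known URIs to local item IDs
    """
    prepared = {"citationItems": []}
    # Pass the stored citationID through; if it is absent, leave it out so
    # the processor can assign one itself.
    if "citationID" in citation:
        prepared["citationID"] = citation["citationID"]

    for item in citation["citationItems"]:
        # The first URI that the local system recognises determines the ID.
        local_id = next(
            (uri_to_local_id[u] for u in item.get("uris", []) if u in uri_to_local_id),
            None,
        )
        if local_id is None:
            raise LookupError("no known URI for item: %r" % (item.get("uris"),))
        entry = {k: v for k, v in item.items() if k != "uris"}
        entry["id"] = local_id
        prepared["citationItems"].append(entry)
    return prepared
```

The local ID exists only in the prepared input; nothing here writes it back into the document.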

For batch processing, an interesting path forward will be to embed metadata in the ODT document, as Bruce suggests. That could work in two ways. For completely standalone systems, the user could be required to set a unique URI stub for their personal data in a config file, with the stub used to build URIs embedded in the document. Exchange-compatible systems would then see the embedded references as "new", triggering the acquisition/mapping process Steve describes.

(Another case would be for batch processing systems that acquire data directly from an exchange-compatible system itself [a Zotero or Mendeley database]. For example, Erik Hetzner's zotero-plain [for reStructuredText] should be able to embed a proper URI from the user's Zotero database, and produce documents that are a one-for-one match with documents written by the same user via the OpenOffice Zotero plugin. Not sure what the demand will be for that, but it's an interesting possibility.)

citation-style-language / schema

id should be optional in Embedded Citation Object Format #70