citation-style-language / schema

Citation Style Language schema
https://citationstyles.org/
MIT License

Better JSON standardization #130

Open dsifford opened 8 years ago

dsifford commented 8 years ago

Hello,

Just digging into the schema and it seems like there isn't much standardization when it comes to property naming conventions and even some property values in the JSON schema.

To give some examples:

Naming conventions: There seems to be a mixture of camelCase and kebab-case in property names.

shortTitle vs collection-editor (okay, perhaps this only occurs once)

Property value conventions: The following properties accept both type string and type number:

id
issue
number
number-of-volumes
volume

The following properties only accept type string

call-number
chapter-number
citation-number
collection-number
number-of-pages
page
page-first
version

Just a friendly drop in the suggestion box for a better standardization on the next schema update 👍

rmzelle commented 8 years ago

Yeah, good call.

For shortTitle, this is a duplicate of https://github.com/citation-style-language/schema/issues/113.

Regarding the property value conventions, the latter properties should probably be relaxed to accept both "string" and "number".
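Until the schema is relaxed, a consumer that receives numbers in the string-only fields could coerce them before validating. A minimal sketch of that workaround: the field list is the one from the first comment, while the helper name `stringifyNumberLikeFields` and the coercion policy are assumptions made for illustration, not part of the schema or of any CSL tool.

```javascript
// Fields that the issue reports as accepting only strings.
const NUMBER_LIKE_FIELDS = [
  "call-number",
  "chapter-number",
  "citation-number",
  "collection-number",
  "number-of-pages",
  "page",
  "page-first",
  "version",
];

// Hypothetical helper: returns a shallow copy of the item in which any
// numeric value in the fields above is converted to a string.
function stringifyNumberLikeFields(item) {
  const result = { ...item };
  for (const field of NUMBER_LIKE_FIELDS) {
    if (typeof result[field] === "number") {
      result[field] = String(result[field]);
    }
  }
  return result;
}
```

Fields such as `id` or `volume`, which already accept both types, are left untouched.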

dsifford commented 8 years ago

Agreed!

If the next schema could go entirely camelCase, I'd be a happy camper. It gets tedious to have to wrap everything in brackets (e.g., citation['chapter-number'] vs. citation.chapterNumber).

(That second request is me dreaming; I can't even imagine the work it would likely take to go in and convert all those properties, especially with production apps using them already)
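In the meantime, the bracket access can be avoided on the consumer side by converting keys after loading the data. A rough sketch, assuming plain objects with kebab-case keys; `kebabToCamel` and `camelizeKeys` are hypothetical helper names, and the conversion is shallow only (nested objects such as date variables keep their original keys).

```javascript
// Convert one kebab-case key to camelCase, e.g. "chapter-number" -> "chapterNumber".
function kebabToCamel(key) {
  return key.replace(/-([a-z])/g, (_, letter) => letter.toUpperCase());
}

// Return a shallow copy of the item with every top-level key camelized.
// Keys that are already camelCase (such as "shortTitle") pass through unchanged.
function camelizeKeys(item) {
  return Object.fromEntries(
    Object.entries(item).map(([key, value]) => [kebabToCamel(key), value])
  );
}
```

With this, `camelizeKeys(citation).chapterNumber` reads the value that would otherwise need `citation['chapter-number']`.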

In any case, thanks for the response!

rmzelle commented 8 years ago

Yeah, since "shortTitle" is the odd one out, I doubt we'll ever switch everything to camelcase, although I see the benefit from a programming point of view.

dsifford commented 8 years ago

@rmzelle While I have your attention, I'm having a really hard time understanding the date-variable part of the JSON spec.

{
  "id": "date-variable",
  "type": [
    {
      "properties": {
        "date-parts": {
          "type": "array",
          "items": {
            "type": "array",
            "items": {
              "type": [
                "string",
                "number"
              ]
            },
            "maxItems": 3
          },
          "maxItems": 2
        },
        "season": {
          "type": [
            "string",
            "number"
          ]
        },
        "circa": {
          "type": [
            "string",
            "number",
            "boolean"
          ]
        },
        "literal": {
          "type": "string"
        },
        "raw": {
          "type": "string"
        }
      },
      "additionalProperties": false
    },
    {
      "properties": {
        "literal": {
          "type": "string"
        }
      },
      "additionalProperties": false
    }
  ]
}

Is literal a second object? Is date-variable an array containing both date-parts and literal?

Can you clarify?

dsifford commented 8 years ago

Actually, I'm noticing now that literal is listed twice: once in its own object, and once just under circa in the first object.
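One possible reading of the snippet, assuming the `type` array lists alternative object shapes (a union, as JSON Schema draft-03 allows), is that a date-variable is a single object that must fit at least one of the two shapes. The sketch below checks only the property names against `additionalProperties: false`, not the value types; the function names are made up for illustration.

```javascript
// Property names allowed by each alternative in the "type" array.
const FULL_SHAPE_KEYS = ["date-parts", "season", "circa", "literal", "raw"];
const LITERAL_ONLY_KEYS = ["literal"];

// True when the object has no property outside the allowed list.
function usesOnlyKeys(value, allowedKeys) {
  return Object.keys(value).every((key) => allowedKeys.includes(key));
}

// Report which of the two alternatives a candidate object fits
// (assumption: union semantics, so fitting one is enough).
function matchingShapes(value) {
  const shapes = [];
  if (usesOnlyKeys(value, FULL_SHAPE_KEYS)) shapes.push("full");
  if (usesOnlyKeys(value, LITERAL_ONLY_KEYS)) shapes.push("literal-only");
  return shapes;
}
```

Under this reading, an object containing only `literal` fits both alternatives, which is consistent with the observation that `literal` is listed twice.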

rmzelle commented 8 years ago

Uhm, I'm not 100% sure. I created the schema based on the JSON input snippets in the citeproc-js test suite, and this schema is therefore more descriptive than normative. It's also been a while since I last looked at this.

@fbennett, do you have an answer? In general, the "literal" property can be used to provide an unparsed date string (see https://web.archive.org/web/20150911220234/http://gsl-nagoya-u.net/http/pub/citeproc-doc.html#dates ; @fbennett, is this online elsewhere?)

dsifford commented 8 years ago

No problem at all. I appreciate the help! Thanks for reaching out to the others 😄

njbart commented 8 years ago

According to the citeproc-js specs, date-parts may contain date ranges (not sure whether that is explicit in the schema).

However, it seems season and circa can only be set once per date-variable (that’s certainly what the citeproc-js specs suggest), not – as they should – for either or both of the start and the end date of the range.
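To make the limitation concrete, here is a small sketch of how a range might be unpacked, assuming the outer `date-parts` array holds a start date followed by an optional end date (which is how the `maxItems: 2` constraint is usually read). The helper name `splitDateParts` is hypothetical.

```javascript
// Turn one inner date-parts array ([year, month, day]) into a named object.
function toNamedParts(parts) {
  if (parts === undefined) return undefined;
  const [year, month, day] = parts;
  return { year, month, day };
}

// Hypothetical helper: unpack a date variable into start and end dates.
// Note that season and circa are copied once for the whole variable,
// because the schema gives no way to attach them to one endpoint only.
function splitDateParts(dateVariable) {
  const [start, end] = dateVariable["date-parts"];
  return {
    start: toNamedParts(start),
    end: toNamedParts(end),
    season: dateVariable.season,
    circa: dateVariable.circa,
  };
}
```

For `{ "date-parts": [[1984], [2004, 6]], "circa": true }`, the result carries a single `circa` flag, so it cannot say whether 1984, June 2004, or both are approximate.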

Also there seems to be some confusion around the terms “approximate”/“circa” and “uncertain” (see, e.g., http://docs.citationstyles.org/en/stable/specification.html#approximate-dates: “Approximate dates test “true” for the is-uncertain-date conditional …”).

I would argue both should be clearly distinguished, with one flag for “approximate”/“circa”, and another one for “uncertain”, along the lines of EDTF.

Again, one should be able to set either of these separately for start and end dates in a range (as for example in EDTF: “1984~/2004-06, interval beginning approximately 1984 and ending June 2004; 1984/2004-06~, interval beginning 1984 and ending approximately June 2004; 1984?/2004?~, interval whose beginning is uncertain but thought to be 1984, and whose end is uncertain and approximate but thought to be 2004”).
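The per-endpoint flags in those EDTF examples can be sketched as follows. This is only an illustration that handles the trailing `?` and `~` markers quoted above, not a full EDTF parser; `parseEndpoint` and `parseIntervalFlags` are hypothetical names.

```javascript
// Split one endpoint such as "2004?~" into its date text and its flags.
function parseEndpoint(text) {
  const [, date, flags] = /^(.*?)([?~]*)$/.exec(text);
  return {
    date,
    uncertain: flags.includes("?"),
    approximate: flags.includes("~"),
  };
}

// Parse "start/end" (or a single date) and keep the flags separate
// for each endpoint, which is what the current schema cannot express.
function parseIntervalFlags(edtf) {
  const [start, end] = edtf.split("/");
  return {
    start: parseEndpoint(start),
    end: end === undefined ? undefined : parseEndpoint(end),
  };
}
```

For `1984~/2004-06` only the start is marked approximate, and for `1984?/2004?~` the end is both uncertain and approximate while the start is only uncertain.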

fbennett commented 8 years ago

What are the prospects for migrating all citeproc-js-specific engines to use EDTF as input format?

rmzelle commented 8 years ago

What are the prospects for migrating all citeproc-js-specific engines to use EDTF as input format?

Is that a rhetorical question?

@nickbart1980, maybe we should start with checking what needs to be done to fully represent EDTF (at least levels 0 and 1) in JSON? I assume other extensions are needed as well.

fbennett commented 8 years ago

Is that a rhetorical question?

Waking typo + autocorrect. Should have been:

What are the prospects for migrating all citeproc engines to use EDTF as input format?

fbennett commented 8 years ago

@nickbart1980, maybe we should start with checking what needs to be done to fully represent EDTF (at least levels 0 and 1) in JSON? I assume other extensions are needed as well.

It would be good to pin down how the date values expressed by the various EDTF constructs themselves should be handled in CSL itself (how to use ranges, what to do with seasons, what tests for uncertain/unknown should be available and how they should work etc). Then tests could be built.

njbart commented 8 years ago

What are the prospects for migrating all citeproc engines to use EDTF as input format?

For pandoc-citeproc I’d guess the prospects are good: biblatex, one of its popular input formats, has just migrated to EDTF (in v3.5), so pandoc-citeproc will at least have to start accepting EDTF constructs in its input anyway.

… how the date values expressed by the various EDTF constructs themselves should be handled in CSL itself …

EDTF (levels 0 and 1): citeprocs would have to parse each individual date (separately for start and end dates of a range) into at least:

year, month, day, time, time zone, season, uncertain, circa=approximate

EDTF “5.2.2 Unspecified” is a bit tricky. We might want to render 199u as “1990s”, and 19uu as “1900s”, but I’m not sure how to best represent this internally. (It’s not clear how to render 190u either – “1900s” is sort of reserved for the century, and “1900s, 1st decade” is somewhat unwieldy.) Other EDTF constructs could be truncated without much loss of information: 1999-uu, defined as “some month in 1999” could simply be mapped to “1999”.
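A rough sketch of the mappings suggested above, covering only the three patterns mentioned (`199u`, `19uu`, `1999-uu`). The function name and the `kind` field are assumptions; the `kind` is one possible way to keep a decade and a century apart internally even when their rendered text collides, as it does for `190u` and `19uu`.

```javascript
// Hypothetical renderer for the "unspecified" patterns discussed above.
// Anything it does not recognize is returned unchanged with kind "other".
function renderUnspecified(edtf) {
  let match;
  if ((match = /^(\d{2})uu$/.exec(edtf))) {
    return { kind: "century", text: `${match[1]}00s` };
  }
  if ((match = /^(\d{3})u$/.exec(edtf))) {
    return { kind: "decade", text: `${match[1]}0s` };
  }
  if ((match = /^(\d{4})-uu$/.exec(edtf))) {
    // "Some month in 1999" is truncated to the year alone.
    return { kind: "year", text: match[1] };
  }
  return { kind: "other", text: edtf };
}
```

Here `190u` and `19uu` both produce the text "1900s" and differ only in `kind`, so the rendering question raised above stays open.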

CSL also needs a way to represent “unknown” or “open”. Representing this by “0” will not work well since EDTF’s year numbering includes the year “0”.

For representing year “0” and negative years, I would argue that CSL should use EDTF’s astronomical numbering of years (where year “0” is “1 BC/BCE”, year “-1” is “2 BC/BCE”, year “-99” is “100 BC/BCE” etc.). – For rendering, CSL could provide functions such as “render-in-edtf-format” (leaving the year unchanged), “render-in-ce-bce-format” (for years < 1, remove minus sign, add 1 to the year, and attach “BC/BCE” label), and possibly even (since EDTF uses the Gregorian calendar throughout but users might want a Julian date) “render-in-ce-bce-format-julian”.
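The “render-in-ce-bce-format” idea amounts to a one-line conversion. A minimal sketch, assuming the year is stored as an integer in astronomical numbering; the function name `renderCeBce` and the choice of the "BCE" label (rather than "BC") are illustrative, and years of 1 or later are returned without a label since the proposal does not specify one.

```javascript
// Render an astronomically numbered year for display.
// For years < 1: drop the minus sign and add 1, i.e. compute 1 - year,
// then attach the era label (0 -> "1 BCE", -1 -> "2 BCE").
function renderCeBce(astronomicalYear) {
  if (astronomicalYear < 1) {
    return `${1 - astronomicalYear} BCE`;
  }
  return String(astronomicalYear);
}
```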

Finally, it might be worth checking out existing EDTF parsers, such as https://github.com/inukshuk/edtf-ruby or https://www.npmjs.com/package/edtf.

fbennett commented 8 years ago

Yeah. EDTF was in the drafting stages around the time citeproc-js was released. There was pressure to make it the required input format for dates, but there wasn't a JS parser, I wasn't up to the task, and we ended up running with what I had cobbled together. Sylvester recently released a JS EDTF parser, though, and that changes things (hence my q above).

https://github.com/inukshuk/edtf.js

njbart commented 8 years ago

Concerning edtf.js, there’s only one thing I’m deeply worried about: “the values array contains the individual date parts in a format compatible with JavaScript's Date semantics (months are a zero-based index)” (https://github.com/inukshuk/edtf.js).

For example, edtf.js represents 2015-02-15 as { type: 'Date', level: 0, values: [ 2015, 1, 15 ] }.

I know that’s an old JavaScript convention, but for our purposes this strikes me as so counterintuitive and error-prone that I wonder whether we could convince the author to change it.
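If the convention stays, every consumer would need a conversion step along these lines. A sketch that works only on the `values` array as quoted above and does not call edtf.js itself; `valuesToDateParts` is a hypothetical name, and the output is meant to resemble one inner `date-parts` array with one-based months.

```javascript
// Convert [year, zeroBasedMonth, day] into [year, oneBasedMonth, day].
// Missing trailing parts (month or day) are simply left out.
function valuesToDateParts(values) {
  const [year, month, day] = values;
  const parts = [year];
  if (month !== undefined) parts.push(month + 1);
  if (day !== undefined) parts.push(day);
  return parts;
}
```

Forgetting this step anywhere would silently shift every date by one month, which is the kind of error the comment above is worried about.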

bdarcus commented 8 years ago

Agree; that's just bad design.

fbennett commented 8 years ago

I get the reservation, but how would that affect a decision on whether to adopt EDTF as preferred input format in CSL JSON?

njbart commented 8 years ago

Would that affect a decision on whether to adopt EDTF? Not at all.

It’s only the handling of months in edtf.js that I find worrying, and my question to the JS experts listening here is: would it make sense to try to get this fixed in edtf.js now, before starting any work on citeproc-js and possibly others?