citation-style-language / schema

Citation Style Language schema
https://citationstyles.org/
MIT License

Add sub/main forms to json schema #310

Closed denismaier closed 4 years ago

denismaier commented 4 years ago

On the other hand, there seems to be a preference for flat CSL JSON (ignoring the funky date-variables nesting). And some of the flat fields already exist like container-title-short and backcompat would be good.

If going with the flat structure, I think it's best to update the CSL JSON schema to include these properties. I don't think it makes sense to add -short variants where there is not a documented or theoretical need.

To enable the flexibility to add suffixes to CSL fields, we could look into JSON Schema pattern properties. This would allow us to keep the number of schema definitions from exploding:

{
  "type": "object",
  "patternProperties": {
    "^title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}

This looks promising. This will work for title, right? What's the best way to add the prefix patterns? Does that work?

{
  "type": "object",
  "patternProperties": {
    "^(container-|collection-)?title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}
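
As a quick check of the prefix question, here is a minimal sketch that runs the combined pattern against a few candidate keys. It uses Python's `re` as a stand-in for a JSON Schema validator, which should be safe here because the pattern uses only constructs that behave the same in ECMA-262 and Python. The candidate key list and the helper name `is_allowed_key` are made up for illustration.

```python
import re

# Pattern copied from the schema fragment above.
TITLE_PATTERN = re.compile(
    r"^(container-|collection-)?title(-long|-sub|-main|-short)?$"
)


def is_allowed_key(key: str) -> bool:
    """Return True if `key` would match the patternProperties entry."""
    return TITLE_PATTERN.search(key) is not None


# Hypothetical keys, chosen to probe prefixes and suffixes.
candidates = [
    "title",
    "title-short",
    "container-title-short",
    "collection-title-main",
    "volume-title",   # prefix not listed in the pattern
    "title-full",     # suffix not listed in the pattern
    "subtitle",       # anchored with ^, so this must not match
]

accepted = [k for k in candidates if is_allowed_key(k)]
rejected = [k for k in candidates if not is_allowed_key(k)]

print("accepted:", accepted)
print("rejected:", rejected)
```

So the optional prefix group does work. Note that with "additionalProperties": false and nothing else in the object, every key the pattern rejects would fail validation, so the real schema would still need its other properties declared alongside.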

Or should we add one pattern per title?

{
  "type": "object",
  "patternProperties": {
    "^title(-long|-sub|-main|-short)?$": { "type": "string" },
    "^container-title(-long|-sub|-main|-short)?$": { "type": "string" },
    "^collection-title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}

Readability is better here, but it is redundant, of course. Maybe something like this?

{
  "type": "object",
  "patternProperties": {
    "^(container-
        |collection-
        |volume-)?
        title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}
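
One caveat on the multi-line layout: a JSON string cannot contain literal line breaks, and as far as I know ECMA-262 regular expressions have no verbose mode that would ignore them, so that fragment would not load as written. A possible middle ground is to keep the enumerations in one place and generate one pattern per title variable. The sketch below assumes hypothetical lists `TITLE_VARIABLES` and `FORM_SUFFIXES` and a hypothetical helper `build_pattern_properties`; none of these exist in the schema repository.

```python
import json

# Hypothetical enumerations; in practice these would mirror whatever list of
# title variables and form suffixes the schema ends up defining.
TITLE_VARIABLES = ["title", "container-title", "collection-title", "volume-title"]
FORM_SUFFIXES = ["long", "sub", "main", "short"]


def build_pattern_properties(variables, suffixes):
    """Build one patternProperties entry per title variable."""
    suffix_group = "|".join("-" + suffix for suffix in suffixes)
    return {
        f"^{name}({suffix_group})?$": {
            "type": "string",
            "description": f"The {name} variable and its form variants.",
        }
        for name in variables
    }


schema = {
    "type": "object",
    "patternProperties": build_pattern_properties(TITLE_VARIABLES, FORM_SUFFIXES),
    "additionalProperties": False,
}

# Serialise to the JSON that would be committed to the schema file.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

This keeps the readable one-pattern-per-title output while avoiding hand-maintained duplication, and it leaves room for a description on each family of variables.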

Originally posted by @denismaier in https://github.com/citation-style-language/schema/pull/271#issuecomment-649559085

It seems that we currently deal with sub/main forms in the rnc schema, but there's nothing on the input side. Shouldn't we add those? (I was about to start writing the documentation for the split-title feature and also to prepare some tests. But it looks like there are still some open questions...)

Edit: Currently, it looks like we'll support this by changing titles to objects.

denismaier commented 4 years ago

Opinions @bwiernik @bdarcus @dhimmel

bwiernik commented 4 years ago

For the prefixes, can we refer to an enumerated list of the title variables defined elsewhere in the schema?

bdarcus commented 4 years ago

I think we should defer this.

denismaier commented 4 years ago

I think we should defer this.

Fine, but till when?

But concerning the tests/documentation: Can I expect title-main, title-sub, etc. to be available on the input side somehow, whether via patterns or explicitly defined?

bdarcus commented 4 years ago

I think we assume available in styles via @form; so extracted from a full title.

bdarcus commented 4 years ago

Oh, and when: IDK.

Until we have some experience, from users and developers?

denismaier commented 4 years ago

I think we assume available in styles via @form; so extracted from a full title.

Yes, but @form is only in the rnc. I'm talking about the json schema. We will need a way to explicitly define the main form of a title variable because the extraction mechanism might produce wrong results.

title: One --- Two --- Three:  a subtitle

Depending on the settings, a citeproc will (incorrectly) produce:

main: One
sub: Two --- Three:  a subtitle

So, we'll need to supply the main form explicitly in addition to the full form:

title: One --- Two --- Three:  a subtitle
title-main: One --- Two --- Three

bdarcus commented 4 years ago

I understand that.

But that's a hypothetical example. I'm saying I'd prefer to see what happens in the wild, before requiring all titles to be split-able, in the data, upfront.

It's just my impulse; if others feel strongly, we can consider those arguments.

It does feel somehow wrong to have four different title variants in the actual data.

To repeat history, we introduced the short variant, I am 99% certain, to handle main titles. Yes, it can be used for other purposes, but that was the primary idea, with it being flexible.

denismaier commented 4 years ago

I understand that.

But that's a hypothetical example. I'm saying I'd prefer to see what happens in the wild, before requiring all titles to be split-able, in the data, upfront.

It's just my impulse; if others feel strongly, we can consider those arguments.

It does feel somehow wrong to have four different title variants in the actual data.

I understand your point. Having used biblatex before, I'd rather just have title-main and title-sub, but that would be a massive change, and probably not something that we would want to do.

I think for this feature to work reliably, we'd need to have this overriding mechanism. I think most users will not need to supply title-main and title-sub in the data. But for the other cases this should be possible.

To repeat history, we introduced the short variant, I am 99% certain, to handle main titles. Yes, it can be used for other purposes, but that was the primary idea, with it being flexible.

Yes, but as, e.g. @adam3smith has already pointed out, and I completely agree with him, title-short and title-main are not necessarily identical. They may be in some cases, but in a lot of cases they won't. Using title-short as an indicator for the main form is not a good idea.

In any case, users being able to supply title parts in some way was a basic assumption of what @bwiernik and I have worked out.

bdarcus commented 4 years ago

Yes, but as, e.g. @adam3smith has already pointed out, and I completely agree with him, title-short and title-main are not necessarily identical. They may be in some cases, but in a lot of cases they won't. Using title-short as an indicator for the main form is not a good idea.

This is another one of those cases, like the debate about label and citekey, where whether that's the case is almost irrelevant. We do have this legacy that was based on this logic, so we have to design with it in mind.

By far the most common example is this sort of pattern:

Some Title: With a Subtitle

For that, main title = short title.

So at minimum, we need to explain the difference in documentation.

denismaier commented 4 years ago

This is another one of those cases, like the debate about label and citekey, where whether that's the case is almost irrelevant. We do have this legacy that was based on this logic, so we have to design with it in mind.

I'm not so sure most users will have the main part of the title in title-short, regardless of the original reason for adding it, and I doubt most users were ever aware of the logic behind that. I think it's much more likely they'll have some shorter version of the main title there, because this is what style guides usually require. E.g. Chicago:

Grazer, Brian, and Charles Fishman. A Curious Mind: The Secret to a Bigger Life. New York: Simon & Schuster, 2015. => Curious Mind

Borel, Brooke. The Chicago Guide to Fact-Checking. Chicago: University of Chicago Press, 2016. => Fact-Checking

Keng, Shao-Hsun, Chun-Hung Lin, and Peter F. Orazem. “Expanding College Access in Taiwan, 1978–2014: Effects on Graduate Quality and Income Inequality.” Journal of Human Capital 11, no. 1 (Spring 2017): 1–34. https://doi.org/10.1086/690235 => Expanding College Access

Mead, Rebecca. “The Prophet of Dystopia.” New Yorker, April 17, 2017. => Dystopia

Rutz, Cynthia Lillian. “King Lear and Its Folktale Analogues.” PhD diss., University of Chicago, 2013. => King Lear

Of course, this needs to be documented accordingly.

bdarcus commented 4 years ago

The other option I was wondering about is whether it'd be feasible to add some sort of split instruction to the full title, as part of the sub-field formatting, akin to preserving case.

I'm not thrilled with the idea, but think it worth considering, given the need to override auto-splitting should be rare.

denismaier commented 4 years ago

By far the most common example is this sort of pattern:

Some Title: With a Subtitle

For that, main title = short title.

So in that particular case, there's no need to change anything!

denismaier commented 4 years ago

The other option I was wondering about is whether it'd be feasible to add some sort of split instruction to the full title, as part of the sub-field formatting, akin to preserving case.

In the proposal for this feature, we proposed to split multiple subtitles in title-sub with two vertical bars ||. @dstillman has also already suggested using some sort of markup to indicate split points. I really don't care much one way or the other, as long as there is some way to override the automatic splits. We could also just use || on the full title, if that is an easy solution. @bwiernik ?

bdarcus commented 4 years ago

We could also just use || on the full title, if that is an easy solution.

So in the common (perhaps 99% or more) case, titles stay the same, and in the other case, one could just do ...

title: Some Weird Title ... || With a Subtitle

...?

denismaier commented 4 years ago

Hopefully, yes. The example above would be:

title: One --- Two --- Three:|| a subtitle

bwiernik commented 4 years ago

Yes.

That is also similar to citeproc-js's existing syntax for separating family and given names when names are entered as a key-value pair:

author: Jones || Davey

denismaier commented 4 years ago

That is also similar to citeproc-js existing syntax for separating family and given names when names are entered as a key value pair:

Just that this would be used on the standard title field...

@bwiernik Was there a reason we did not consider this option in the first place?

bdarcus commented 4 years ago

So then details depend on the spec language.

We could have something like:

Processors must split titles according to the [insert rules], or on the || pattern.

So a processor would be splitting on whatever default split characters, and/or what is defined in locale and/or style, or ||?

Does that mean a full title is not rendered directly, but is always reassembled from the split title?

denismaier commented 4 years ago

Does that mean a full title is not rendered directly, but is always reassembled from the split title?

Current proposal here says (point 3):

Parsing by citeproc: If title-main and title-sub are not supplied in the data, the citeproc will derive them from title following these rules (based on existing citeproc-js behavior):

So, the answer to your question is yes. Citeprocs will always split titles into main and sub, and then reassemble. We could add a new option for @title-split "false", or similar, to disable that.

denismaier commented 4 years ago

So a processor would be splitting on whatever default split characters, and/or what is defined in locale and/style, or ||?

Split characters are defined with @title-split.

bdarcus commented 4 years ago

Then a processor will just, for example, internally have some variable of characters to split on, and || overrides those?

So the splitting can be auto or manual, but not both?

And what did we decide about sub-sub titles? Is this valid to do internally?

>>> import re
>>> split_characters = re.compile(r'[?,:]')
>>> split_characters.split("One: Two? Three")
['One', ' Two', ' Three']

denismaier commented 4 years ago

Then a processor will just, for example, internally have some variable of characters to split on, and || overrides those?

So the splitting can be auto or manual, but not both?

@title-split defines the split-points for automatic splitting. E.g. with title-split="simple" processors will split on ., :, ::, !, ?. If this leads to incorrect results for some reason, you'd have to override the automatic behaviour with ||.

And yes, concerning sub-sub titles: that's mainly it.

"One: Two? Three" will be split into:

title[@form="main"]:  One
title[@form="sub"]: 
  - Two?
  - Three

bdarcus commented 4 years ago

So the split characters are either defined in the style OR ||; right?

In any case, this seems like the direction that's sensible. It would mean just a small change to the spec, and no change to the input schema.

bwiernik commented 4 years ago

Yeah, there was no reason to not do this in the first place. Just didn’t occur to me. This is better.

denismaier commented 4 years ago

So the split characters are either defined in the style OR ||; right?

Yes, processors will need to check if there are explicit split-points defined with ||, and if not, split using the split characters defined in the style.

bdarcus commented 4 years ago

So, to be more precise, it would split on ||, and only do a second pass if that produces no array output.
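
To make the two-pass idea concrete, here is a minimal sketch, not a spec proposal. It assumes that explicit || markers fully replace automatic splitting, that the automatic pass splits after one of the "simple" characters when followed by whitespace, and that a trailing colon is dropped while ? and ! stay with their part (which is what the "One: Two? Three" example above implies). The function name `split_title` and the constant names are hypothetical.

```python
import re

MANUAL_SPLIT = "||"
# Assumed stand-in for the characters selected by title-split="simple".
SIMPLE_SPLIT_CHARS = ".:!?"


def split_title(title, split_chars=SIMPLE_SPLIT_CHARS):
    """Return (main, subtitles); explicit || markers win over auto-splitting."""
    if MANUAL_SPLIT in title:
        # First pass: the data supplies its own split points.
        parts = title.split(MANUAL_SPLIT)
    else:
        # Second pass: split on whitespace that follows a split character,
        # leaving the punctuation attached to the preceding part.
        pattern = "(?<=[" + re.escape(split_chars) + r"])\s+"
        parts = re.split(pattern, title)

    # Assumption: a trailing colon is a pure delimiter and is dropped,
    # whereas "?" and "!" belong to the part and are kept.
    cleaned = [part.strip().rstrip(":").strip() for part in parts]
    cleaned = [part for part in cleaned if part]
    if not cleaned:
        return "", []
    return cleaned[0], cleaned[1:]


print(split_title("One: Two? Three"))
print(split_title("One --- Two --- Three:|| a subtitle"))
```

With these assumptions the first call yields main "One" with subtitles "Two?" and "Three", and the second keeps "One --- Two --- Three" together as the main title because the || marker suppresses the automatic pass.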

denismaier commented 4 years ago

So, summing this up: I will start drafting the documentation and the test based on these assumptions:

  1. In styles, we'll have @form="sub" and @form="main" available. The standard/long form of a title will be the reassembled title.
  2. On the input side, users can provide split-points explicitly with ||.

bdarcus commented 4 years ago

I put this placeholder in this PR I just pushed, where we can include this.

dhimmel commented 4 years ago

Or should we add one pattern per title?

I like one pattern per line, so you can have title and description fields that apply to the family of variables, like container.

bdarcus commented 4 years ago

Just noting here that another possibility that people might wonder about:

Allowing titles to be an object (sub/main) or array.

So, per discussion in the rich text issue, expect the apps to create the pre-parsed data.

I'm not saying we should do this, but it does occur to me we'll need to provide for a longer review period for 1.1 in general, so people can consider all of these changes.

And you might try to prepare the PR in a way that it would be easier to change, if people would prefer that option.

denismaier commented 4 years ago

Allowing titles to be an object (sub/main) or array.

We considered this already, but @dstillman was not really in favour. This makes things really complicated on that side. Also, how will users override the heuristics then?

denismaier commented 4 years ago

So, per discussion in the rich text issue, expect the apps to create the pre-parsed data.

That would perhaps be an option with Zotero and pandoc, but I'm less optimistic with other apps. That's why we thought it best to implement this in the processor.

bdarcus commented 4 years ago

Allowing titles to be an object (sub/main) or array.

We considered this already, but @dstillman was not really in favour. This makes things really complicated on that side. Also, how will users override the heuristics then?

Probably the best approach for zotero et al is to do titles like they do names.

bwiernik commented 4 years ago

I think both default parsing behavior and providing a common syntax for users to override default parsing behavior are necessary in the processor for CSL to be at its best with typical bibliographic data in the wild.

bdarcus commented 4 years ago

In the end, the decision will come down to who's responsible for parsing: user, client app, intermediate tools, or CSL processor.

I also raise this now because I think it overlaps with the rich text discussion (#315).

bwiernik commented 4 years ago

I think those issues really aren't that related. "Parsing" involves many things, and not all of them have the same answer. I'm coming around to thinking that rich text markup might be something the processor needn't necessarily worry about (I'm doing some investigation into what journals provide there).

But parsing of titles is much more similar to testing is-numeric in my view. This is something where data exist in the wild.

So, for the many places where CSL is used outside of a person writing a manuscript, such as Cite This For Me or Open Science Framework, the only tool in the chain is the citation processor. Asking every potential little application adopting CSL to roll their own title parser, name parser, etc. seems like a huge barrier to entry.

bdarcus commented 4 years ago

It actually is connected, because titles would no longer just be strings. The processing thus would necessarily change.

But in the end, we really need feedback from developers, rather than to speculate. Why I started the thread on discourse, even if it's not that active these days.

bdarcus commented 4 years ago

Could we revise the issue description to include a concise list of requirements?

I think that would help us make final decisions.

For example:

  1. sub-components of titles need to be accessed in styles
  2. delimiters among these sub-components need to be configurable in styles and locales, for full title rendering

Are those two correct?

And then what about the other wrinkle that is making this so difficult?

Is it that some styles require printing full titles without modifying the sub-component punctuation?

So in those styles, would one also need the 1 requirement above to access components?

Is this the only other requirement; so three?

bwiernik commented 4 years ago

Is it that some styles require printing full titles without modifying the sub-component punctuation?

Yes. Chicago modifies punctuation, APA and Vancouver do not. Both types are common.

So in those styles, would one also need the 1 requirement above to access components?

I don’t think we identified any style where separate formatting of main/sub AND keeping original punctuation were needed.

We had planned for the CSL style syntax to not permit that—separate formatting of main and sub is accomplished using a group with a specified delimiter.

The data model thus needs to provide:

  1. Individual title parts
  2. The original punctuation separating those parts

Processors need to be able to:

  1. Concatenate title parts together for form="long"
  2. Replace existing delimiter punctuation with normalized punctuation
    • perhaps add delimiters if none exist?

bdarcus commented 4 years ago

The data model thus needs to provide:

  1. Individual title parts
  2. The original punctuation separating those parts

If the way I stated the third requirement is correct (and you confirmed it is), isn't it more simple than this?

Isn't it that the processor needs access to the full title, full stop?

If yes, then your second requirement is not needed; what is needed is for the full title property to be filled.

So some styles require title decomposition and recomposition, and some don't?

Why I'm asking this question.

bwiernik commented 4 years ago

I don’t understand what you are saying your “third requirement” is.

Another way to put the requirement:

  1. Render the full title, with original punctuation
    • Main casing options:
      • No change
      • Title case
      • Uppercase main title first
      • Uppercase main title first and subtitle(s) first
  2. Render the full title, with normalized punctuation
    • Main casing options:
      • No change
      • Title case
      • Uppercase main title first
      • Uppercase main title first and subtitle(s) first
  3. Render a title part separately

The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.
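
A rough sketch of why pre-split parts make the casing options easy: if the data carries the parts plus the original delimiters, each rendering mode in the list above is a simple loop, with no text comparison against a separate "full" string. The `SplitTitle` class, the `render_full` function, and the casing option names are all hypothetical, and title case is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SplitTitle:
    """Hypothetical container: title parts plus the original delimiters."""
    parts: List[str]       # main title first, then subtitle(s); must be non-empty
    delimiters: List[str]  # original punctuation between parts


def _upper_first(text: str) -> str:
    """Uppercase only the first character, leaving the rest untouched."""
    return text[:1].upper() + text[1:]


def render_full(title: SplitTitle, casing: str = "none",
                normalized: Optional[str] = None) -> str:
    """Reassemble the full title.

    casing: "none", "main-first", or "main-and-sub-first".
    normalized: if given, replaces every original delimiter.
    """
    parts = list(title.parts)
    if casing in ("main-first", "main-and-sub-first"):
        parts[0] = _upper_first(parts[0])
    if casing == "main-and-sub-first":
        parts[1:] = [_upper_first(part) for part in parts[1:]]

    rendered = parts[0]
    for delimiter, part in zip(title.delimiters, parts[1:]):
        rendered += (delimiter if normalized is None else normalized) + part
    return rendered


example = SplitTitle(
    parts=["a curious mind", "the secret to a bigger life"],
    delimiters=[": "],
)

original = render_full(example)                        # requirement 1, no change
main_cased = render_full(example, casing="main-first")
normalized = render_full(example, casing="main-and-sub-first", normalized=". ")
subtitle_only = example.parts[1]                       # requirement 3
```

Here requirement 1 keeps the stored delimiters, requirement 2 swaps in a normalized one, and requirement 3 is just indexing into the parts.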

bwiernik commented 4 years ago

To put these together with examples:

  1. Vancouver style: no decomposition, no subtitle capitalization
  2. APA style: no decomposition, subtitle capitalization
  3. Chicago style: decomposition, subtitle capitalization
  4. ABNT style: decomposition, separate main and sub text formatting

bdarcus commented 4 years ago

I don’t understand what you are saying your “third requirement” is.

Just to clarify this part, I meant this from above:

And then what about the other wrinkle that is making this so difficult? Is it that some styles require printing full titles without modifying the sub-component punctuation?

But your explanation here further clarifies.

bdarcus commented 4 years ago

The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.

OK, this is a key piece I was missing.

So among styles which do not specify decomposition, if we have a full title, some will specify to modify casing, and others will specify to leave it alone.

The problem this presents is with a full title, a processor won't have access to the sub-components, so it won't be able to modify the casing.

@denismaier, do you think you could modify the main post to reflect this as clearly as possible, for the record?

denismaier commented 4 years ago

The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.

OK, this is a key piece I was missing.

Yeah, if citeprocs needed to compare "full" to "main" and "sub" to capitalize properly they just could do the whole splitting operation on their own, which is what using objects here tries to avoid.

denismaier commented 4 years ago

@denismaier, do you think you could modify the main post to reflect this as clearly as possible, for the record?

You mean the original PR?

bdarcus commented 4 years ago

I mean this thread; the top post.