Closed denismaier closed 4 years ago
Opinions @bwiernik @bdarcus @dhimmel
For the prefixes, can we refer to an enumerated list of the title variables defined elsewhere in the schema?
I think we should defer this.
Fine, but till when?
But concerning the tests/documentation: can I expect title-main, title-sub, etc. to be available on the input side somehow, whether via patterns or explicitly defined?
I think we assume they are available in styles via @form; so extracted from a full title.
Oh, and when: IDK.
Until we have some experience, from users and developers?
Yes, but @form is only in the rnc. I'm talking about the JSON schema. We will need a way to explicitly define the main form of a title variable because the extraction mechanism might produce wrong results.
title: One --- Two --- Three: a subtitle
Depending on the settings, a citeproc will (incorrectly) produce:
main: One
sub: Two --- Three: a subtitle
So, we'll need to supply the main form explicitly in addition to the full form:
title: One --- Two --- Three: a subtitle
title-main: One --- Two --- Three
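A minimal sketch of why the override matters, assuming a hypothetical processor that splits at the first occurrence of a configured delimiter (here " --- "). The function name, the delimiter setting, and the precedence rule are illustrative assumptions, not taken from any existing citeproc:

```python
def naive_split(title, delimiter=" --- "):
    """Split at the first delimiter; everything after it becomes the subtitle."""
    main, sep, sub = title.partition(delimiter)
    return {"main": main, "sub": sub if sep else None}


full = "One --- Two --- Three: a subtitle"
print(naive_split(full))
# {'main': 'One', 'sub': 'Two --- Three: a subtitle'}

# Assumption: an explicitly supplied main form takes precedence over the heuristic.
record = {"title": full, "title-main": "One --- Two --- Three"}
main = record.get("title-main") or naive_split(record["title"])["main"]
print(main)
# One --- Two --- Three
```

The heuristic alone yields the wrong main title; the explicit value is only consulted when present, so ordinary records need no extra data.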
I understand that.
But that's a hypothetical example. I'm saying I'd prefer to see what happens in the wild, before requiring all titles to be split-able, in the data, upfront.
It's just my impulse; if others feel strongly, we can consider those arguments.
It does feel somehow wrong to have four different title variants in the actual data.
To repeat history, we introduced the short variant, I am 99% certain, to handle main titles. Yes, it can be used for other purposes, but that was the primary idea, with it being flexible.
I understand that.
But that's a hypothetical example. I'm saying I'd prefer to see what happens in the wild, before requiring all titles to be split-able, in the data, upfront.
It's just my impulse; if others feel strongly, we can consider those arguments.
It does feel somehow wrong to have four different title variants in the actual data.
I understand your point. Having used biblatex before, I'd rather just have title-main and title-sub, but that would be a massive change, and probably not something that we would want to do.
I think for this feature to work reliably, we'd need to have this overriding mechanism. I think most users will not need to supply title-main and title-sub in the data. But for the other cases this should be possible.
To repeat history, we introduced the short variant, I am 99% certain, to handle main titles. Yes, it can be used for other purposes, but that was the primary idea, with it being flexible.
Yes, but as, e.g., @adam3smith has already pointed out, and I completely agree with him, title-short and title-main are not necessarily identical. They may be in some cases, but in a lot of cases they won't. Using title-short as an indicator for the main form is not a good idea.
In any case, users being able to supply title parts in some way was a basic assumption of what @bwiernik and I have worked out.
Yes, but as, e.g., @adam3smith has already pointed out, and I completely agree with him, title-short and title-main are not necessarily identical. They may be in some cases, but in a lot of cases they won't. Using title-short as an indicator for the main form is not a good idea.
This is another one of those cases, like the debate about label and citekey, where whether that's the case is almost irrelevant. We do have this legacy that was based on this logic, so we have to design with it in mind.
By far the most common example is this sort of pattern:
Some Title: With a Subtitle
For that, main title = short title.
So at minimum, we need to explain the difference in the documentation.
This is another one of those cases, like the debate about label and citekey, where whether that's the case is almost irrelevant. We do have this legacy that was based on this logic, so we have to design with it in mind.
I'm not so sure most users will have the main part of the title in title-short, regardless of what the original reason for adding it was, and I doubt most users were ever aware of the logic behind that. I think it's much more likely they'll have some shorter version of the main title there, because this is what style guides usually require. E.g. Chicago:
Grazer, Brian, and Charles Fishman. A Curious Mind: The Secret to a Bigger Life. New York: Simon & Schuster, 2015. => Curious Mind
Borel, Brooke. The Chicago Guide to Fact-Checking. Chicago: University of Chicago Press, 2016. => Fact-Checking
Keng, Shao-Hsun, Chun-Hung Lin, and Peter F. Orazem. “Expanding College Access in Taiwan, 1978–2014: Effects on Graduate Quality and Income Inequality.” Journal of Human Capital 11, no. 1 (Spring 2017): 1–34. https://doi.org/10.1086/690235 => Expanding College Access
Mead, Rebecca. “The Prophet of Dystopia.” New Yorker, April 17, 2017. => Dystopia
Rutz, Cynthia Lillian. “King Lear and Its Folktale Analogues.” PhD diss., University of Chicago, 2013. => King Lear
Of course, this needs to be documented accordingly.
The other option I was wondering about is whether it'd be feasible to add some sort of split instruction to the full title, as part of the sub-field formatting, akin to preserving case.
I'm not thrilled with the idea, but think it worth considering, given the need to override auto-splitting should be rare.
By far the most common example is this sort of pattern:
Some Title: With a Subtitle
For that, main title = short title.
So in that particular case, there's no need to change anything!
The other option I was wondering about is whether it'd be feasible to add some sort of split instruction to the full title, as part of the sub-field formatting, akin to preserving case.
In the proposal for this feature, we proposed to split multiple subtitles in title-sub with two vertical bars (||). @dstillman has also already suggested using some sort of markup to indicate split points. I really don't care much one way or the other, as long as there is some way to override the automatic splits. We could also just use || on the full title, if that is an easy solution. @bwiernik ?
We could also just use || on the full title, if that is an easy solution.
So in the common (perhaps 99% or more) case, titles stay the same, and in the other case, one could just do ...
title: Some Weird Title ... || With a Subtitle
...?
Hopefully, yes. The example above would be:
title: One --- Two --- Three:|| a subtitle
Yes.
That is also similar to citeproc-js's existing syntax for separating family and given names when names are entered as a key-value pair:
author: Jones || Davey
That is also similar to citeproc-js's existing syntax for separating family and given names when names are entered as a key-value pair:
Just that this would be used on the standard title field...
@bwiernik Was there a reason we did not consider this option in the first place?
So then details depend on the spec language.
We could have something like:
Processors must split titles according to the [insert rules], or on the || pattern.
So a processor would be splitting on whatever the default split characters are, and/or on what is defined in the locale and/or style, or on ||?
Does that mean a full title is not rendered directly, but is always reassembled from the split title?
Does that mean a full title is not rendered directly, but is always reassembled from the split title?
Current proposal here says (point 3):
Parsing by citeproc: If title-main and title-sub are not supplied in the data, the citeproc will derive them from title following these rules (based on existing citeproc-js behavior):
So, the answer to your question is yes. Citeprocs will always split titles into main and sub, and then reassemble them. We could add a new option for @title-split, "false" or similar, to disable that.
So a processor would be splitting on whatever the default split characters are, and/or on what is defined in the locale and/or style, or on ||?
Split characters are defined with @title-split.
Then a processor will just, for example, internally have some variable of characters to split on, and || overrides those?
So the splitting can be auto or manual, but not both?
And what did we decide about sub-sub titles? Is this valid to do internally?
>>> import re
>>> split_characters = re.compile(r'[?,:]')
>>> split_characters.split("One: Two? Three")
['One', ' Two', ' Three']
Then a processor will just, for example, internally have some variable of characters to split on, and || overrides those? So the splitting can be auto or manual, but not both?
@title-split defines the split points for automatic splitting. E.g. with title-split="simple", processors will split on ., :, ::, !, and ?. If this leads to incorrect results for some reason, you'd have to override the automatic behaviour with ||.
And yes, concerning sub-sub titles: that's mainly it.
"One: Two? Three" will be split into:
title[@form="main"]: One
title[@form="sub"]:
- Two?
- Three
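The result above keeps the question mark on the first subtitle but drops the colon. A rough sketch of that behaviour, assuming the colon is a pure separator while ? and ! stay attached to the preceding part. The pattern and the function name are illustrative assumptions, not the rules from the proposal:

```python
import re

# ':' plus following whitespace is consumed entirely;
# after '?' or '!' only the whitespace is consumed, so the mark stays on the part.
SPLIT_POINTS = re.compile(r":\s+|(?<=[?!])\s+")


def split_title(title):
    parts = SPLIT_POINTS.split(title)
    return {"main": parts[0], "sub": parts[1:]}


print(split_title("One: Two? Three"))
# {'main': 'One', 'sub': ['Two?', 'Three']}
```

A plain character-class split would discard the question mark, which is why a lookbehind is used here.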
So the split characters are either defined in the style OR ||; right?
In any case, this seems like the direction that's sensible. It would mean just a small change to the spec, and no change to the input schema.
Yeah, there was no reason to not do this in the first place. Just didn’t occur to me. This is better.
So the split characters are either defined in the style OR ||; right?
Yes, processors will need to check if there are explicit split points defined with ||, and if not, split using the split characters defined in the style.
So to be more precise, it would split on ||, and only do a second pass if that produces no array output.
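The two-pass logic could look roughly like this. It is a sketch only: the AUTO pattern is a stand-in for whatever @title-split resolves to, the function name is hypothetical, and whether trailing punctuation before || should be stripped is left open.

```python
import re

EXPLICIT = "||"
# Stand-in for the style-defined split characters (assumption).
AUTO = re.compile(r":\s+|(?<=[?!])\s+")


def split_title(title):
    if EXPLICIT in title:
        # First pass: explicit split points entered by the user.
        parts = [p.strip() for p in title.split(EXPLICIT)]
    else:
        # Second pass: runs only when no explicit split point was found.
        parts = AUTO.split(title)
    return parts[0], parts[1:]


print(split_title("One --- Two --- Three:|| a subtitle"))
# ('One --- Two --- Three:', ['a subtitle'])
print(split_title("Some Title: With a Subtitle"))
# ('Some Title', ['With a Subtitle'])
```

Because the automatic pattern is never applied once || is present, the colon in the first example stays part of the main title rather than triggering another split.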
So, summing this up: I will start drafting the documentation and the tests based on these assumptions:
- @form="sub" and @form="main" are available. The standard/long form of a title will be the reassembled title.
- Automatic splits can be overridden with ||.
I put this placeholder in this PR I just pushed, where we can include this.
Or should we add one pattern per title?
I like one pattern per line, so you can have title and description fields that apply to the family of variables, like container.
Just noting here that another possibility that people might wonder about:
Allowing titles to be an object (sub/main) or array.
So, per discussion in the rich text issue, expect the apps to create the pre-parsed data.
I'm not saying we should do this, but it does occur to me we'll need to provide for a longer review period for 1.1 in general, so people can consider all of these changes.
And you might try to prepare the PR in a way that it would be easier to change, if people would prefer that option.
Allowing titles to be an object (sub/main) or array.
We considered this already, but @dstillman was not really in favour. This makes things really complicated on that side. Also, how will users override the heuristics then?
So, per discussion in the rich text issue, expect the apps to create the pre-parsed data.
That would perhaps be an option with Zotero and pandoc, but I'm less optimistic about other apps. That's why we thought it best to implement this in the processor.
Probably the best approach for zotero et al is to do titles like they do names.
I think both default parsing behavior and providing a common syntax for users to override default parsing behavior are necessary in the processor for CSL to be at its best with typical bibliographic data in the wild.
In the end, the decision will come down to who's responsible for parsing: user, client app, intermediate tools, or CSL processor.
I also raise this now because I think it overlaps with the rich text discussion (#315).
I think those issues really aren't that related. "Parsing" involves many things, and not all of them have the same answer. I'm coming around to thinking that rich text markup might be something the processor needn't necessarily worry about (I'm doing some investigation into what journals provide there).
But parsing of titles is much more similar to testing is-numeric in my view. This is something where data exist in the wild.
So, for the many places where CSL is used outside of a person writing a manuscript, such as Cite This For Me or Open Science Framework, the only tool in the chain is the citation processor. Asking every potential little application adopting CSL to roll their own title parser, name parser, etc. seems like a huge barrier to entry.
It actually is connected, because titles would no longer just be strings. The processing thus would necessarily change.
But in the end, we really need feedback from developers, rather than to speculate. That's why I started the thread on Discourse, even if it's not that active these days.
Could we revise the issue description to include a concise list of requirements?
I think that would help us make final decisions.
For example:
Are those two correct?
And then what about the other wrinkle that is making this so difficult?
Is it that some styles require printing full titles without modifying the sub-component punctuation?
So in those styles, would one also need the 1 requirement above to access components?
Is this the only other requirement; so three?
Is it that some styles require printing full titles without modifying the sub-component punctuation?
Yes. Chicago modifies punctuation, APA and Vancouver do not. Both types are common.
So in those styles, would one also need the 1 requirement above to access components?
I don’t think we identified any style where separate formatting of main/sub AND keeping original punctuation were needed.
We had planned for the CSL style syntax to not permit that; separate formatting of main and sub is accomplished using a group with a specified delimiter.
The data model thus needs to provide:
Processors need to be able to:
The data model thus needs to provide:
- Individual title parts
- The original punctuation separating those parts
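As a sketch of what such a data model might look like, assuming a hypothetical SplitTitle structure (the class, field, and method names are invented for illustration and are not part of any schema proposal):

```python
from dataclasses import dataclass, field


@dataclass
class SplitTitle:
    parts: list                                      # [main, sub1, sub2, ...]
    separators: list = field(default_factory=list)   # original text between parts

    def original(self):
        """Rebuild the title with its original punctuation."""
        out = self.parts[0]
        for sep, part in zip(self.separators, self.parts[1:]):
            out += sep + part
        return out

    def with_delimiter(self, delimiter):
        """Rebuild the title with style-imposed punctuation."""
        return delimiter.join(self.parts)


t = SplitTitle(parts=["One", "Two", "Three"], separators=[": ", "? "])
print(t.original())            # One: Two? Three
print(t.with_delimiter(". "))  # One. Two. Three
```

Keeping the separators next to the parts lets one record serve both kinds of styles: those that keep the original punctuation and those that replace it.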
If the way I stated the third requirement is correct (and you confirmed it is), isn't it simpler than this?
Isn't it that the processor needs access to the full title, full stop?
If yes, then your second requirement is not needed; what is needed is for the full title property to be filled.
So some styles require title decomposition and recomposition, and some don't?
Why I'm asking this question.
I don’t understand what you are saying your “third requirement” is.
Another way to put the requirement:
The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.
To put these together with examples:
I don’t understand what you are saying your “third requirement” is.
Just to clarify this part, I meant this from above:
And then what about the other wrinkle that is making this so difficult? Is it that some styles require printing full titles without modifying the sub-component punctuation?
But your explanation here further clarifies.
The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.
OK, this is a key piece I was missing.
So among styles which do not specify decomposition, if we have a full title, some will specify to modify casing, and others will specify to leave it alone.
The problem this presents is with a full title, a processor won't have access to the sub-components, so it won't be able to modify the casing.
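To illustrate why casing needs the components, here is a sketch assuming a sentence-case rule that capitalizes the first word of the main title and of each subtitle. The helpers are hypothetical and deliberately naive (they ignore proper nouns, acronyms, and protected case):

```python
def sentence_case(part):
    """Lowercase everything, then capitalize the first character (naive)."""
    lowered = part.lower()
    return lowered[:1].upper() + lowered[1:]


def render(parts, delimiter=": "):
    # Each part is cased on its own, which requires knowing where parts begin.
    return delimiter.join(sentence_case(p) for p in parts)


# With components available, the subtitle's first word is capitalized.
print(render(["Some Title", "With a Subtitle"]))
# Some title: With a subtitle

# With only the full string, the subtitle boundary is invisible.
print(sentence_case("Some Title: With a Subtitle"))
# Some title: with a subtitle
```

The second call shows the failure mode: without the split, the processor would have to re-parse the full string to find the boundary.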
@denismaier, do you think you could modify the main post to reflect this as clearly as possible, for the record?
The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.
OK, this is a key piece I was missing.
Yeah, if citeprocs needed to compare "full" to "main" and "sub" to capitalize properly, they could just do the whole splitting operation on their own, which is what using objects here tries to avoid.
@denismaier, do you think you could modify the main post to reflect this as clearly as possible, for the record?
You mean the original PR?
I mean this thread; the top post.
This looks promising. This will work for title, right? What's the best way to add the prefix patterns? Does that work? Or should we add one pattern per title?
Readability is better here, but it is redundant, of course. Maybe something like this?
Originally posted by @denismaier in https://github.com/citation-style-language/schema/pull/271#issuecomment-649559085
It seems that we currently deal with sub/main forms in the rnc schema, but there's nothing on the input side. Shouldn't we add those? (I was about to start writing the documentation for the split-title feature and also to prepare some tests. But it looks like there are still some open questions...)
Edit: Currently, it looks like we'll support this by changing titles to objects.