citation-style-language / schema

Citation Style Language schema
https://citationstyles.org/
MIT License

"Add" CSL YAML #278

Open bdarcus opened 4 years ago

bdarcus commented 4 years ago

This and this suggests we can use our JSON schemas to validate a YAML alternative.

Pandoc already supports a YAML alternative (cc @jgm).

I suggest we do something with this, since it's zero work for us, and would give more options for users and developers.

Perhaps the most sensible option is just adding a sentence to the spec that mentions this possibility, without requiring implementations to support it?

Edit: actually, we say nothing about input in the spec currently. So we would need to add a section on input data, and say that our schema can validate either json or yaml.

Proposal

Based on this discussion, we should:

  1. review the existing json schema for any possible adjustments we might want to make now. Date parts are one obvious discrepancy; are there any others?
  2. add an optional field for the content markup format to parse; the html subset could be default, and we could enumerate other options; say markup: org.

Originally posted by @bdarcus in https://github.com/citation-style-language/schema/issues/277#issuecomment-650192158

denismaier commented 4 years ago

Is pandoc's CSL YAML actually identical to CSL JSON? I'm not sure, but don't they differ in at least some aspects? E.g., regarding dates, CSL JSON has this construct:

"issued":{"date-parts":[[2015,4,2]]},

Whereas the YAML:

  issued:
    - year: 2015
      month: 4
      day: 2

(I think I've tried to use the JSON schema for autocompletion and validation once, but I wasn't so lucky. Using Atom or VS Code as plain text reference managers could be very nice for some projects...)
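To make the difference concrete, here is a minimal Python sketch of converting between the two date shapes shown above. The function names are hypothetical, and the YAML shape is assumed to be exactly the list of year/month/day mappings from the example, already parsed into Python objects. Fields such as circa, season, and date ranges are not handled.

```python
DATE_KEYS = ("year", "month", "day")


def date_parts_to_fields(date):
    """Turn {"date-parts": [[2015, 4, 2]]} into [{"year": 2015, "month": 4, "day": 2}]."""
    # zip() stops at the shorter input, so [[2015]] yields only a "year" key.
    return [dict(zip(DATE_KEYS, parts)) for parts in date.get("date-parts", [])]


def fields_to_date_parts(entries):
    """Turn [{"year": 2015, "month": 4, "day": 2}] into {"date-parts": [[2015, 4, 2]]}."""
    rows = []
    for entry in entries:
        row = []
        for key in DATE_KEYS:
            if key not in entry:
                break  # a day without a month (or a month without a year) is dropped
            row.append(int(entry[key]))
        rows.append(row)
    return {"date-parts": rows}
```

Round-tripping the example date through both functions gives back the original structure, which is the property a real converter would need to preserve.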

bdarcus commented 4 years ago

Is pandoc's CSL YAML actually identical to CSL JSON?

I hadn't checked, but wondering if devs like @jgm would find value in this.

Advantage is we have one schema, that is always in sync with the CSL spec.

bwiernik commented 4 years ago

People definitely seem to like YAML in my research community. Easier to manually edit if needed, like BibTeX or RIS.

bdarcus commented 4 years ago

People definitely seem to like YAML in my research community. Easier to manually edit if needed, like BibTeX or RIS.

Exactly.

YAML is easily hand-editable. JSON isn't.

denismaier commented 4 years ago

YAML is easily hand-editable. JSON isn't.

And with auto-completion it would be even better!

denismaier commented 4 years ago

@retorquere Do you have any input here?

bdarcus commented 4 years ago

YAML is easily hand-editable. JSON isn't.

And with auto-completion it would be even better!

https://github.com/liuderchi/ide-yaml/issues/56

retorquere commented 4 years ago

Is pandoc's CSL YAML actually identical to CSL JSON? I'm not sure, but don't they differ in at least some aspects? E.g., regarding dates, CSL JSON has this construct:

To the best of my knowledge, this is the only difference. circa and season are supported here, circa at the same level as date-parts, not per-date. But I do not know of a formal spec.

jgm commented 4 years ago

Is pandoc's CSL YAML actually identical to CSL JSON?

No, not exactly. In addition to the difference noted (and maybe others which I've forgotten), the YAML bibliographies read by pandoc can have arbitrary pandoc markdown formatting. (And NOT the CSL HTML-ish formatting.) So it's not just a YAML translation of CSL JSON.

As I develop my new citeproc library, I may change things a bit to line things up more, while preserving backwards compatibility. For example, I think a date-parts field should be allowed, but I'd try to keep the current more elegant-looking syntax as an option too.

bdarcus commented 4 years ago

When we originally designed the json, focus was on machines.

But given evolution since, now might be a time to rethink some of the decisions, so we end with a solid representation well-suited to humans?

retorquere commented 4 years ago

@jgm but the html markup that csl-json supports is also valid markdown, so for export, there'd be no problem.

Edit: wait, does pandoc only support markdown tags, and not the html tags?

What other tools besides pandoc read csl-yaml?

bwiernik commented 4 years ago

I think accepting dates in either format would be fine in either schema. We just need to specify the order of priority for redundant parts (cf. the issue raised that we should specify that, in names, specific name parts are given priority over a literal field; we could do the same with dates, perhaps with date-parts given priority over the individual pieces).
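A rough Python sketch of such a priority rule, assuming a hypothetical date object that may carry both a date-parts array and individual year/month/day keys. Neither the combined shape nor the name resolve_date comes from the schema; this only illustrates "date-parts wins when both are present".

```python
def resolve_date(date):
    """Return the date as a list of [year, month, day] rows, preferring date-parts."""
    if date.get("date-parts"):
        # date-parts takes priority; any individual keys are ignored.
        return [list(parts) for parts in date["date-parts"]]
    row = []
    for key in ("year", "month", "day"):
        if key not in date:
            break
        row.append(int(date[key]))
    return [row] if row else []
```

A processor following this rule would read {"date-parts": [[2015, 4, 2]], "year": 1999} as 2 April 2015 and silently ignore the conflicting year. Whether a conflict should instead be a validation error is a separate decision.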

With respect to markdown, I'm leery about making that universal, but I think we could add a flag to the data indicating that the data should be read as markdown.

bwiernik commented 4 years ago

@retorquere I think he is saying that pandoc CSL YAML supports markdown syntax in addition to HTML syntax.

I'm a little concerned about assuming that, for example, _ or * always indicate text formatting.

retorquere commented 4 years ago

And the markdown escapes of those and others of course. Even if you only use html for markup. I hadn't thought of that before and I'll have to think about what to do for BBT's csl-yaml export.

I think it'd need to be explicitly marked if you want markdown processing, or markdown would have to be the default for csl-yaml. I'd rather not deal with ambiguity.

Are there other csl-yaml processors? If not, then I could have BBT default to markdown.

bwiernik commented 4 years ago

As the main consumer of CSL YAML is currently and I suspect will remain pandoc, I think we could make the default markdown with an option to disable?

pandoc interprets the HTML markup, so I'd suggest BBT not worry about translating the HTML tags into markdown.

retorquere commented 4 years ago

I'm not worried about those, but about whether to escape if I find * or #. They just mean different things to markdown and html.

bdarcus commented 4 years ago

@retorquere I think he is saying that pandoc CSL YAML supports markdown syntax in addition to HTML syntax.

I read John to say above that he doesn't support the HTML.

bwiernik commented 4 years ago

If a user is generating CSL YAML with BBT, I'd expect that not escaping markdown characters (BBT's current behavior) is the better default.

bwiernik commented 4 years ago

I read John to say above that he doesn't support the HTML.

I checked. The HTML markup is supported in CSL YAML by pandoc (which makes sense because it is valid HTML).

bdarcus commented 4 years ago

And with auto-completion it would be even better!

Did you get YAML auto-completion working with vscode?

Might be cool if we could have CSL extensions for both vscode and, per the thing I started, atom, so we could give people easy-to-install auto-completing editors for CSL styles and data.

denismaier commented 4 years ago

And with auto-completion it would be even better!

Did you get YAML auto-completion working with vscode?

Might be cool if we could have CSL extensions for both vscode and, per the thing I started, atom, so we could give people easy-to-install auto-completing editors for CSL styles and data.

Both yes!

denismaier commented 4 years ago

What doesn't work in vs code is style validation...

bdarcus commented 4 years ago

Seems vscode doesn't support relaxng validation, and is dependent on this issue to add it.

bwiernik commented 4 years ago

Are RNC and XSD feature compatible? There are numerous XSD validators for vscode. It might be possible to automatically generate an unofficial XSD schema from the RNC for use with editors.

jgm commented 4 years ago

The HTML markup is supported in CSL YAML by pandoc (which makes sense because it is valid HTML).

Well, pandoc will pass through raw HTML as "RawInline" elements. And these will be emitted in HTML output. But if you target, say, LaTeX, they'll just be omitted. So it's not really supported.

retorquere commented 4 years ago

Well, pandoc will pass through raw HTML as "RawInline" elements. And these will be emitted in HTML output. But if you target, say, LaTeX, they'll just be omitted. So it's not really supported.

That and the fact that if you assume markdown, *word* becomes \emph{word}, and if you assume HTML, it becomes *word*. And if pandoc is the sole consumer of CSL-YAML, it seems better to escape.
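For illustration, escaping on export could look roughly like the Python sketch below. The character set is an assumption (a small subset of punctuation that pandoc's Markdown treats as inline syntax), and escape_markdown is a hypothetical helper, not BBT's actual code. It relies on Markdown's backslash escaping of ASCII punctuation.

```python
import re

# Backslash, backtick, and the punctuation that commonly triggers inline
# Markdown formatting. Angle brackets are left alone so HTML-like tags survive.
_MD_SPECIAL = re.compile(r"([\\`*_#\[\]^~$])")


def escape_markdown(text):
    """Prefix Markdown-significant characters with a backslash."""
    return _MD_SPECIAL.sub(r"\\\1", text)
```

With this, a title such as "C* algebras and snake_case #1" is written out as "C\* algebras and snake\_case \#1", so a Markdown-reading consumer renders the original characters instead of interpreting them as formatting.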

bdarcus commented 4 years ago

Are RNC and XSD feature compatible? There are numerous XSD validators for vscode. It might be possible to automatically generate an unofficial XSD schema from the RNC for use with editors.

They are not feature compatible. RNG has some features we rely on, like unordered content models, that XSD does not support.

Trang will convert our schema to XSD, but it might not validate exactly the same. Here's the warnings for the v1.1 schema:

$ trang schemas/styles/csl.rnc csl.xsd
csl-choose.rnc:14:20: warning: cannot represent an optional group of attributes; approximating
csl-choose.rnc:15:5: warning: choice between attributes and children cannot be represented; approximating
csl.rnc:564:8: warning: choice between attributes and children cannot be represented; approximating
csl.rnc:651:50: warning: cannot represent an optional group of attributes; approximating

My guess is that "approximating" means introducing a looser constraint in these cases, so that a style valid against the XSD schema may not be valid against the RNG.

It might be worth it, at this point, to see what the practical difference is, and whether we could tweak the RNG schemas to convert perfectly. Except that the feature flagged in the warnings above seems pretty important to have.

bdarcus commented 4 years ago

... if pandoc is the sole consumer of CSL-YAML, it seems better to escape.

The point of me opening this issue is to explore the possibility of promoting wider use, so that in the future it wouldn't be the "sole consumer."

retorquere commented 4 years ago

Alright, then I suppose I can wait out the choice on whether the behavior of pandoc is going to be the default, or going to be judged to be in error 😛 .

bdarcus commented 4 years ago

I'm not discouraging you; just providing context ;-)

I certainly would like md support.

bwiernik commented 4 years ago

Well, pandoc will pass through raw HTML as "RawInline" elements. And these will be emitted in HTML output. But if you target, say, LaTeX, they'll just be omitted. So it's not really supported.

We've discussed before about formally supporting the HTML-like markup in CSL-JSON. Here are some comments from @dstillman that capture my concerns: https://discourse.citationstyles.org/t/sub-field-parsing/550/21

I think that, ideally, we would adopt one consistent set of markup formatting across CSL YAML and CSL-JSON, so that users and clients could reliably generate consistent outputs across processors (e.g., if a user switches from writing in Zotero/Word/CSL-JSON to Zotero/pandoc/CSL YAML, their markup should yield the same results without needing to be edited). Having one set of markup be used in one format and another in another format seriously limits the interoperability (cf. @retorquere goes to Herculean lengths to wrangle TeX and CSL markup together). The HTML-like markup would make it reasonably easy for diverse applications to handle markup using, e.g., rich text, HTML-like tags, or markdown as desired, then converting to a common markup format when generating the CSL data. The HTML syntax creates fewer opportunities for unexpected conversion than markdown.

I also agree with Dan's concerns about the ambiguities of handling full Markdown in CSL. For example, what would the meaning of > or # be? Is it expected that a citation might become a blockquote or heading unexpectedly? Could we determine a more limited set of Markdown that should be supported? Then, parsing of those limited Markdown characters could be reasonably controlled by adding a support_markdown: "true" element to the CSL YAML?

For CSL YAML generated from an application with rich text formatting, the application can convert the rich text to the HTML-like tags and then set support_markdown: "false" (or omit support_markdown if we decide the default should be false rather than true). A limited set of markdown features would then make writing a conversion function to go between the two markup formats much easier; an application would only have to handle a limited number of cases instead of having a full markdown parser.

Proposal

So, my proposal would be:

  1. ask @jgm to add support for the HTML-like syntax in CSL YAML in pandoc.
  2. we agree to limit Markdown syntax supported to the analogs of the existing HTML-like syntax.
  3. we add a support_markdown flag to CSL YAML and CSL-JSON indicating whether the data should be interpreted as markdown or not (we can determine the best default)
  4. we provide reference conversion functions to convert markdown syntax to the HTML-like syntax (which is regarded as the canonical form)
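As a sketch of what the reference conversion in point 4 might look like for a limited Markdown subset, here is a regex-based Python version. The subset (bold, italics, and pandoc-style ^superscript^ and ~subscript~) and the function name are assumptions for illustration. It does not handle backslash escapes, underscores, or nesting such as ***x***, which a real converter would need to.

```python
import re

# Order matters: bold (**x**) must be rewritten before italics (*x*).
_RULES = [
    (re.compile(r"\*\*(.+?)\*\*"), r"<b>\1</b>"),
    (re.compile(r"\*(.+?)\*"), r"<i>\1</i>"),
    (re.compile(r"\^(.+?)\^"), r"<sup>\1</sup>"),
    (re.compile(r"~(.+?)~"), r"<sub>\1</sub>"),
]


def markdown_to_csl_html(text, support_markdown=True):
    """Rewrite a limited Markdown subset into the HTML-like canonical form."""
    if not support_markdown:
        return text  # the data is already in (or should be left as) HTML-like markup
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```

The support_markdown parameter mirrors the proposed flag: when it is false, the field content passes through untouched, so a literal asterisk stays a literal asterisk.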

Alternative proposal

If we would agree that CSL YAML is primarily intended to be used in Markdown environments (whether pandoc or otherwise), we could specify that CSL YAML should be parsed for Markdown formatting. Then:

  1. applications importing CSL YAML would know that they need to parse for Markdown on import
    • If they don't wish to package any amount of Markdown parser, they can choose to not import CSL YAML
  2. applications not wanting to support Markdown can know that they need to escape Markdown characters when generating CSL YAML
  3. we provide a CSL YAML–CSL-JSON converter to facilitate data transfer between applications with varying support

In this case, I still think these two should apply:

  1. a limited set of relevant Markdown syntax is supported
  2. pandoc recognizes the HTML-like syntax in CSL YAML

bdarcus commented 4 years ago

What is the formatting subset we need in this case, for what kinds of use cases?

Obviously the most basic is emphasis/italic (for embedded titles, and maybe latin phrases?) and strong/bold, but what beyond that?

Math is obviously the complex one; not sure how to handle that.

jgm commented 4 years ago

What makes most sense to me is to decouple the structure issue from the formatting issue. You could define a structure for citations (similar to CSL JSON, or perhaps with modifications for better human read/write-ability). The contents that fill this structure could be in any format you want to use (markdown, CSL-JSON HTMLish, full HTML, DocBook, whatever). So to know how to parse the thing, you'd need to know whether it's CSL YAML + Markdown, or CSL YAML + HTML, or whatever.

There are some constraints. Whatever format you use, certain operations would need to be defined in order for citations to be processed: in the library I'm working on, these include converting to and from plain text, dropping textual content from the beginning or end, moving punctuation inside quotes, and adding formatting (font variant, font style, etc.). In my library I use a type class to define the behaviors that are needed; for any format that can be used with the library, you'll need to define these transformations. One also needs a way to mark up content as "nocase."

My ideas here stem from working on pandoc-citeproc. When I originally forked Andrea Rossato's library citeproc-hs, the main reason was that I wanted people to be able to specify things like document titles, abstracts, etc. using the full expressive power of pandoc (e.g., titles often contain math). So I changed things to use formatted Pandoc types for some of the fields. My new approach is to design a more generic library that can be used, in principle, with many different structured document types.

retorquere commented 4 years ago

We've discussed before about formally supporting the HTML-like markup in CSL-JSON. Here are some comments from @dstillman that capture my concerns: https://discourse.citationstyles.org/t/sub-field-parsing/550/21

Of this, the one thing I don't agree with is math. Not that it'd be easy to fix, far from it, but it's not something that can be left to the tools that call citeproc because unicode (which is the only resort under the current circumstances) is just nowhere near expressive enough to get equivalents of even fairly simple math expressions.

Math is obviously the complex one; not sure how to handle that.

MathML?

bwiernik commented 4 years ago

I think it is really important to keep CSL data interoperable between applications, processors, and formats. There are a lot of CSL users, for example, using both pandoc and a Word or Google Docs plugin. We need to have an easy way for their data to be transportable across both those environments. Providing a limited set of allowed markup and a canonical form makes it easiest.

The existing CSL HTML-ish tags are:

  1. <i>...</i> italics
  2. <b>...</b> bold
  3. <span style="font-variant:small-caps;">...</span> small capitals (pandoc also supports <sc>...</sc> which I think we should adopt)
  4. <sub>...</sub> subscript
  5. <sup>...</sup> superscript
  6. <span class="nocase">...</span> case-protect (I think we should also adopt <nc>...</nc>)
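A small Python sketch of how a consumer might recognize this tag set, using only the standard library's html.parser. The class and function names are hypothetical, the short sc and nc forms are included only because they are suggested above, and mismatched or stray closing tags are not handled properly.

```python
from html.parser import HTMLParser


class CslMarkupParser(HTMLParser):
    """Split a field value into (text, formats) runs for the tags listed above."""

    SIMPLE_TAGS = {
        "i": "italic", "b": "bold", "sub": "subscript",
        "sup": "superscript", "sc": "small-caps", "nc": "nocase",
    }

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # one entry per open tag; None for unrecognized tags
        self.runs = []

    def _format_for(self, tag, attrs):
        if tag in self.SIMPLE_TAGS:
            return self.SIMPLE_TAGS[tag]
        if tag == "span":
            attributes = dict(attrs)
            if attributes.get("class") == "nocase":
                return "nocase"
            if "small-caps" in (attributes.get("style") or ""):
                return "small-caps"
        return None

    def handle_starttag(self, tag, attrs):
        self.stack.append(self._format_for(tag, attrs))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        formats = frozenset(f for f in self.stack if f)
        self.runs.append((data, formats))


def parse_runs(text):
    parser = CslMarkupParser()
    parser.feed(text)
    parser.close()  # flushes any trailing text
    return parser.runs
```

Each run pairs a piece of text with the set of formats active at that point, which is a representation that rich text, HTML, or Markdown writers could all be generated from.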

Math is the other issue. I think we could allow math as well. Most bibliographic data is provided using unicode or HTML entities, rather than in equations (e.g., simple things like χ<sup>2</sup>). In cases where full TeX/MathJax is used (e.g., https://journals.aps.org/prd/abstract/10.1103/PhysRevD.48.3190), GUI applications currently don't really implement that at all. In the case of Word, though, the processor could tell Word to format this as an equation, because Word supports LaTeX syntax.

Of this, the one thing I don't agree with is math. Not that it'd be easy to fix, far from it, but it's not something that can be left to the tools that call citeproc because unicode (which is the only resort under the current circumstances) is just nowhere near expressive enough to get equivalents of even fairly simple math expressions.

Unicode does have a full-featured math syntax that is comparable to TeX or MathJax (https://www.unicode.org/notes/tn28/tn28-5.html). This is the syntax primarily used by the Microsoft Equation Editor.

bwiernik commented 4 years ago

tl;dr on my above comment, let's adopt the existing HTML-ish tags and also math with the $...$ syntax. Applications without the ability to process math can render as plain text, as they do currently.

denismaier commented 4 years ago

Or just <math>...</math> for math? Would be at least in line with the other tags.

bdarcus commented 4 years ago

What makes most sense to me is to decouple the structure issue from the formatting issue.

I agree, @jgm, but how do you ground this general approach in the specifics of this case?

What are the structures we need to support, or an example of such a structure, from your perspective?

When I tend to think of structure vs presentation/formatting, for example, I think about a "title" or a "latin phrase" vs "italic".

But I'm not sure if that's what you mean.

let's adopt the existing HTML-ish tags and also math with the $...$ syntax. Applications without the ability to process math can render as plain text, as they do currently.

Probably we should split off a separate issue for embedded markup?

bwiernik commented 4 years ago

Or just <math>...</math> for math? Would be at least in line with the other tags.

Yeah, the $...$ was for markdown sorry.

bwiernik commented 4 years ago

Probably we should split off a separate issue for embedded markup?

We could, but the major question about YAML support is handling embedded markup. Beyond that, supporting CSL YAML in a way that is compatible with pandoc's current approach is the trivial extension of date formatting.

There are some constraints. Whatever format you use, certain operations would need to be defined in order for citations to be processed: in the library I'm working on, these include converting to and from plain text, dropping textual content from the beginning or end, moving punctuation inside quotes, and adding formatting (font variant, font style, etc.). In my library I use a type class to define the behaviors that are needed; for any format that can be used with the library, you'll need to define these transformations. One also needs a way to mark up content as "nocase."

Most of these are entirely other questions I think--e.g. moving punctuation inside quotes is behavior defined in the spec. Behavior defined by styles is a different issue than handling markup embedded into the data, which is what is at question.

bdarcus commented 4 years ago

We don't currently document embedded html, do we?

bwiernik commented 4 years ago

No.

bwiernik commented 4 years ago

(My reason for suggesting it as a canonical form of markup is that it is the syntax used in a lot of CSL-JSON in the wild, and it is easily generated from other markup syntaxes, e.g., rich text.)

jgm commented 4 years ago

Providing a limited set of allowed markup and a canonical form makes it easiest

I can't see pandoc changing and allowing only the weird HTML subset CSL-JSON currently supports for formatting. People are used to being able to use all the formatting at their disposal and having it work in bibliographies, and I'd like to keep that feature. If what you want is an exchange format, then existing CSL JSON works fine. Pandoc YAML bibliographies can be converted mechanically into that (with a loss of expressivity), and converted from that.

On math: this is a great example of what I'm talking about. There are several different ways to support math in documents -- TeX math and MathML being the two most common. (Just using unicode characters is not possible; you need something to indicate the complex structures of things like matrices, fractions, and limits.) Pandoc users will want to use TeX math for their math, and have that pass through successfully, since that's well supported in pandoc markdown. But if I'm designing a CSL workflow for DocBook, for example, I'll want to use MathML.

What are the structures we need to support, or an example of such a structure, from your perspective?

Here's the typeclass definition I'm using now:

class (Semigroup a, Monoid a, Show a, Eq a, Ord a) => CiteprocOutput a where
  toText                      :: a -> Text
  fromText                    :: Text -> a
  dropTextWhile               :: (Char -> Bool) -> a -> a
  dropTextWhileEnd            :: (Char -> Bool) -> a -> a
  addFontVariant              :: FontVariant -> a -> a
  addFontStyle                :: FontStyle -> a -> a
  addFontWeight               :: FontWeight -> a -> a
  addTextDecoration           :: TextDecoration -> a -> a
  addVerticalAlign            :: VerticalAlign -> a -> a
  addTextCase                 :: TextCase -> a -> a
  addDisplay                  :: DisplayStyle -> a -> a
  addQuotes                   :: a -> a
  movePunctuationInsideQuotes :: a -> a
  mapText                     :: (Text -> Text) -> a -> a

This encodes everything the citeproc processor needs to know about the document format (here a) in order to process citation data encoded in format a. In another module, I define instances of this typeclass for a structure that maps precisely to CSL JSON (with the limited HTML tags it supports), and I use this for the tests. But I can also define instances for pandoc structures, and use these in pandoc. Hope that gives you a better idea what I'm talking about.
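For readers less familiar with Haskell type classes, here is a loose Python analogue covering only four of the operations. It is an illustration of the idea, not a port of the library. HtmlishText is a hypothetical toy instance that wraps plain text in tags and skips HTML escaping.

```python
from abc import ABC, abstractmethod
from typing import Callable


class CiteprocOutput(ABC):
    """Operations a citation processor needs from any content format."""

    @classmethod
    @abstractmethod
    def from_text(cls, text: str) -> "CiteprocOutput": ...

    @abstractmethod
    def to_text(self) -> str: ...

    @abstractmethod
    def drop_text_while(self, pred: Callable[[str], bool]) -> "CiteprocOutput": ...

    @abstractmethod
    def add_font_style(self, style: str) -> "CiteprocOutput": ...


class HtmlishText(CiteprocOutput):
    """Toy instance: plain text plus a list of wrapping tags, innermost first."""

    def __init__(self, text, tags=()):
        self.text = text
        self.tags = tuple(tags)

    @classmethod
    def from_text(cls, text):
        return cls(text)

    def to_text(self):
        return self.text

    def drop_text_while(self, pred):
        # Drop leading characters for which pred is true, keeping the formatting.
        i = 0
        while i < len(self.text) and pred(self.text[i]):
            i += 1
        return HtmlishText(self.text[i:], self.tags)

    def add_font_style(self, style):
        if style == "italic":
            return HtmlishText(self.text, self.tags + ("i",))
        return self  # other styles are ignored in this sketch

    def render(self):
        out = self.text
        for tag in self.tags:
            out = f"<{tag}>{out}</{tag}>"
        return out
```

A processor written against CiteprocOutput never inspects the markup itself. For example, HtmlishText.from_text("  Title").drop_text_while(str.isspace).add_font_style("italic").render() yields "<i>Title</i>", and a Markdown or pandoc-backed instance could supply the same operations with a different rendering.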

bwiernik commented 4 years ago

@jgm But how does pandoc-citeproc handle things like >, #?

bwiernik commented 4 years ago

If what you want is an exchange format, then existing CSL JSON works fine.

My concern is, for example, an author working in pandoc who then needs to switch to Word to work with a collaborator. If their data are marked up in Markdown, then this no longer produces the correct output. That's an unpleasant experience and one which I fear will engender negative attitudes toward CSL or the applications implementing it.

Beyond math, what other markdown syntax are you expecting users to want to use other than italics, bold, sub/superscript, small caps, and nocasing? I think we can come to a common set of markup features that citeprocs and CSL applications can be expected to process.

jgm commented 4 years ago

My concern is, for example, an author working in pandoc who then needs to switch to Word to work with a collaborator. If their data are marked up in Markdown, then this no longer produces the correct output.

If they need to produce a Word document with a formatted bibliography, this should work fine, since pandoc + pandoc-citeproc can translate all pandoc markdown features to Word reliably.

If they need to produce a CSL JSON bibliography, because of a workflow the collaborator has that uses this, then yes, they may lose something.

Beyond math, what other markdown syntax are you expecting users to want to use other than italics, bold, sub/superscript, small caps, and nocasing?

Pandoc's inline formatting allows, in addition to those things: underline, strikeout, structured quoted content (not just quotation marks which have to be paired using possibly unreliable heuristics), citations, code (preformatted text, perhaps with attributes), soft or hard line breaks, math (display or inline), hyperlinks, images, footnotes, spans with arbitrary attributes, and raw content in a target format. Of these things, the ones I think are most important for citations are hyperlinks, spans (which can be used for various purposes, e.g. marking up text as being in a certain language), code, and math.

But why stop at inline formatting? CSL bibliographies can contain abstracts. An abstract can in principle have more than one paragraph. An abstract might contain a bulleted list, footnotes, a code sample, a table, or even a figure. All of this goes way beyond what can be represented in CSL JSON.

I think it's fine if some people don't want to make all of that possible in bibliographic data. But I don't like the idea that it should be made impossible. That's why I suggested that the CSL YAML structure not dictate anything about the format filling the slots, except to specify some minimal conditions necessary for citation processing (e.g. it must be possible to put something in italics). That allows people to support more expressive formats, without requiring it. Interchange between more and less expressive formats will always be lossy, but that is true generally for document conversion -- it's not specific to the bibliographic parts -- so I don't see it as a big deal.

bwiernik commented 4 years ago

Personally, I don’t see why we couldn’t recommend or expect support for all of those things. I don’t really know why Frank chose the specific HTML tags that he did—probably a lack of imagination like mine. We could recommend a much wider array of HTML tags be supported; I don’t think we would get much pushback on that.

I have two major points. First, I think we should ask for text markup to be compatible with a specified set of HTML-like tags. That makes HTML-like markup maximally transportable without translation. Second, I think we should strive to have a set of features we agree that all applications and processors should support in some way. All of the features you list, @jgm, could be handled with rich text, which could be translated easily to HTML tags at a minimum. I’d like to strive for interoperability across implementations.

(The pandoc/Word compatibility concern with CSL-JSON isn’t necessarily about one document: someone with a Zotero library might use it with Word in one document, then use it with pandoc in another. The bibliographies generated should be as consistent as possible.)

retorquere commented 4 years ago

Unicode does have a full-featured math syntax that is comparable to TeX or MathJax (https://www.unicode.org/notes/tn28/tn28-5.html).

How widely supported is this? I disagree with the author that this reads better than TeX -- those UnicodeMath expressions remind me (and not fondly) of APL -- but if it's widely supported (and citeproc would do the right thing when it encounters it), I'm going to look at my im/exporters to incorporate it.

bdarcus commented 4 years ago

I don’t really know why Frank chose the specific HTML tags that he did—probably a lack of imagination like mine.

These conversations go back more than a decade, and I think he just did the minimum that he thought would work.

We had then talked about the idea of inline semantic classes, but in retrospect, that's probably not needed, and the "flip-flopping" behavior for quotes and italics should go a long way.

We could recommend a much wider array of HTML tags be supported; I don’t think we would get much pushback on that.

I don't think we should be too focused on "tags" up front; that should be a secondary aspect for an HTML representation.

Think more in terms of language like:

CSL processors must parse field content on _____ input data for the following: 

- italics
- bold
- code
- quotes
- [whatever more we want to add] 

[insert description of what syntax the processor must/should parse; I would like to see both html and markdown]

[insert description of what the processor must do with that data, and with inline markup beyond the list above]

... I suggested that the CSL YAML structure not dictate anything about the format filling the slots, except to specify some minimal conditions necessary for citation processing (e.g. it must be possible to put something in italics).

So we need to specify some core subset necessary for citations, and we should not be prescriptive beyond that?