citation-style-language / schema

Citation Style Language schema
https://citationstyles.org/
MIT License

"Add" CSL YAML #278

Open bdarcus opened 4 years ago

bdarcus commented 4 years ago

This and this suggest we can use our JSON schemas to validate a YAML alternative.

Pandoc already supports a YAML alternative (cc @jgm).

I suggest we do something with this, since it's zero work for us, and would give more options for users and developers.

Perhaps the most sensible option is just adding a sentence to the spec that mentions this possibility, without requiring implementations to support it?

Edit: actually, we say nothing about input in the spec currently. So we would need to add a section on input data, and say that our schema can validate either json or yaml.
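To make the "validate either json or yaml" idea concrete, here is a minimal sketch. It assumes a toy schema and a toy validator (`TOY_SCHEMA` and `validate` are hypothetical names, not the CSL schema or a real JSON Schema library). The point it illustrates is only that validation runs on the parsed data, so the same check applies whether the record came from JSON text or from a YAML parser.

```python
import json

# Toy stand-in for a JSON Schema validator. It only understands "required"
# and "properties.<name>.type"; real validation would use a full JSON Schema
# implementation and the actual CSL schema.
TOY_SCHEMA = {
    "required": ["id", "type"],
    "properties": {
        "type": {"type": "string"},
        "title": {"type": "string"},
    },
}

PY_TYPES = {"string": str, "object": dict, "array": list}


def validate(item, schema):
    """Return a list of error messages; an empty list means the item passed."""
    errors = []
    for key in schema.get("required", []):
        if key not in item:
            errors.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in item and not isinstance(item[key], PY_TYPES[rule["type"]]):
            errors.append(f"{key} should be of type {rule['type']}")
    return errors


# The record as JSON text ...
from_json = json.loads('{"id": "doe2020", "type": "book", "title": "A Title"}')

# ... and as the mapping a YAML parser would be expected to return for:
#   id: doe2020
#   type: book
#   title: A Title
from_yaml = {"id": "doe2020", "type": "book", "title": "A Title"}

print(validate(from_json, TOY_SCHEMA))
print(validate(from_yaml, TOY_SCHEMA))
print(validate({"id": "x", "title": 42}, TOY_SCHEMA))
```

Because both inputs reduce to the same mapping, nothing in the validator needs to know which serialization was used.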

Proposal

Based on this discussion, we should:

  1. review the existing json schema for any possible adjustments we might want to make now. Date parts are one obvious discrepancy; are there any others?
  2. add an optional field for the content markup format to parse; the html subset could be default, and we could enumerate other options; say markup: org.

Originally posted by @bdarcus in https://github.com/citation-style-language/schema/issues/277#issuecomment-650192158

bwiernik commented 4 years ago

How widely supported is this?

This is a bit off topic for this thread, so we can discuss elsewhere if you like. It's what Microsoft Office uses for equations, so it arguably is more widely used than TeX. I am definitely a much bigger fan of writing with this syntax than TeX, namely for its fraction syntax.

bwiernik commented 4 years ago

So we need to specify some core subset necessary for citations, and we should not be prescriptive beyond that?

In CSL-JSON at least, I think we should be prescriptive on syntax so that there can be a common interchange format for CSL applications. CSL data should "just work" regardless of which processor or application is using it. In CSL YAML, we could include an element to indicate what markup syntax is being used like John proposes, but then I think it will be really important for there to exist reference functions to convert from that markup syntax to the CSL-JSON syntax.

There is a major group of CSL users who (1) use a GUI program like Mendeley, ReadCube, or Zotero to curate their CSL data and (2) use this curated library to write in both Markdown and word processors like Google Docs or Word. There needs to be a way for users to have that workflow and not be expected to reformat the markup in their item data every time they use a piece of item data in a different application.

jgm commented 4 years ago

I agree completely about having a well-defined common interchange format in CSL JSON, and that it should be more expressive than it currently is (math is particularly important). I just think there's something to be said for also having a more human-centered format (CSL YAML) that does not specify a particular "content" format. If I'm using Markdown, I'll want my human-readable bibliography to be in Markdown. If I'm using Emacs Org-mode, I'll want it to be in Org markup. If I'm using JATS, I'll want it to be in JATS.

bdarcus commented 4 years ago

So we need to specify some core subset necessary for citations, and we should not be prescriptive beyond that?

In CSL-JSON at least, I think we should be prescriptive on syntax so that there can be a common interchange format for CSL applications.

Two things:

First, it occurs to me we might want to define what we mean by "prescriptive."

What would happen to my data if I add non-standard markup to, say, a title, in Zotero? Would it get stripped from the CSL json/yaml (as in, literally not allowed), or simply ignored by the csl processor (the markup; not the content)?

When I say it shouldn't be prescriptive, I am saying I am opposed to the former; not the latter of course.

You?

Second, just a reminder that pandoc is pretty much the most universal document converter in the world; I think we should seriously consider the perspective of its author.

bdarcus commented 4 years ago

Actually, we shouldn't really be caring what the app does; as Dan said elsewhere, that's between the devs and its users.

So really what I'm saying, I think, is a CSL processor shouldn't throw an error or strip content inside markup it doesn't understand.

bdarcus commented 4 years ago

I agree completely about having a well-defined common interchange format in CSL JSON, and that it should be more expressive than it currently is (math is particularly important). I just think there's something to be said for also having a more human-centered format (CSL YAML) that does not specify a particular "content" format.

So what would you suggest, specifically, @jgm?

  1. per @bwiernik, define a strict JSON representation, with a small subset of HTML, and say nothing at all about content for a YAML representation?
  2. something else, that maybe provides some bridge, and so does say something about the YAML representation

Also, do we need a field in the YAML to indicate which embedded markup, maybe with some default?

If I'm using Markdown, I'll want my human-readable bibliography to be in Markdown. If I'm using Emacs Org-mode, I'll want it to be in Org markup. If I'm using JATS, I'll want it to be in JATS.

I think this is indeed the bottom line.

retorquere commented 4 years ago

@bwiernik I'd love to discuss it further, but within the context of this discussion I meant: if a user would paste this into a title, is it realistic to expect that it'd show up in word/LO/Gdocs as math? If passed through pandoc, would it show up in most targets (but at least html, word, LO, maybe pdf)?

bwiernik commented 4 years ago

@retorquere In Word, yes. LibreOffice has its own equation syntax that nothing else uses; I don't know if it additionally supports TeX or Unicode entry. GDocs doesn't natively support equations at all; there is a MathType plugin.

bwiernik commented 4 years ago

So how about this:

Proposal

  1. We define a set of markup features that citeprocs should be able to process.
  2. We define the syntax for CSL-JSON, likely building on the existing HTML-like syntax
  3. Processors of CSL YAML should recognize and support the CSL-JSON markup syntax regardless of other markup format used
  4. CSL YAML includes a markup element indicating the type of markup used ("none", HTML, Markdown, TeX, Docbook, etc.) for other markup syntax.
    • Specifying "none" would allow a Zotero user, for example, to stop pandoc from parsing Markdown without having to escape the characters (which would interfere with using the data with a word processor).
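A rough sketch of how points 3 and 4 might fit together, assuming a hypothetical `markup` key on an already-parsed record and a deliberately tiny Markdown subset. The names `CONVERTERS`, `markdown_to_csl`, and `to_csl_json_title` are invented for illustration; this is not pandoc's or any processor's behavior.

```python
import re


def markdown_to_csl(text):
    """Convert a tiny Markdown subset (**bold**, *italic*) to CSL-JSON-style tags."""
    text = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", text)
    text = re.sub(r"\*(.+?)\*", r"<i>\1</i>", text)
    return text


def identity(text):
    """Leave the text untouched."""
    return text


# Hypothetical values for the hypothetical `markup` element.
CONVERTERS = {
    "none": identity,      # characters are taken literally, nothing is parsed
    "html": identity,      # assumed to already use the CSL-JSON tag subset
    "markdown": markdown_to_csl,
}


def to_csl_json_title(record):
    """Return the record's title converted to CSL-JSON-style markup."""
    markup = record.get("markup", "html")  # assumed default
    if markup not in CONVERTERS:
        raise ValueError(f"unsupported markup: {markup}")
    return CONVERTERS[markup](record["title"])


print(to_csl_json_title({"title": "The *Iliad* and **Homer**", "markup": "markdown"}))
print(to_csl_json_title({"title": "2 * 3 * 4", "markup": "none"}))
```

With `markup: none`, the asterisks in the second title survive unchanged, which is the case the sub-bullet above describes.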

Second, just a reminder that pandoc is pretty much the most universal document converter in the world; I think we should seriously consider the perspective of its author.

@jgm I mean no disrespect; sorry if I've come off otherwise. pandoc is the single most amazing piece of software I've encountered, and I use it daily. I'm just hoping to maintain as much compatibility for GUI users as possible.

jgm commented 4 years ago

LibreOffice has its own equation syntax that nothing else uses

LibreOffice (well, the underlying opendocument XML format) can handle MathML. That's a standardized way of marking up math. Pandoc has a library that converts between TeX, MathML, and OOXML (the representation Word uses), and also GNU eqn. Of these TeX and eqn are the only ones that humans can easily read/write.

jgm commented 4 years ago

@bwiernik -- no worries, no disrespect was ever felt! I understand your point of view and see the reasons behind it.

On the proposal: It seems to me that there are two separate issues, and maybe they're fairly independent of each other.

The first is whether CSL JSON needs additional syntax defined for the textual parts of fields, and what it should be. If you want CSL JSON to be a universal exchange format, so that it should be powerful enough to represent any bibliographic data one might want to work with, then you'd probably need to add quite a lot: a way of representing math, links and other kinds of inline formatting, and the block-level formatting one might need in e.g. an abstract. At this point one wonders, why not just say the content format is HTML5 (with some convention for math)? The current CSL JSON model is pretty weird -- certain tags are recognized, but outside of that tags are read literally.

The second issue is whether anything should be defined for CSL YAML. The point of this would be to be more human readable/writeable than the CSL JSON format. Here there are two sub-issues:

(a) Structure: should the structure be more forgiving or human friendly than that of CSL JSON, e.g. as regards dates?

(b) Content: should the syntax of the content parts be defined, or should that be a parameter (so that one can have CSL YAML Markdown or CSL YAML Docbook, for example)?

Sounds like you are now more favorable to allowing the format in (b) to be a parameter, and I'm happy about that. However, you say:

Processors of CSL YAML should recognize and support the CSL-JSON markup syntax regardless of other markup format used

I think that's going to be problematic, because this syntax won't blend well with some possible syntaxes. (Suppose the content syntax is LaTeX, for example: you'd then need a parser that can switch in and out of TeX or HTML-ish modes, it would really be a complex mess, I think. Handling raw HTML bits in Markdown is enough of a mess already.)

I don't really see the need for this additional constraint. As long as your CSL YAML document can be reliably converted to CSL JSON, the interchange aspect is covered.

dstillman commented 4 years ago

At this point one wonders, why not just say the content format is HTML5 (with some convention for math)? The current CSL JSON model is pretty weird -- certain tags are recognized, but outside of that tags are read literally.

It's really only weird if you view it as HTML, but it's not HTML. It's just borrowing some HTML tags as a simple way of conveying meaning to the processor that 1) people know how to use, if you don't want to embed a rich-text editor and 2) is unlikely to end up in the strings by accident. It's essentially just a humane text format. I discuss some of the problems around sanitization and stripping that would arise if it were actual HTML in the linked thread, but the more important point is that HTML is only one of the possible output formats for citeproc-js. Using arbitrary HTML5 doesn't help you generate RTF — then you just have HTML markup in your Word document.

jgm commented 4 years ago

On reflection, I'd actually be unhappy if full HTML5 were allowed, because that would make conversions to and from other formats even more difficult. But if you really support all the things I mentioned before, you've got a pretty good subset of HTML5. And I do think it's a bit weird to have a SGML-ish format that's not uniformly SGML-ish (e.g. unescaped < is allowed and <a> is just literal text).

Note that one tried-and-true approach to sanitization is to run a well-tested sanitizer over any final HTML produced by the whole system, rather than trying to make things impossible at the beginning.

bdarcus commented 4 years ago

@dstillman - do you have, or could you get, data on what tags Zotero users are using?

I have to believe that, beyond math, it's a small number.

dstillman commented 4 years ago

@bdarcus: What do you mean?

bdarcus commented 4 years ago

Also curious about what fields, beyond titles.

bdarcus commented 4 years ago

@bdarcus: What do you mean?

Sorry; I left out a word, which I just added; I meant Zotero users.

retorquere commented 4 years ago

On math: this is a great example of what I'm talking about. There are several different ways to support math in documents -- TeX math and MathML being the two most common. (Just using unicode characters is not possible; you need something to indicate the complex structures of things like matrices, fractions, and limits.)

The document that describes UnicodeMath seems to claim that it does have the required expressiveness to do this, but also states:

For interchange of math expressions between arbitrary math-aware programs, MathML and other higher-level languages are preferred. At the present time, conversion between UnicodeMath and other math formats is only implemented in Microsoft applications, although UnicodeMath isn’t proprietary

so that seems like a non-starter.

dstillman commented 4 years ago

@bdarcus: Sorry, I'm still not sure what the question is. People presumably use the tags that citeproc-js supports. Unless there's some thought of removing support for some of those, I'm not sure why the relative prevalence would be relevant. I think citeproc-js only processes those in titles.

dstillman commented 4 years ago

And I do think it's a bit weird to have a SGML-ish format that's not uniformly SGML-ish (e.g. unescaped < is allowed and <a> is just literal text).

Worth considering the context here: this format was added to citeproc-js as a way to allow users to mark up formatting in the plain-text title field in Zotero (and later other applications). The SGML-ish behaviors don't work in that context — e.g., if a user types a <, that can't cause a parsing failure in the processor, for obvious reasons.

CSL could require real HTML, or some other format, but that would just move the conversion step to every calling application. E.g., if a title came in from IEEE with an <i> in the title, Zotero would need to do the exact same parsing step that citeproc-js is doing now to parse out tags that should be supported and convert everything else to HTML, including encoding angled brackets and other special characters. And then you need to embed a full parser/sanitizer in the processor that you didn't need before, with no real benefit, because the processor still only supports specific tags, since it may not even be outputting HTML. It's easier to just treat the input as plain text with some easy-to-parse tags that happen to look like HTML.
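As a sketch of the "plain text with some easy-to-parse tags" approach described here, the snippet below recognizes an assumed tag set (`i`, `b`, `sup`, `sub`), treats everything else as literal text, and renders the same tokens to HTML or RTF. It is not citeproc-js's actual parser; `KNOWN_TAGS`, `tokenize`, `to_html`, and `to_rtf` are illustrative names, and the RTF side ignores non-ASCII characters and unbalanced tags.

```python
import html
import re

# Tags assumed to be recognized, for illustration only.
KNOWN_TAGS = {"i", "b", "sup", "sub"}
TAG_RE = re.compile(r"</?([a-zA-Z]+)>")
RTF_OPEN = {"i": "{\\i ", "b": "{\\b ", "sup": "{\\super ", "sub": "{\\sub "}


def tokenize(text):
    """Split text into ("tag", "<i>") and ("text", "...") tokens."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.group(1).lower() not in KNOWN_TAGS:
            continue  # unknown tag: it stays part of the surrounding text
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("tag", match.group(0).lower()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def to_html(text):
    """Keep known tags, HTML-escape everything else (including stray '<')."""
    return "".join(
        value if kind == "tag" else html.escape(value, quote=False)
        for kind, value in tokenize(text)
    )


def to_rtf(text):
    """Map known tags to RTF groups and escape RTF special characters."""
    out = []
    for kind, value in tokenize(text):
        if kind == "tag":
            out.append("}" if value.startswith("</") else RTF_OPEN[value.strip("<>")])
        else:
            out.append(value.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}"))
    return "".join(out)


print(to_html("Effects of <i>p</i> < .05 on <a>links</a>"))
print(to_rtf("<i>Homo sapiens</i> {sic}"))
```

A bare `<` or an unsupported tag such as `<a>` cannot cause a parse failure here; it is simply carried along as text and escaped for whichever output format is requested.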

(You could perhaps make an argument that the fact that citeproc-js generates RTF is citeproc-js's problem, and processors that only care about HTML output should be able to accept arbitrary HTML formatting as input and pass it through with appropriate sanitization, but given that RTF is likely by far the most common output mode for CSL, it seems like CSL's job to support the necessary formatting explicitly.)

bwiernik commented 4 years ago

@bdarcus Yes, beyond these formatting tags, I wouldn't think that much of any markup would currently be in Zotero fields because those would just be rendered as raw characters.

bdarcus commented 4 years ago

It's easier to just treat the input as plain text with some easy-to-parse tags that happen to look like HTML.

We say nothing about this in the spec, which is why this long thread. Effectively, from the spec perspective, everything is plain text.

So part of this discussion is saying let's fix that, partly because both the pandoc CSL YAML and the CSL JSON already have support for it.

If we do that, how would we explain the above part I highlighted?

Wouldn't it be better, from a spec perspective, to say CSL supports x, y, z sub-field formatting, that in the JSON should be well-formed HTML tags (so <i>this</i> is good, but <i>this<i> isn't)?

Because if we don't use that last phrase, then we have to formally specify what more lax syntax is appropriate?
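For what "well-formed" could mean in practice, here is a small sketch that checks nesting for an assumed tag set and ignores anything else. `FORMAT_TAGS` and `is_well_formed` are hypothetical names; a spec would still need to say what a processor does with input that fails the check.

```python
import re

FORMAT_TAGS = {"i", "b", "sup", "sub"}  # assumed set, for illustration
TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)>")


def is_well_formed(text):
    """Return True if every recognized tag is closed in the right order."""
    stack = []
    for match in TAG_RE.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name not in FORMAT_TAGS:
            continue  # unrecognized tags are treated as plain text
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # close without a matching open, or wrong order
    return not stack  # leftover opens mean something was never closed


print(is_well_formed("<i>this</i>"))
print(is_well_formed("<i>this<i>"))
```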

And the i/b/sup/sub cases are trivial, but what happens with math?

(You could perhaps make an argument that the fact that citeproc-js generates RTF is citeproc-js's problem, and processors that only care about HTML output should be able to accept arbitrary HTML formatting as input and pass it through with appropriate sanitization, but given that RTF is likely by far the most common output mode for CSL, it seems like CSL's job to support the necessary formatting explicitly.)

From the very beginning, I designed CSL to be agnostic about output format. The first prototype I wrote, in fact, had different input and output drivers for different formats.

So definitely; it's a fundamental requirement of CSL.

But that's the challenge here.

bdarcus commented 4 years ago

@jgm I know there's been talk about adding sub/superscript formatting to commonmark, perhaps as an extension, and likely based on the current pandoc syntax. Has there been any movement on that?

With that syntax support, we have parallel support for the core features we need for this, in both markdown and html.

Seems only other thing we need beyond that is math, which pandoc also supports (using embedded LaTeX), and which HTML supports via MathML (or it seems, by the same syntax as pandoc in MathJax, which uses MathML internally, and so can also convert to MathML?).

MathJax demo.

Doesn't this give us all we need to put something pretty simple and sensible together for the spec, that addresses all of these concerns?

Another Proposal

Sub-Field Formatting

CSL processors [should|must] support sub-field formatting on the following variables:

CSL processors [should|must] support the following formatting features on those variables:

In the JSON input format, these [should|must] be represented as well-formed HTML tags, with i, b, sub/sup, and code respectively, and "preserve case" represented as a span tag with a nocase class. Quotes can either be represented as plain strings, or using the q tag.

When formatting output, CSL processors [should|must] [insert flip-flopping rule as defined for quotes here to also cover italic, and maybe bold].


I think the above is sufficiently flexible, even if we adopted "must" language, but would move the ball forward in v1.1.
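The flip-flopping placeholder in the proposal is left open, so the following is only one possible reading for italics: an `<i>` that appears where the text is already italic is rendered as upright text instead. The function name, the `in_italic` parameter, and the choice of a `span` with an inline style are all assumptions, and the sketch expects well-formed input.

```python
import re

ITALIC_TAG_RE = re.compile(r"(</?i>)")


def flip_flop_italics(text, in_italic=False):
    """Render nested <i> so that italic inside italic becomes upright.

    `in_italic` says whether the surrounding output is already italic,
    e.g. a title that the style itself italicizes. Assumes balanced tags.
    """
    out = []
    stack = []  # (closing markup, italic state to restore) per open <i>
    italic = in_italic
    for part in ITALIC_TAG_RE.split(text):
        if part == "<i>":
            if italic:
                out.append('<span style="font-style:normal;">')
                stack.append(("</span>", italic))
            else:
                out.append("<i>")
                stack.append(("</i>", italic))
            italic = not italic
        elif part == "</i>":
            closer, italic = stack.pop()
            out.append(closer)
        else:
            out.append(part)
    return "".join(out)


print(flip_flop_italics("<i>Review of <i>Hamlet</i></i>"))
print(flip_flop_italics("A study of <i>Hamlet</i>", in_italic=True))
```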

dstillman commented 4 years ago

And the i/b/sup/sub cases are trivial, but what happens with math?

I don't think it makes sense for the citation processor to be handling math at all.

For example, citeproc-js outputs HTML and RTF, but as far as I know there's no way to fully render math in RTF. Zotero takes the HTML or RTF output from citeproc-js and embeds it in Word, LibreOffice, or Google Docs, but it would likely need to handle math in different ways in those different programs. I would say the citation processor should ignore any math in the string — as it would anything else it didn't recognize — and it'd be up to the calling application to post-process the citation output and do what it wanted with the math.

So, e.g., Zotero might use MathJax in a rich-text editor, embed <math> tags with MathML in the title, possibly substitute numbered placeholders for the <math> tags to avoid any unexpected HTML/RTF processing, and then post-process the placeholders in the processor output with appropriate math handling when embedding the output in the current word processor.

dstillman commented 4 years ago

Put another way, even though there's a desire for CSL-JSON input to convey all possible semantic meaning in an application-agnostic way, it's just not possible when it comes to math, because the output format and/or word processor may not support it or may have specific handling requirements. Embedding MathML would presumably be the most consistent way of representing math in the HTML-ish input, but what do you gain if the most common output format for CSL doesn't even support it? If you export RTF, what would even happen to that input?

And even if you ignore that problem and focus on exporting to a format like HTML that could handle math, the calling application quite possibly needs a math processor anyway to convert/render user input and generate MathML for the processor, so what's the point of bundling an extra copy of (say) MathJax in the processor when it already exists in the application?

bdarcus commented 4 years ago

Put another way, even though there's a desire for CSL-JSON input to convey all possible semantic meaning in an application-agnostic way, it's just not possible when it comes to math

If that's the case, would definitely make it easier for us!

I removed math from my suggestion just above then, just to keep this moving.

bwiernik commented 4 years ago

To your proposal, I’d also add smallcaps (already supported with span syntax in citeproc-js and both span and <sc> syntax in pandoc; I’d add support for <smallcaps>) and strikethrough.

bdarcus commented 4 years ago

I've added a new linked issue to the documentation repo, specific to the sub-field formatting discussion that has mostly been the focus of this thread.

Hoping to do a PR later today so we can get more concrete. We've discussed all the issues, and now know enough to be specific, I think.

For this issue, whether to add a YAML representation that validates against the JSON schema, I think we should keep this open, and I think we should see if we can get it to work to everyone's satisfaction.

I expect if we do this, it will result in one or more PRs on the JSON schema (say to @jgm's point on date representation), and possibly one on the documentation repo (simply to mention the YAML format, and that one can validate it against the JSON schema).

I've updated the top post to reflect what I think are next steps.

jgm commented 4 years ago

@bdarcus -- So I guess this means limiting abstracts to one paragraph without any block-level formatting (no tables, lists, figures, etc.). This seems reasonable but I'm not up on abstract customs in different fields. You're also excluding hyperlinks, which would not be normal in a title but might appear in an abstract.

Math is tricky. I see the point of passing it through directly to the output format. However, if you're working with MathJax you often need to escape < as &lt; (to give just one example, see http://docs.mathjax.org/en/latest/input/tex/html.html). Currently this would be completely garbled, since CSL doesn't recognize entities as such. So if you did the right thing and wrote x&lt;y, then CSL would garble it, but if you wrote x<y then it wouldn't work properly in your HTML output because <y would be interpreted by the browser as a tag.

One approach would be to have a <math format="...">..</math> tag, where everything inside is passed through verbatim. You could then specify the format as 'TeX' or 'HTML-escaped TeX' or 'MathML' or whatever, and the output processor would have to check this and deal with it appropriately.

Another approach would be to just insist on MathML. Note that MathML can include an annotation element into which one could put a plain-text fallback for RTF or whatever.
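A small sketch of the annotation fallback idea, using only Python's standard XML parser. `plain_text_fallback` and `SAMPLE` are illustrative names; the sample markup and its `encoding` value are made up for the example, and a real processor would likely want to check the annotation's encoding rather than take the first one.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def plain_text_fallback(mathml):
    """Return the text of the first non-empty <annotation>, or None."""
    root = ET.fromstring(mathml)
    for node in root.iter(f"{{{MATHML_NS}}}annotation"):
        if node.text and node.text.strip():
            return node.text.strip()
    return None


# Hypothetical sample: presentation markup plus a plain-text annotation.
SAMPLE = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><semantics>'
    "<mrow><mi>x</mi><mo>&lt;</mo><mi>y</mi></mrow>"
    '<annotation encoding="text/plain">x &lt; y</annotation>'
    "</semantics></math>"
)

print(plain_text_fallback(SAMPLE))
```

An output driver that cannot render math (RTF, say) could then emit the returned string, suitably escaped for that format, instead of the MathML itself.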

bwiernik commented 4 years ago

Thinking about Zotero–pandoc compatibility as the major concern I have, that usually happens via interface with BBT. If there is a defined set of HTML-like tags that Zotero supports, I think BBT or similar export translators could convert those tags to Markdown syntax fairly easily.

For math, I don’t see a word processor plugin directly supporting math input. But, if a user stored it in TeX, they could convert that to a Word OMML equation as a simple post-processing step, and it would work out of the box with pandoc.

bdarcus commented 4 years ago

@bdarcus -- So I guess this means limiting abstracts to one paragraph without any block-level formatting (no tables, lists, figures, etc.). This seems reasonable but I'm not up on abstract customs in different fields. You're also excluding hyperlinks, which would not be normal in a title but might appear in an abstract.

I have no firm position against these. I just wanted to keep this moving forward, and wasn't really focused on those cases because they're not the primary requirement for manuscript preparation.

But certainly we should consider them.

bwiernik commented 4 years ago

The two fields where this might come into play are abstract and note (e.g., used in annotated bibliographies). I could definitely see line breaks in both of those. Currently citeproc-js just disregards line breaks and renders without any white space at all.

Abstracts often have subheadings—those are usually set with bold, rather than heading markers.

Abstracts probably won’t have lists or tables, but note might. Formally supporting that might be out of scope? But could be nice. I don’t know if rich text supports these (same with links?), so that might be a thing left to individual applications/processors to decide.

dstillman commented 4 years ago

possibly substitute numbered placeholders for the tags to avoid any unexpected HTML/RTF processing

This thing I said above didn't really make sense — it would work for a one-off bibliography but not if we were embedding CSL-JSON in a document for future processing.

One approach would be to have a <math format="...">..</math> tag, where everything inside is passed through verbatim. You could then specify the format as 'TeX' or 'HTML-escaped TeX' or 'MathML' or whatever, and the output processor would have to check this and deal with it appropriately.

I don't think passing anything through verbatim (as in, not processed according to the output format) actually works — depending on the output format, it could very well mean invalid/unescaped markup, and if the calling application doesn't know about it and deal with it appropriately, it's potentially a security flaw.

Another approach would be to just insist on MathML. Note that MathML can include an annotation element into which one could put a plain-text fallback for RTF or whatever.

I don't think processing MathML should be the citation processor's job, for the reasons I give above: the output format abilities may not be the same as the target application abilities, it would require a duplicate bundled math processor, and it just seems generally unreasonable to ask of a citation processor.

But a version of this might work:

1) Require MathML, and expect the citation processor to support an API for math handling. There's no reason citeproc-js needs a copy of MathJax; it just needs Zotero to provide a function that runs the MathML through its own copy of MathJax and returns the necessary output for the format and word processor being used. If a math processor isn't provided, the citation processor could use the annotation field if present or embed/throw an error if not.

2) Support some generic mechanism for embedding typed content that needs to be handled by the calling application. This could potentially even be used for abstracts: instead of adding support for more rich-text formatting to CSL, the processor could pass type="html", data="<marquee>What a paper!</marquee>", output="RTF" to a function provided by the application that returned an appropriate string to insert into the output (which now could even be a placeholder for post-processing by the application). The same could be used for passing math, e.g., type="mathml", data="<math>…</math>", output="RTF", and it would be up to the calling application what to do if the output format and/or target application couldn't handle math. If an appropriate handler wasn't available, it could be handled as regular text (e.g., HTML-encoded) or an error could be thrown/embedded by the processor. This would keep the citation processor from needing to bundle huge processors that the calling application likely already has (a math processor, a HTML parser/sanitizer, etc.).
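A minimal sketch of what such a mechanism might look like, under the assumption that the application registers one handler per content type. `render_embedded`, `math_handler`, and the `strict` flag are invented for illustration and do not correspond to any existing citeproc-js API.

```python
import html


def render_embedded(content_type, data, output, handlers, strict=False):
    """Delegate typed content to an application-provided handler.

    `handlers` maps a content type (e.g. "mathml", "html") to a function
    taking (data, output) and returning the string to insert.
    """
    handler = handlers.get(content_type)
    if handler is not None:
        return handler(data, output)
    if strict:
        # One of the two fallbacks mentioned: raise an error.
        raise LookupError(f"no handler registered for {content_type!r}")
    # The other fallback: treat the content as regular, HTML-encoded text.
    return html.escape(data)


def math_handler(data, output):
    """Stub application handler; a real one might call a math processor."""
    if output == "rtf":
        return "[math]"  # placeholder the application could post-process
    return data


handlers = {"mathml": math_handler}
print(render_embedded("mathml", "<math><mi>x</mi></math>", "rtf", handlers))
print(render_embedded("html", "<marquee>What a paper!</marquee>", "rtf", handlers))
```

The processor itself stays small: it only knows how to look up a handler and what to do when none is registered.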

bwiernik commented 4 years ago

For the case of Zotero's Word integration, would either of those solutions enable, for example, the title and abstract of this item to appear in Word as a math environment equation?

dstillman commented 4 years ago

I actually kind of doubt it — I suspect you can't embed an equation element in the text of a Word field, which is what we would need to do. So while I don't know for sure, realistically output="RTF" probably means trying to convert MathML to UnicodeMath or AsciiMath. Still, that seems like more of a problem for a calling application that wants to deal with it — and which might need to do it in other contexts as well — than for a citation processor.

bwiernik commented 4 years ago

It seems like math, tables, lists might be beyond the scope of CSL; these might be things that we recommend applications support (e.g., math everywhere, tables and lists in abstract and note), but that is really up to the application to define?

bwiernik commented 4 years ago

@dstillman UnicodeMath would be a good compromise to be able to convert unlinked citations/bibliographies to equations with one click or a macro

bdarcus commented 4 years ago

Date issue now hopefully solved with the EDTF addition.

Do we want to add the optional property for the markup that @bwiernik suggested?

If yes, suggested values?

The above make sense because they have citation support in their ecosystems (and org will be getting native citation support soon).

Not sure if any others would apply?

LaTeX, but that seems a PITA to support, and superfluous given bibtex/biblatex?

bwiernik commented 4 years ago

Could we just leave that an open field and leave it up to processors to designate the markup they support?

bdarcus commented 4 years ago

Not sure, but I suppose.

bdarcus commented 4 years ago

@larsgw suggested in this comment that we consider having two input schemas: one for humans (yaml + edtf), and the other for machines (json + structured data object).

I wasn't sure how easy or possible this was in json schema, ~but the below appears (though I am not 100% certain) to work.~

Edit: no, it's not possible, it seems. In that case, we should probably just continue as planned.

bdarcus commented 4 years ago

Also, @jgm, am I correct that your current date model supports ranges? If yes, how do you define an open-ended range?

jgm commented 4 years ago

It seems that this works with pandoc-citeproc to specify an open range:

  issued:
  - year: 2042
  - {}

But I wouldn't worry too much about my data model, since I'm planning to transition eventually to the new citeproc library I'm writing. It already passes more citeproc tests than pandoc-citeproc, and it's much faster and more maintainable. It uses the date-parts model that is part of current CSL.

denismaier commented 4 years ago

Wow, that was fast. Do you already have more concrete plans when we can expect the new library?

bdarcus commented 4 years ago

It seems that this works with pandoc-citeproc to specify an open range:

  issued:
  - year: 2042
  - {}

Here's what I have in #301 @jgm:

issued:
- date-parts:
  - 2000
- {}

So it merges your model and the 1.0 JSON model to match the EDTF model (which is a date, and in levels 0 and 1, a date range, which is a two-item list of dates).

I believe date parts is better as an object (as you have), but I guess for compatibility we should keep the array. Anyone want to make the argument we should change this too? If yes, please state your case on #301. If not, we'll keep as is.

The human-readable preference, of course, would be the preferred EDTF string:

issued: 2000/..
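To show how the string form relates to the structured form quoted above, here is a rough sketch that converts a very small EDTF subset into the two-item list shape. `edtf_interval_to_dates` is a hypothetical helper, not part of the schema or of any processor, and it does not handle uncertainty markers, negative years, or an unknown (blank) endpoint.

```python
def edtf_interval_to_dates(value):
    """Convert 'YYYY[-MM[-DD]]' or 'start/end' into a list of date objects.

    An endpoint written as '..' (open) becomes an empty mapping, mirroring
    the `{}` entry in the structured example in this thread.
    """
    def one(part):
        if part == "..":
            return {}
        return {"date-parts": [int(piece) for piece in part.split("-")]}

    return [one(part) for part in value.split("/")]


print(edtf_interval_to_dates("2000/.."))
print(edtf_interval_to_dates("2000-05/2001-02-03"))
```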

bdarcus commented 2 years ago

Now close to two years later, I merged today #420, with examples of validating completion against the current v1.1 branch version of the schema (that allows EDTF for dates). It actually works pretty well for humans and machines, I'd say.

Much of this long thread contains very useful thoughts on a more narrow aspect of this; the question of the markup, etc. within the fields. #315 was an experiment for that, though I have no idea if the idea is any good.