citation-js / bibtex-parser-experiments

Experiments to determine a new BibTeX parser formula for Citation.js -- to be applied to other formats as well
https://travis-ci.com/citation-js/bibtex-parser-experiments/builds
MIT License

closedown biblatex-csl-converter #7

Open johanneswilm opened 4 years ago

johanneswilm commented 4 years ago

Hey, I just discovered this chart. I have been participating in the maintenance of biblatex-csl-converter over the past few years. Based on your chart it looks like Idea (reworked) gives the same output quality as biblatex-csl-converter. Does that mean that it can be used as a drop in replacement and that it covers all the same features? If that is the case, is there any reason why I would continue to maintain biblatex-csl-converter?

larsgw commented 4 years ago

Does that mean that it can be used as a drop in replacement and that it covers all the same features?

Probably not, the Syntax column is a big simplification. The whole chart is meant as a way to compare different parsers to replace the current one, and so they are only compared on features the current one had or that I wanted for the new one. A number of differences, in terms of features, in idea-reworked, compared to biblatex-csl-converter:

So it definitely isn't a drop in replacement, as the API is quite different, and depending on your needs it may not be possible at all to switch.

johanneswilm commented 4 years ago

Ok, I understand. So "complete" doesn't mean "feature complete" but rather "completely covers what the other parser did"? Maybe that could be added somewhere, as otherwise it looks a bit misleading, and users who may be better off using one of the other parsers are led to believe that they shouldn't. I'd prefer not to have to set up a different chart making counter claims, etc. Speed isn't much of a concern for Fidus Writer's use case of biblatex-csl-converter, as it's totally fine to wait 250 ms for a single citation to be converted, and even up to several minutes if a user uploads their entire mega collection, as processing will happen entirely on that user's machine.

Accuracy is more important, and so is keeping maintenance costs down. So if there is another parser that can do exactly the same but is maintained by someone else, I'd like to shut down biblatex-csl-converter. And if there isn't one, then I'd like everyone else out there who needs the same functionality to contribute to biblatex-csl-converter so that we don't have to do all the maintenance by ourselves. That's why it would be nice to make sure people aren't misled by that chart somehow.

And yes, please once you think your parser or one of the other ones covers all the features, let me know and I can see whether it still makes sense to put an AST converter on top and drop biblatex-csl-converter altogether.

larsgw commented 4 years ago

"complete" means nothing more and nothing less than that it parses syntax.bib accurately, which encompasses all the syntax I had in mind for the new parser (apart from syntax within values).

Maybe that could be added somewhere, as otherwise it looks a bit misleading, and users who may be better off using one of the other parsers are led to believe that they shouldn't. I'd prefer not to have to set up a different chart making counter claims, etc.

That's fair, I just didn't really intend this repository for other users to make choices with. What's missing from the description is "the new BibTeX parser formula for Citation.js". And the comparisons were either because I wanted to see if my new parser was up to the task, or because someone asked me to add it to the comparison. But I definitely see where you're coming from, and you're not the only one, so I'll change it up and also add more detailed comparisons.

I can see whether it still makes sense to put an AST converter on top

I'm not really sure what you mean by this. How is an AST converter "on top", and if you'd be dropping biblatex-csl-converter, what would it be on top of?

johanneswilm commented 4 years ago

But I definitely see where you're coming from, and you're not the only one, so I'll change it up and also add more detailed comparisons.

Thank you very much for that. And yes, just a little bit of wording so that others understand what the purpose of the chart is and that it's not a full feature comparison of everything is all that I'm asking for. The comparison is still quite interesting.

I'm not really sure what you mean by this.

Sorry, let me reword. Currently biblatex-csl-converter outputs exactly the JavaScript object format we use internally in Fidus Writer. So if we switch to something else, then we'll probably need that parser plus a converter from the output of that parser to the format we use internally in Fidus Writer. So there would be a bit of development cost in creating this converter. That's all I was trying to say.

retorquere commented 4 years ago

I don't mean to pile on just to be antagonistic, but idea-reworked parses syntax.bib (which is invalid BTW -- biblatex chokes on it) into

[
  {
    type: 'book',
    label: 'sweig42',
    properties: {
      author: "Stefan Swe{\\i}g and Xavier D\\'ecoret",
      title: ' The {impossible} ℡—book ',
      publisher: ' D\\"ead Poₑeet Society',
      year: 1942,
      month: '03'
    }
  }
]

I don't know if I'm calling it wrong:

const parser = require('./lib/idea-reworked')
const fs = require('fs')
console.log(parser.parse(fs.readFileSync('test/files/syntax.bib', 'utf-8')))

but it doesn't seem to do diacritics replacement, anything with braces, and for the subscript interpretation it just picks up the first character. Also, biblatex ignores leading and trailing spaces so title and publisher should have been trimmed. And TEL is superscript?

retorquere commented 4 years ago

Wait, I got that wrong -- syntax.bib has double backslashes in the text, so it's not supposed to do diacritics conversions as there are none. Anyhow, that still leaves braces, subscript and superscript, and trimming.
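The trimming point could be handled as a small post-processing step on the parsed entry. This is only a sketch: `trimProperties` is a hypothetical helper, not part of idea-reworked, and it assumes the `{ type, label, properties }` shape shown in the output above.

```javascript
// Hypothetical post-processing step: strip leading and trailing whitespace
// from every string-valued property, leaving other values (e.g. numbers) alone.
function trimProperties (entry) {
  const properties = {}
  for (const [name, value] of Object.entries(entry.properties)) {
    properties[name] = typeof value === 'string' ? value.trim() : value
  }
  return { ...entry, properties }
}

const trimmed = trimProperties({
  type: 'book',
  label: 'sweig42',
  properties: {
    title: ' The {impossible} ℡—book ',
    publisher: ' D\\"ead Poₑeet Society',
    year: 1942
  }
})
console.log(trimmed.properties.title)
```

Whether inner runs of whitespace should also be collapsed is left open here; the sketch only covers the leading and trailing spaces mentioned above.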

larsgw commented 4 years ago

which is invalid BTW -- biblatex chokes on it

natbib should not, at least the last time I checked.

retorquere commented 4 years ago

which is invalid BTW -- biblatex chokes on it

natbib should not, at least the last time I checked.

Fair enough, it does.

* For superscript and subscript, I implemented it like that specifically but I don't know why. I'm converting them to Unicode characters which has limited support,

But that doesn't apply here -- a unicode subscript e does (clearly) exist, the parser just doesn't convert the other two es.

but I think CSL supports <sup> and <sub> markup.

It does. My parser converts to unicode sub/superscript where possible and uses <sup> and <sub> where that's not possible.

* `TEL` gets converted to the corresponding Unicode character in Zotero, which is where I got a lot of stuff from in the first version, and I kept it that way.

I don't really follow -- in syntax.bib I see TEL as \u54\u45\u4C; after conversion it shows up as \u2121. The TEL in the input isn't a single character, it's a word, and title casing by a CSL style is going to affect it differently.

larsgw commented 4 years ago

• I found just transforming the first character (if it's supported) more consistent than to create a string with part sub/superscript and part normal text
• Regarding TEL: that's the point (well, not the title casing) https://github.com/zotero/translators/blob/bae2057067e2fde076252a3b897a7e689a173c71/BibTeX.js#L1707

retorquere commented 4 years ago

• I found just transforming the first character (if it's supported) more consistent than to create a string with part sub/superscript and part normal text

$_{eee}$ should become either ₑₑₑ or <sub>eee</sub>, not ₑee. The braces mean that the entire string is subscript.
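One way to get the all-or-nothing behaviour described here is to convert the whole group at once and fall back to markup when any character lacks a Unicode subscript. The sketch below is an illustration only: `SUBSCRIPTS`, `convertSubscript` and `replaceSubscripts` are hypothetical names, the table is a small excerpt, and the regular expression only covers the simple `$_{...}$` and `$_x$` forms.

```javascript
// Hypothetical excerpt of a character -> Unicode subscript table.
const SUBSCRIPTS = new Map([
  ['a', 'ₐ'], ['e', 'ₑ'], ['o', 'ₒ'], ['x', 'ₓ'],
  ['0', '₀'], ['1', '₁'], ['2', '₂']
])

// Convert the content of one subscript group as a whole: either every
// character becomes a Unicode subscript, or the group is wrapped in <sub>.
function convertSubscript (text) {
  const chars = Array.from(text)
  if (chars.length > 0 && chars.every(char => SUBSCRIPTS.has(char))) {
    return chars.map(char => SUBSCRIPTS.get(char)).join('')
  }
  return '<sub>' + text + '</sub>'
}

// Replace simple math-mode subscripts such as $_{eee}$ or $_2$ in a value.
function replaceSubscripts (value) {
  return value.replace(
    /\$_(?:\{([^{}]*)\}|([^{}$]))\$/g,
    (match, group, single) => convertSubscript(group !== undefined ? group : single)
  )
}

console.log(replaceSubscripts('Po$_{eee}$t'))
console.log(replaceSubscripts('x$_{max}$'))
```

With this approach a value never ends up with part of a group in Unicode subscript and the rest in normal text.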

• Regarding TEL: that's the point (well, not the title casing) https://github.com/zotero/translators/blob/bae2057067e2fde076252a3b897a7e689a173c71/BibTeX.js#L1707

That table is a lossy mapping from unicode to ASCII TeX, you can't always revert this table for TeX to unicode mapping -- TEL being one such instance that should not be reversed. If the unicode char maps to a string that does not contain TeX-reserved characters, you generally do not want to use it as a reverse mapping.

retorquere commented 4 years ago

That table is a lossy mapping from unicode to ASCII TeX, you can't always revert this table for TeX to unicode mapping

Case in point: the reverse table is held separately here, and I would argue that the reverse mapping of {TEL} is a poor choice -- {TEL} means "the phrase TEL, not to be messed with in sentence casing". It does not mean "Telephone Sign" (which is the name of \u2121 in the unicode table).
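The rule of thumb about reverse mappings could be sketched roughly as follows. The table entries are illustrative and not copied from the Zotero translator, `buildReverseTable` is a hypothetical helper, and the set of characters treated as TeX-reserved is an assumption.

```javascript
// Illustrative excerpt of a Unicode -> TeX export table (entries are assumptions).
const unicodeToTex = {
  '\u2121': 'TEL', // TELEPHONE SIGN exported as a plain word
  '\u00E9': "\\'e", // e with acute accent
  '\u03B1': '$\\alpha$' // Greek small letter alpha
}

// Characters assumed to be reserved in TeX: \ { } $ & # ^ _ % ~
const TEX_RESERVED = /[\\{}$&#^_%~]/

// Invert the table, but skip entries whose TeX side is ordinary text:
// those would turn normal words in the input into unrelated symbols.
function buildReverseTable (table) {
  const reverse = {}
  for (const [unicode, tex] of Object.entries(table)) {
    if (TEX_RESERVED.test(tex)) {
      reverse[tex] = unicode
    }
  }
  return reverse
}

const texToUnicode = buildReverseTable(unicodeToTex)
console.log(Object.keys(texToUnicode))
```

In this sketch the plain word TEL stays a word on import, while the accent and math commands are still mapped back to their characters.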

johanneswilm commented 4 years ago

Interesting conversation you guys are having here.

but I think CSL supports <sup> and <sub> markup.

Does that mean this parser does not support the other HTML tags either? biblatex-csl-converter currently supports these in CSL export:

const TAGS = {
    'strong': {open:'<b>', close: '</b>'},
    'em': {open:'<i>', close: '</i>'},
    'sub': {open:'<sub>', close: '</sub>'},
    'sup': {open:'<sup>', close: '</sup>'},
    'smallcaps': {open:'<span style="font-variant:small-caps;">', close: '</span>'},
    'nocase': {open:'<span class="nocase">', close: '</span>'},
    'enquote': {open:'“', close: '”'},
    'url': {open:'', close: ''},
    'undefined': {open:'[', close: ']'}
}

retorquere commented 4 years ago

citeproc supports these; enquote and the later entries in your table aren't markup, so CSL won't mind. I can't find what CSL formally supports, but everything that uses citeproc in its various incarnations will support the markup listed under that link.

johanneswilm commented 4 years ago

enquote and later in your table isn't markup so CSL

Right, because as far as I know, citeproc-js doesn't have any corresponding tag for these. All the other ones are in that list you are linking to.

retorquere commented 4 years ago

Correct.

johanneswilm commented 4 years ago

@retorquere Ah, now I understand your reply. My first comment on this here was not formulated very well. I updated it now. I wasn't asking whether citeproc supports it (I know it does), I was wondering about this parser.

larsgw commented 4 years ago

Does that mean this parser does not support the other html tags either?

It does, but not all the commands it seems (code):

const richTextMappings = {
  textit: 'i',
  textbf: 'b',
  textsc: 'sc',
  textsuperscript: 'sup',
  textsubscript: 'sub'
}

retorquere commented 4 years ago

That misses at least mkbibbold, bf and bfseries for bold, sl, em, it, itshape, mkbibitalic, mkbibemph, emph for italics, sc and scshape for smallcaps, and citeproc doesn't support <sc>, just <span style="font-variant: small-caps;">
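A mapping that covers the commands listed above might look something like this. It is a sketch, not the code of either parser: `commandFormats`, `formatTags` and `wrap` are hypothetical names, and it glosses over the difference between commands that take an argument (`\textbf{...}`) and declarations (`\bf`).

```javascript
// Hypothetical table: LaTeX command name -> abstract format.
const commandFormats = new Map(Object.entries({
  // bold
  textbf: 'bold', mkbibbold: 'bold', bf: 'bold', bfseries: 'bold',
  // italics
  textit: 'italic', sl: 'italic', em: 'italic', it: 'italic', itshape: 'italic',
  mkbibitalic: 'italic', mkbibemph: 'italic', emph: 'italic',
  // small caps
  textsc: 'smallcaps', sc: 'smallcaps', scshape: 'smallcaps',
  // superscript and subscript
  textsuperscript: 'sup', textsubscript: 'sub'
}))

// Abstract format -> markup; small caps use the span form rather than <sc>.
const formatTags = {
  bold: ['<b>', '</b>'],
  italic: ['<i>', '</i>'],
  smallcaps: ['<span style="font-variant: small-caps;">', '</span>'],
  sup: ['<sup>', '</sup>'],
  sub: ['<sub>', '</sub>']
}

// Wrap text in the markup for a command; unknown commands leave it unchanged.
function wrap (command, text) {
  const format = commandFormats.get(command)
  if (format === undefined) return text
  const [open, close] = formatTags[format]
  return open + text + close
}

console.log(wrap('mkbibbold', 'bold text'))
console.log(wrap('scshape', 'Small Caps'))
```

Going through an abstract format name keeps the list of command aliases separate from the output markup, so the small-caps tag can be changed in one place.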

Parsing something like "{partially \bf bold} but not this" is interesting (in the apocryphal Chinese sense) in that \bf affects everything after it until the end of the current block, so here only the word "bold" should be bold. That sample is synthetic, just for illustration; in practice you'd see the much more sensible "partially {\bf bold} but not this", but the interesting aspect there is that the braces do not mean nocase. If a block has a command at the start, it is ignored for case protection by bib(la)tex.
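The scoping of \bf can be illustrated with a small stack-based pass over the value. This is a simplified sketch: `renderBold` is a hypothetical function, it only knows about \bf and braces, and it drops the braces without tracking case protection.

```javascript
// Render the declaration-style \bf: once switched on, bold stays active
// until the block (brace group) it was switched on in is closed.
function renderBold (input) {
  let output = ''
  // One flag per open block: was \bf switched on inside this block?
  const scopes = [false]
  let i = 0
  while (i < input.length) {
    const isBf = input.startsWith('\\bf', i) && !/[a-zA-Z]/.test(input[i + 3] || '')
    if (isBf) {
      if (!scopes[scopes.length - 1]) {
        output += '<b>'
        scopes[scopes.length - 1] = true
      }
      i += 3
      // Spaces directly after a control word are skipped.
      while (input[i] === ' ') i++
    } else if (input[i] === '{') {
      scopes.push(false)
      i++
    } else if (input[i] === '}') {
      // Closing a block also ends any bold that was started inside it.
      if (scopes.length > 1 && scopes.pop()) output += '</b>'
      i++
    } else {
      output += input[i]
      i++
    }
  }
  if (scopes[0]) output += '</b>'
  return output
}

console.log(renderBold('{partially \\bf bold} but not this'))
console.log(renderBold('partially {\\bf bold} but not this'))
```

Both inputs should print `partially <b>bold</b> but not this`, since in each case the block that contains \bf closes right after the word "bold".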

larsgw commented 4 years ago

Okay, that's some more things to add to the list. This does make me lean towards moving more parts of the parsing to earlier in the process.

retorquere commented 4 years ago

Okay, that's some more things to add to the list. This does make me lean towards moving more parts of the parsing to earlier in the process.

I don't see any other way this can be done. In a one-pass parser, it must be done during the parse, since you need the context to make these decisions. In a two-pass parser like mine, the decision can be postponed until the 2nd pass.

For {partially \bf bold} but not this, are the braces still a nocase, since the \bf is not at the start of the block?

Yes:

\documentclass{article}
\usepackage[american]{babel}
\usepackage[backend=biber, style=apa]{biblatex}
\DeclareLanguageMapping{american}{american-apa}
\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}

@article{03, author = "03", 
title =    "{\bf Next: Bold}",
}

@article{04, author = "04", 
title =    "{Next: \bf Bold}",
}

@article{05, author = "05", 
title =    "{Next: Bold}",
}

\end{filecontents}
\addbibresource{\jobname.bib}
\begin{document}
\nocite{*}
\printbibliography
\end{document}

gives

  1. (n.d.). NEXT: BOLD.
  2. (n.d.). Next: Bold.
  3. (n.d.). Next: Bold.
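The rule this test illustrates (a block that starts with a command is not treated as case protection, while a block with a command later on still is) could be expressed as a small predicate. `isCaseProtected` is a hypothetical helper, and it assumes the caller passes the text between one pair of braces, without the braces themselves.

```javascript
// A brace block protects case unless its content starts with a command.
function isCaseProtected (blockContent) {
  return !blockContent.startsWith('\\')
}

// The three titles from the test document above, without the outer braces.
console.log(isCaseProtected('\\bf Next: Bold')) // command at the start
console.log(isCaseProtected('Next: \\bf Bold')) // command later in the block
console.log(isCaseProtected('Next: Bold')) // no command at all
```

Only the first call should return false, which matches entry 03 being the one whose casing was changed in the output above.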

<sc> was mentioned in (although not part of) the old specification, I think that is where I got it. It seems to still be included in some test cases.

I think most will actually still support it, but it's out of spec (even if I think it looks better)

retorquere commented 4 years ago

I haven't used B-C-C in a while, but it always used to be noticeably faster than the BBT parser. I don't know why the latest tests don't bear this out.