fiduswriter / biblatex-csl-converter

A set of JavaScript converters: bib(la)tex => json, json => csl, and json => biblatex
GNU Lesser General Public License v3.0

The parser doesn't resolve LaTeX commands #3

Closed retorquere closed 7 years ago

retorquere commented 7 years ago

If I parse this:

@InCollection{Madelung_1998_LB_10681727_56,
Title = {Cuprous oxide ({Cu$_2$O}) crystal structure, lattice parameters},
Author = {Madelung, O. and others},
Booktitle = {{L}andolt-{B}\"ornstein},
Publisher = {Springer-Verlag},
Year = {1998},
Editor = {Madelung, O. and R\"ossler, U. and Schulz, M.},
Series = {SpringerMaterials - The Landolt-B\"ornstein Database},
Volume = {III/41c},
Doi = {10.1007/10681727_56},
File = {Madelung_1998_LB_10681727_56.pdf:CopperOxides\Madelung_1998_LB_10681727_56.pdf:PDF},
Owner = {Francesco},
Timestamp = {2010.02.22}
}

and put it through the CSL Exporter, I get this:

{ title: 'Cuprous oxide (<span class="nocase">Cu_2O</span>) crystal structure, lattice parameters',
     author: [ { family: 'Madelung', given: 'O.' }, { literal: 'others' } ],
     'container-title': '<span class="nocase">L</span>andolt-<span class="nocase">B</span>rornstein',
     publisher: [ 'Springer-Verlag' ],
     editor:
      [ { family: 'Madelung', given: 'O.' },
        { family: 'R\\"ossler', given: 'U.' },
        { family: 'Schulz', given: 'M.' } ],
     'collection-title': 'SpringerMaterials - The Landolt-B\\"ornstein Database',
     volume: 'III/41c',
     DOI: '10.1007/10681727_56',
     issued: { 'date-parts': [ 1998 ] },
     type: 'entry',
     id: '0' }

(note the \ commands, and Cu$_2$O should have become Cu2O)

johanneswilm commented 7 years ago

How much of the math mode would you think should be supported? I assume something like $a_1^2$ will already be too complicated to be displayed correctly using CSL output, right? So maybe just plain $^X$ and $_X$?
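The plain `$^X$` and `$_X$` cases could be handled with a single regular expression. The following is only a minimal sketch: `convertSimpleScripts` is a hypothetical helper, not part of the converter, and the `<sub>`/`<sup>` output markup is an assumption about what the CSL side should receive.

```javascript
// Hypothetical helper, not part of biblatex-csl-converter.
// Handles only $_X$, $^X$, $_{...}$ and $^{...}$; anything more complex is left alone.
function convertSimpleScripts(text) {
  const tags = { '_': 'sub', '^': 'sup' }
  // The body is either a braced group without nested braces, or a single character.
  const pattern = /\$([_^])(?:\{([^{}$]*)\}|([^{}$\\]))\$/g
  return text.replace(pattern, (match, op, braced, single) => {
    const tag = tags[op]
    const body = braced !== undefined ? braced : single
    return `<${tag}>${body}</${tag}>`
  })
}

console.log(convertSimpleScripts('Cu$_2$O'))   // Cu<sub>2</sub>O
console.log(convertSimpleScripts('$a_1^2$'))   // $a_1^2$ (left untouched)
```

Combined expressions such as `$a_1^2$` deliberately fall through unchanged, so they can be handled (or rejected) elsewhere.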

retorquere commented 7 years ago

That's all I do. As a matter of fact I currently don't even check whether I'm in math mode (it gets complicated when you throw in combinations of \ensuremathmode) but just apply those whenever I see them. My parser does support uses of them such as $_{bla bla \cmd{etc}}$. Likewise, I just translate stuff like \\div to the appropriate utf-8 char regardless of where I find it. Not perfect, conceptually, but I've not had complaints so far.

johanneswilm commented 7 years ago

We do now support $_2$. This bug is still open because \"o is still not ö.

retorquere commented 7 years ago

I have a translation table you can use.

retorquere commented 7 years ago

Incompletely parsed currently:

Oil \& Gas Journal
P\"{a}ckert
Escaped backslash followed by \{\}: \\{}
Adsorption of 2-chlorophenol on {Cu$_2$O(1\,1\,1)-Cu$_\mathrm{CUS}$}
Original paper in Russian: Pis'ma v Zhurnal \'Eksperimental'no\u{\i} i Teoretichesko\u{\i} Fiziki, 89 (2009), 478--482
$\mu$
$\langle 100 \rangle$
$25\,^\circ\mathrm{C}$
$400\,^\circ$C
Stanis{\l}aw
$T_\mathrm{c}$
{\"{O}}nsten, Anneli and M{\aa}nsson, Martin
$1150\,^\circ\mathrm{C}$
la cuprite \`a {$4,2\,^\circ\mathrm{K}$}. Essai d'interpr\'etation
Etude de l'effet Zeeman de la raie $n = 1$ de la s\'erie jaune de {Cu$_2$O} \`a {$20\,^\circ\mathrm{K}$}
{\'E}tude de l'effet photo-{H}all dans la cuprite \`a $79\,^\circ\mathrm{K}$
Beno\^it \`a la Guillame

At least one of the errors is probably related to the fact that you can't safely iterate a unicode string in JavaScript by going over its individual characters. JavaScript strings are sort-of-UCS2 (sequences of 16-bit code units), and characters outside the Basic Multilingual Plane are returned as two individual characters.
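A small illustration of the iteration problem, using only standard JavaScript (ES2015 or later). The sample string is made up for this example and is not taken from any of the entries above.

```javascript
// 'a', then U+1D7D8 (a character outside the Basic Multilingual Plane), then 'b'
const sample = 'a\u{1D7D8}b'

// Indexing walks 16-bit code units, so the astral character is split in two.
const codeUnits = []
for (let i = 0; i < sample.length; i++) {
  codeUnits.push(sample[i])
}

// The string iterator (used by Array.from, for...of and spread) walks code points.
const codePoints = Array.from(sample)

console.log(sample.length)      // 4
console.log(codeUnits.length)   // 4
console.log(codePoints.length)  // 3
```

A parser that advances by index therefore sees two half-characters where a reader sees one.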

johanneswilm commented 7 years ago

This is a different entry, yeah? Could you give me the source for that second entry?

retorquere commented 7 years ago

Each line is from a different entry. The 2nd is from

@article{Frit2,
  author = {Fritz, U. and Corti, C. and P\"{a}ckert, M.},
  doi = {10.1007/s13127-011-0069-8},
  journal = {Actes du $4^{\textrm{ème}}$ Congrès Français d'Acoustique},
  pages = {71-80},
  timestamp = {2015-02-24 12:14:36 +0100},
  title = {Test of markupconversion: Italics, bold, superscript, subscript, and small caps: Mitochondrial DNA$_{\textrm{2}}$ sequences suggest unexpected phylogenetic position of Corso-Sardinian grass snakes (\textit{Natrix cetti}) and \textbf{do not} support their \textsc{species status}, with notes on phylogeography and subspecies delineation of grass snakes.},
  volume = {12},
  year = {2012}
}

retorquere commented 7 years ago

(It'd be helpful for me if I could paste text into the demo rather than/as an option instead of having to upload files. Less clicking and pointing.)

johanneswilm commented 7 years ago

@retorquere OK, you should now be able to paste

retorquere commented 7 years ago

\textsuperscript{o} is not translated.

retorquere commented 7 years ago

Also:

Other than these 6 I think we're pretty close from BBT's perspective. Even found a bug in my own parser.

retorquere commented 7 years ago

If I push in my own mapping table using the quick patch below, all my existing (non-comprehensive) tests pass:

var tsc = require('./lib/import/const')
// BBT.toUnicode is keyed by LaTeX string, with the unicode replacement as value
for (var latex in BBT.toUnicode) {
  var unicode = BBT.toUnicode[latex]
  // escape the regex metacharacters that occur in the LaTeX keys
  latex = latex.replace(/\\/g, "\\\\").replace(/{/g, '\\{').replace(/}/g, '\\}').replace(/\(/g, '\\(').replace(/\)/g, '\\)')
  tsc.TeXSpecialChars.push([latex, unicode])
}

I'll likely rewrite my mapping table to just look like yours so the line of replaces is pre-cooked.

retorquere commented 7 years ago

... which means I'm doing something very wrong because vphantom is not handled in that table. I must be generating false positives in some way. Odd.

johanneswilm commented 7 years ago

the literal parser tries to remove commands it doesn't understand.

johanneswilm commented 7 years ago

Should we be switching to your mapping table? I cannot remember why we didn't use yours so far.

retorquere commented 7 years ago

The table is fairly comprehensive, but it's also pretty big. It'd be my preference if we could use it, but I could understand making parts optional/amendable.

johanneswilm commented 7 years ago

Ok, maybe we can split it into two? One that covers 99% of uses and is max a few kb, and another that is X MB and covers the remaining percent?

We would need to be able to split it in such a way that if one bundles the small version, it doesn't include the big version, so that people can run it in browsers, etc.

Feasible?

johanneswilm commented 7 years ago

Also, can we possibly save some space and time by matching several versions at the same time? For example with and without braces?


new RegExp(`{${texChar[0]}}|${texChar[0]}`,'g')

retorquere commented 7 years ago

The full table is about 290K, nowhere near a few M, so if that's the worry... I'm not really sure which part of the table constitutes 99%.

It should be possible to create a version that matches several versions at the same time, no issue. Not sure what the texChar[0] means there though.

retorquere commented 7 years ago

Is https://gist.github.com/anonymous/88a897554adcd7e4e38a3612fbe7ef5b a usable format for you? That's the whole table, at 180K.

retorquere commented 7 years ago

That table encodes the unicode side as \u.... characters to sidestep the UCS2/UTF8 problem BTW. Less readable, but more predictable.
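For illustration, here is one way such `\u....` escapes could be produced. This is a sketch under assumptions: `toUnicodeEscapes` is a hypothetical helper, not the Ruby script that generates the table, and it requires `String.prototype.padStart` (ES2017).

```javascript
// Hypothetical helper: escape every non-ASCII UTF-16 code unit as \uXXXX.
// Characters outside the BMP come out as two escapes (a surrogate pair).
function toUnicodeEscapes(str) {
  let out = ''
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i)
    out += unit < 0x80
      ? str[i]
      : '\\u' + unit.toString(16).toUpperCase().padStart(4, '0')
  }
  return out
}

const escaped = toUnicodeEscapes('\u0126')      // the letter Ħ
console.log(escaped)                            // \u0126
// A JSON parser turns the escape back into the original character.
console.log(JSON.parse('"' + escaped + '"') === '\u0126')  // true
```

Because the escaped form is pure ASCII, it survives editors and tools that disagree about the file encoding.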

johanneswilm commented 7 years ago

Hmm, not superlarge, but still... Could we compress it more some way? Have you tried how small it gets when compressed?

retorquere commented 7 years ago

It's about 152K minified.

Note that the "left-hand sides" of these constructs come from the current BBT implementation, which expects to match the string at position 0, and my implementation only matches something like \space when it is followed by a command delimiter (roughly, a non-alphanum char or string end). I don't actually know how the parser works, which is a bit of a worry for me. I won't be able to fix things.
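The "command delimiter" rule can be approximated with a negative lookahead. This is a sketch only, not the BBT implementation: `escapeRegExp` and `commandRegExp` are hypothetical helpers, and "delimiter" is taken to mean "not followed by an ASCII letter or digit".

```javascript
// Escape regex metacharacters so a LaTeX command can be used as a literal pattern.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper: match `command` only when no letter or digit follows it.
// The lookahead also succeeds at the end of the string.
function commandRegExp(command) {
  return new RegExp(escapeRegExp(command) + '(?![A-Za-z0-9])', 'g')
}

const spaceCommand = commandRegExp('\\space')
// \space{} is replaced, \spacefactor is a different command and stays as it is.
console.log('a\\space{}b \\spacefactor'.replace(spaceCommand, ' '))
// a {}b \spacefactor
```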

retorquere commented 7 years ago

A (very) few of them are perhaps not very useful though -- my mapping table is used two ways, and I doubt many people would know that {\fontencoding{LELA}\selectfont\char40} generates Ħ or that \fontencoding{LELA}\selectfont\char201 generates Ŀ (and no, that's not a dead pixel on your screen). But if you enter Ħ, it will spit out that LaTeX, which will render that character.

johanneswilm commented 7 years ago

Ok, I see. Yes, I guess this is what we need to work with. So something like

\\\\boxslash\\{\\}|\\{\\\\boxslash\\}|\\\\boxslash

will match with 1. \boxslash{}, 2. {\boxslash} and 3. \boxslash

I currently only store 3, but run all of the regexps first with braces around them (that covers 2), and 1 is covered by 3 and the literal parser automatically removing empty braces.
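A three-variant pattern like the one above could be generated from the bare command rather than stored. The following is a sketch; `escapeRegExp` and `variantRegExp` are hypothetical names, and the replacement text is a placeholder rather than the real table value.

```javascript
// Escape regex metacharacters so a LaTeX command can be used as a literal pattern.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper: build one regexp that matches
//   1. \cmd{}   2. {\cmd}   3. \cmd
// The longer alternatives come first so they win over the bare command.
function variantRegExp(command) {
  const c = escapeRegExp(command)
  return new RegExp(`${c}\\{\\}|\\{${c}\\}|${c}`, 'g')
}

const boxslash = variantRegExp('\\boxslash')
console.log(boxslash.source)
// \\boxslash\{\}|\{\\boxslash\}|\\boxslash
console.log('1 \\boxslash{} 2 {\\boxslash} 3 \\boxslash'.replace(boxslash, '[box]'))
// 1 [box] 2 [box] 3 [box]
```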

The main issue I have with the current setup of the package is that the order becomes very important. In the case of the tilde in our current list, I can already see that the order is incorrect and will likely cause problems. Given that your list is so large, it's probably even more difficult to ensure that none of them cause trouble.

Unless I come up with a clever idea on how to compress the list more, I think we should go for it.

retorquere commented 7 years ago

Will ordering on descending length help? I can remove the duplicates.
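A quick sketch of why ordering on descending length matters when one entry is a prefix of another. The two-entry table and the `applyTable` helper are made up for this example and use plain substring replacement, which is simpler than what either parser does.

```javascript
// Two entries where one LaTeX command is a prefix of the other.
const table = [
  ['\\i', '\u0131'],      // dotless i
  ['\\infty', '\u221E']   // infinity sign
]

// Hypothetical helper: apply each [latex, unicode] pair in order, as literal text.
function applyTable(text, entries) {
  return entries.reduce(
    (out, [latex, unicode]) => out.split(latex).join(unicode),
    text
  )
}

// Longest LaTeX strings first, so \infty is consumed before \i can touch it.
const byLengthDesc = table.slice().sort((a, b) => b[0].length - a[0].length)

console.log(applyTable('\\infty', table))         // ınfty (wrong: \i matched first)
console.log(applyTable('\\infty', byLengthDesc))  // ∞
```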

johanneswilm commented 7 years ago

I think so. And some of the very basic symbols that are being used by latex directly \,&,{,},$ probably shouldn't be found at all. The style parser should take care of those afterward. The tilde may be OK though.

retorquere commented 7 years ago

I'll look into it.

johanneswilm commented 7 years ago

Btw, this link was sent to me a few years ago from the biblatex developers. This seems to be the list of what biber supports. I haven't seen any similar list for the bibtex util:

http://downloads.sourceforge.net/project/biblatex-biber/biblatex-biber/1.7/documentation/utf8-macro-map.html

retorquere commented 7 years ago

updated map: https://gist.github.com/052c93b1bd4033cf1945b52d87e3c2c9

retorquere commented 7 years ago

I took a look at that list from biblatex-biber, but it doesn't specify what characters are in math mode. Useful for import, but not so much export.

johanneswilm commented 7 years ago

Ok, great! Just to make sure: this list is available under an open license (lgpl or with fewer restrictions, such as BSD or MIT) and has been available for a while, right? With your permission, I will include it in the repository. I noticed a lot of Zotero code, including the name parser you were using, is AGPL (so is Fidus Writer).

retorquere commented 7 years ago

All my code is under MIT. I don't really care about licensing, my code was public domain until someone tried to explain at length that was not really a thing, and it was less effort to MIT-license my code than try to understand his argument. This particular list isn't even my code in a classical sense, the list is generated from several sources by a Ruby script.

johanneswilm commented 7 years ago

Yes, that person was me. :P I will add a note saying that the list is under the MIT license and is your (compositional) work. Sounds good?

Ok, what are the license terms of those sources? I get that in reality there is likely only a very tiny chance that someone comes running after us for infringing on their copyright for a list of tex characters, but better safe than sorry.

retorquere commented 7 years ago

Well then you could have known how passionate I am about avoiding thinking about it 😄 . I'm fine with any license for the parser as long as I can use it. My motto, should I have taken the time to formulate one, is bound to look a lot more like "better to ask forgiveness than to ask permission" than "better safe than sorry".

The scripts that generate this list are part of the BBT project, so they're also MIT. The code to produce this list isn't incredibly pretty, as it was a quick hack on the existing sources. At this stage the lists I use as source are actually stupidly stable, so very unlikely to change -- no idea whether it's still worth the effort to generate them, but I like coding over administrative work, so that will always drive me to write scripts when the opportunity arises.

The list sources are, in order (later sources override earlier sources when there's a conflict):

johanneswilm commented 7 years ago

OK, great, those are just data sources. So you are the holder of the copyright of the composition and you have published it under the MIT license. Great! I'll push now!

retorquere commented 7 years ago

Super!