Closed retorquere closed 8 years ago
How much of the math mode would you think should be supported? I assume something like $a_1^2$
will already be too complicated to be displayed correctly using CSL output, right? So maybe just plain $^X$
and $_X$
?
That's all I do. As a matter of fact, I currently don't even check whether I'm in math mode (it gets complicated when you throw in combinations of \ensuremath
) but just apply those whenever I see them. My parser does support uses of them such as $_{bla bla \cmd{etc}}$
. Likewise, I just translate stuff like \\div
to the appropriate utf-8 char regardless of where I find it. Not perfect, conceptually, but I've not had complaints so far.
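One way to sketch that blind replacement (hypothetical helper and partial digit tables for illustration only, not the actual BBT code, and using Unicode sub/superscript characters as the target):

```javascript
// Hypothetical sketch: translate $_X$ and $^X$ to Unicode sub/superscript
// characters without tracking math mode, as described above.
// SUBS/SUPS are deliberately partial digit tables.
const SUBS = { '0': '₀', '1': '₁', '2': '₂', '3': '₃', '4': '₄' };
const SUPS = { '0': '⁰', '1': '¹', '2': '²', '3': '³', '4': '⁴' };

function plainScripts(s) {
  return s
    .replace(/\$_([0-9])\$/g, (m, d) => SUBS[d] || m)   // $_2$ -> ₂
    .replace(/\$\^([0-9])\$/g, (m, d) => SUPS[d] || m); // $^2$ -> ²
}

console.log(plainScripts('Cu$_2$O')); // 'Cu₂O'
```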
We do now support $_2$
. This bug is still open because \"o
is still not ö
.
I have a translation table you can use.
Incompletely parsed currently:
Oil \& Gas Journal
P\"{a}ckert
Escaped backslash followed by \{\}: \\{}
Adsorption of 2-chlorophenol on {Cu$_2$O(1\,1\,1)-Cu$_\mathrm{CUS}$}
Original paper in Russian: Pis'ma v Zhurnal \'Eksperimental'no\u{\i} i Teoretichesko\u{\i} Fiziki, 89 (2009), 478--482
$\mu$
$\langle 100 \rangle$
$25\,^\circ\mathrm{C}$
$400\,^\circ$C
Stanis{\l}aw
$T_\mathrm{c}$
{\"{O}}nsten, Anneli and M{\aa}nsson, Martin
$1150\,^\circ\mathrm{C}$
la cuprite \`a {$4,2\,^\circ\mathrm{K}$}. Essai d'interpr\'etation
Etude de l'effet Zeeman de la raie $n = 1$ de la s\'erie jaune de {Cu$_2$O} \`a {$20\,^\circ\mathrm{K}$}
{\'E}tude de l'effet photo-{H}all dans la cuprite \`a $79\,^\circ\mathrm{K}$
Beno\^it \`a la Guillame
At least one of the errors is probably related to the fact that you can't safely iterate a Unicode string in JavaScript by going over its individual characters. JavaScript strings are sort-of-UCS-2, and characters outside the Basic Multilingual Plane are returned as two separate surrogate characters.
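A quick demonstration of the pitfall, assuming a non-BMP character such as U+1D49C:

```javascript
// A character outside the Basic Multilingual Plane occupies two
// JavaScript string indices (a surrogate pair), so index-based
// iteration splits it in half.
const s = 'a\u{1D49C}b'; // 'a𝒜b', where 𝒜 (U+1D49C) is non-BMP

console.log(s.length);          // 4, not 3
console.log(s.charCodeAt(1));   // 0xD835, a lone high surrogate

// Iterating by code point keeps the pair together:
console.log([...s].length);     // 3
console.log(Array.from(s)[1]);  // '𝒜'
```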
This is a different entry, yeah? Could you give me the source for that second entry?
Each line is from a different entry. The 2nd is from:
@article{Frit2,
author = {Fritz, U. and Corti, C. and P\"{a}ckert, M.},
doi = {10.1007/s13127-011-0069-8},
journal = {Actes du $4^{\textrm{ème}}$ Congrès Français d'Acoustique},
pages = {71-80},
timestamp = {2015-02-24 12:14:36 +0100},
title = {Test of markupconversion: Italics, bold, superscript, subscript, and small caps: Mitochondrial DNA$_{\textrm{2}}$ sequences suggest unexpected phylogenetic position of Corso-Sardinian grass snakes (\textit{Natrix cetti}) and \textbf{do not} support their \textsc{species status}, with notes on phylogeography and subspecies delineation of grass snakes.},
volume = {12},
year = {2012}
}
(It'd be helpful for me if I could paste text into the demo rather than/as an option instead of having to upload files. Less clicking and pointing.)
@retorquere OK, you should now be able to paste
\textsuperscript{o}
is not translated.
Also:
\textasciitilde{}
\textsubscript{2}
\vphantom\{
should be dropped~
Proof of Structure of {{$\Delta$1,4-Pregnadiene-17$\alpha$,21-diol-3,11,20-trione and $\Delta$1,4-Pregnadiene-11$\beta$,17$\alpha$,21-triol-3,20-dione}}
Other than these 6 I think we're pretty close from BBT's perspective. Even found a bug in my own parser.
If I push in my own mapping table using the quicky patch below, all my existing (non-comprehensive) tests pass:
var tsc = require('./lib/import/const')
for (var latex in BBT.toUnicode) {
  // look up the unicode target before the latex key gets mangled
  var unicode = BBT.toUnicode[latex]
  // escape regex metacharacters on the latex side
  latex = latex.replace(/\\/g, "\\\\").replace(/{/g, '\\{').replace(/}/g, '\\}').replace(/\(/g, '\\(').replace(/\)/g, '\\)')
  tsc.TeXSpecialChars.push([latex, unicode])
}
I'll likely rewrite my mapping table to just look like yours so the line of replaces is pre-cooked.
... which means I'm doing something very wrong because vphantom
is not handled in that table. I must be generating false positives in some way. Odd.
the literal parser tries to remove commands it doesn't understand.
Should we be switching to your mapping table? I cannot remember why we didn't use yours so far.
The table is fairly comprehensive, but it's also pretty big. It'd be my preference if we could use it, but I could understand making parts optional/amendable.
Ok, maybe we can split it into two? One that covers 99% of uses and is max a few kb, and another that is X MB and covers the remaining percent?
We would need to be able to split it in a way that, if one bundles the small version, it doesn't include the big version, so that people can run it in browsers, etc.
Feasible?
Also, can we possibly save some space and time by matching several versions at the same time? For example, with and without braces?
new RegExp(`{${texChar[0]}}|${texChar[0]}`,'g')
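A pattern like that would also need the LaTeX side escaped before interpolation; a minimal sketch, assuming texChar is a [latex, unicode] pair as in the mapping table:

```javascript
// Sketch only: build one regex per mapping entry that matches the command
// both with and without surrounding braces.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function combinedPattern(texChar) {
  const esc = escapeRegExp(texChar[0]);
  // braced alternative first, so '{\mu}' is consumed whole
  return new RegExp(`\\{${esc}\\}|${esc}`, 'g');
}

const out = '{\\mu}m and \\mu'.replace(combinedPattern(['\\mu', 'µ']), 'µ');
console.log(out); // 'µm and µ'
```

Note this still matches mid-word (e.g. inside \mud), so it would need the command-delimiter check discussed below the table anyway.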
The full table is about 290K, nowhere near a few M, so if that's the worry... I'm not really sure which part of the table constitutes 99%.
It should be possible to create a version that matches several versions at the same time, no issue. Not sure what the texChar[0]
means there though.
Is https://gist.github.com/anonymous/88a897554adcd7e4e38a3612fbe7ef5b a usable format for you? That's the whole table, at 180K.
That table encodes the unicode side as \u....
characters to sidestep the UCS2/UTF8 problem BTW. Less readable, but more predictable.
Hmm, not super large, but still... Could we compress it some more? Have you tried to see how small it gets when compressed?
It's about 152K minified.
Note that the "left-hand-side" of these constructs come from the current BBT implementation, which expects to match the string at position 0, and my implementation only matches something like \space
when it is followed by a command delimiter (roughly, a non-alphanum char or string end). I don't actually know how the parser works, which is a bit of a worry for me. I won't be able to fix things.
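To illustrate that delimiter rule (a sketch, not the actual parser code): "followed by a command delimiter" can be expressed with a negative lookahead on letters:

```javascript
// Sketch: match \space only when the command name is not immediately
// followed by a letter, so that e.g. \spacer is left alone.
function matchesCommand(input, name) {
  // name is the command without the backslash, e.g. 'space'
  return new RegExp(`\\\\${name}(?![a-zA-Z])`).test(input);
}

console.log(matchesCommand('x\\space y', 'space'));  // true
console.log(matchesCommand('x\\spacer y', 'space')); // false
console.log(matchesCommand('x\\space', 'space'));    // true (string end)
```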
A (very) few of them are perhaps not very useful though -- my mapping table is used two ways, and I doubt many people would know that {\fontencoding{LELA}\selectfont\char40}
generates Ħ or that \fontencoding{LELA}\selectfont\char201
generates Ŀ (and no, that's not a dead pixel on your screen). But if you enter Ħ, it will spit out that LaTeX, which will render that character.
Ok, I see. Yes, I guess this is what we need to work with. So something like
\\\\boxslash\\{\\}|\\{\\\\boxslash\\}|\\\\boxslash
will match with 1. \boxslash{}
, 2. {\boxslash}
and 3. \boxslash
I currently only store 3, but run all of the regexps first with braces around them (that covers 2), and 1 is covered by 3 and the literal parser automatically removing empty braces.
The main issue I have with the current setup of the package is that the order becomes very important. In the case of the tilde in our current list, I can already see that the order is incorrect and will likely cause problems. Given that your list is so large, it's probably even more difficult to ensure that none of them cause trouble.
Unless I come up with a clever idea on how to compress the list more, I think we should go for it.
Will ordering on descending length help? I can remove the duplicates.
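As a sketch of what descending-length ordering buys (hypothetical entries, not the real table):

```javascript
// Sorting by descending LaTeX-side length ensures longer commands are
// replaced before their prefixes, so \approxeq is not mangled by \approx.
const mapping = [
  ['\\approx', '≈'],
  ['\\approxeq', '≊'],
];
mapping.sort((a, b) => b[0].length - a[0].length);

let text = 'a \\approxeq b';
for (const [latex, uni] of mapping) {
  text = text.split(latex).join(uni); // plain sequential replacement
}
console.log(text); // 'a ≊ b'
```

With the original order, \approx would fire first and leave 'a ≈eq b' behind.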
I think so. And some of the very basic symbols that LaTeX uses directly (\, &, {, }, $) probably shouldn't be matched at all. The style parser should take care of those afterward. The tilde may be OK though.
I'll look into it.
Btw, this link was sent to me a few years ago by the biblatex developers. It seems to be the list of what biber supports. I haven't seen any similar list for the bibtex util:
I took a look at that list from biblatex-biber, but it doesn't specify what characters are in math mode. Useful for import, but not so much export.
Ok, great! Just to make sure: this list is available under an open license (LGPL or one with fewer restrictions, such as BSD or MIT) and has been available for a while, right? With your permission, I will include it in the repository. I noticed a lot of Zotero code, including the name parser you were using, is AGPL (as is Fidus Writer).
All my code is under MIT. I don't really care about licensing, my code was public domain until someone tried to explain at length that was not really a thing, and it was less effort to MIT-license my code than try to understand his argument. This particular list isn't even my code in a classical sense, the list is generated from several sources by a Ruby script.
Yes, that person was me. :P I will add a note saying that the list is under the MIT license and is your (compositional) work. Sounds good?
Ok, what are the license terms of those sources? I get that in reality there's likely only a very tiny chance that someone comes running after us for infringing their copyright on a list of TeX characters, but better safe than sorry.
Well, then you could have known how passionate I am about avoiding thinking about it 😄 . I'm fine with any license for the parser as long as I can use it. My motto, should I have taken the time to formulate one, is bound to look a lot more like "better to ask forgiveness than to ask permission" than "better safe than sorry".
The scripts that generate this list are part of the BBT project, so they're also MIT. The code that produces this list isn't incredibly pretty, as it was a quick hack on the existing sources. At this stage the lists I use as sources are actually stupidly stable, so very unlikely to change -- no idea whether it's still worth the effort to generate them, but I like coding over administrative work, so that will always drive me to write scripts when the opportunity arises.
The list sources are, in order (later sources override earlier sources when there's a conflict):
OK, great, those are just data sources. So you are the holder of the copyright of the composition and you have published it under the MIT license. Great! I'll push now!
Super!
If I parse this:
and put it through the CSL Exporter, I get this:
(note the
\
commands, and Cu$_2$O
should have become Cu2O)