Open retorquere opened 5 years ago
"does not seem to support all diacritics (errors on {\r a})" is fixed now.
I've updated the package and will update the surrounding documentation.
"does not seem to support chained concatenations (a # b # c)": I can't replicate this.
I think I removed a part of the test file which removed the concatenation but that also removed the real culprit: it throws for {\i}
and {\'i}
which both work for natbib
(on my Overleaf anyway).
I also cannot reproduce this. It imports both using the standalone BBT parser and in BBT as I'd expect. I've added this testcase but that passes for me.
As an aside, it's a simple fix for me to add missing diacritics (or other constructs), but {\i}
and {\'i}
are in my mapping, so I don't currently know which diacritics "Misses some forms of diacritics" refers to.
BTW, as far as completeness testing goes, I'd suggest testing at least
and optionally
BTW the BBT parser builds on the astrocite parser, parts of which are by my hand, but the BBT parser will therefore necessarily be slower than astrocite. I'm open to looking at the idea parser in this test, but that would need it to either produce an AST which I can postprocess, or that the postprocessing happens during parsing (which I'd not recommend).
I also cannot reproduce this. It imports both using the standalone BBT parser and in BBT as I'd expect.
It seems to happen specifically when a user-defined string containing the aforementioned forms of diacritics is used in a field (it works fine if they're not used, or if the diacritic is in the field itself like the test case you made). Here's proof I'm not crazy :)
BTW, as far as completeness testing goes, I'd suggest testing at least
I wanted to leave value parsing to a different part of the parser (namely, the mapping) as the mapping is used for Bib.TXT as well, which I assume has names in both formats, dates in EDTF, verbatim fields and everything else basically. And this parser would be specifically for everything that is BibLaTeX/BibTeX except the values, if that makes sense. I get that that makes it a bit of an unfair comparison, especially performance-wise, but I didn't mean this repository as a way of calling people out, just to see if my results were somewhat adequate.
I'm open to looking at the idea parser in this test, but that would need it to either produce an AST which I can postprocess, or that the postprocessing happens during parsing (which I'd not recommend).
I'm not trying to convince you to switch, either. I knew my old parser was bad and I wanted to see which method of parsing was best for my purposes (PEG.js, nearley-js, rolling my own parser, etc.). There isn't much postprocessing going on, it converts commands to Unicode, concatenates fields, and puts everything into an object. No conversion to CSL though, but it's not an AST and I understand if this is too much pre-processing.
{
type: String,
label: String,
properties: {
field: "value" // note: this is verbatim, except command -> Unicode
}
}
I wanted to leave value parsing to a different part of the parser (namely, the mapping) as the mapping is used for Bib.TXT as well, which I assume has names in both formats, dates in EDTF, verbatim fields and everything else basically. And this parser would be specifically for everything that is BibLaTeX/BibTeX except the values, if that makes sense.
Verbatim fields can't be parsed outside the grammar, because verbatim fields have a different parsing mode; the grammar has to know about them. EDTF dates can indeed be done later, but sentence casing and name parsing must be done at the AST level:
I Like {ISDN} Heaps Better than {dial-up}
must know that ISDN
and dial-up
are exempt from case meddling. Having I Like ISDN Heaps Better than dial-up
in CSL does not mean the same, and CSL styles that demand that titles be in sentence case would produce the wrong rendering.
{Bausch and Lomb}
and {{Bausch and Lomb}}
are not the same when parsing lists (such as names).
It's not just the speed difference. The BBT parser (and biblatex-csl-converter) keep the intended meaning structurally better than others in the list.
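To make the list point concrete, here is a minimal sketch of splitting a name list on " and " only at brace depth 0. `splitList` is a hypothetical helper written for this illustration, not code from BBT or astrocite; it assumes the outer field delimiters have already been stripped and it ignores escaped braces.

```javascript
// Hypothetical helper: split a BibTeX list value on " and " at brace depth 0.
// Assumes the field delimiters are already removed; \{ and \} are not handled.
function splitList(value) {
  const separator = /\s+and\s+/iy; // sticky: only matches at lastIndex
  const parts = [];
  let depth = 0;
  let current = "";
  let i = 0;
  while (i < value.length) {
    const ch = value[i];
    if (ch === "{") depth++;
    else if (ch === "}") depth = Math.max(0, depth - 1);

    if (depth === 0) {
      separator.lastIndex = i;
      const m = separator.exec(value);
      if (m) {
        // A top-level " and " ends the current list item.
        parts.push(current.trim());
        current = "";
        i += m[0].length;
        continue;
      }
    }
    current += ch;
    i++;
  }
  parts.push(current.trim());
  return parts;
}

console.log(splitList("Bausch and Lomb"));   // [ 'Bausch', 'Lomb' ]
console.log(splitList("{Bausch and Lomb}")); // [ '{Bausch and Lomb}' ]
```

Once the braces are stripped from the value, the two inputs become indistinguishable, which is why the split has to happen while the brace structure is still available.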
It seems to happen specifically when a user-defined string containing the aforementioned forms of diacritics is used in a field (it works fine if they're not used, or if the diacritic is in the field itself like the test case you made). Here's proof I'm not crazy :)
This is now fixed.
(I tried parsing syntax.bib, but it'd require changes to the astrocite parser, and it isn't valid bib(la)tex it seems; overleaf chokes on it at least)
Verbatim fields can't be parsed outside the grammar, because verbatim fields have a different parsing mode; the grammar has to know about them.
Right, I mixed that up. It's in #3 (checkbox 3).
sentence casing and name parsing must be done at the AST level
Braces in values are kept for that reason (except around some diacritic commands, as D<span class="nocase">é</span>coret
is a bit over the top for {\' e}
, in my opinion). Bib.TXT needs those too for authors and casing, so that's dealt with in the mapping for the moment.
It's not just the speed difference. The BBT parser (and biblatex-csl-converter) keep the intended meaning structurally better than others in the list.
True, but I think this is well enough for my intended purposes. I can try to make a switch for it to return an AST, given the structure of the parser I think that should be very possible. Anyway, I updated the README to mention the AST capabilities.
Braces in values are kept for that reason (except around some diacritic commands, as
D<span class="nocase">é</span>coret
is a bit over the top for {\' e}
, in my opinion).
Sure. But it's more complicated than that; the braces usually, but not always, mean nocase. See https://retorque.re/zotero-better-bibtex/support/faq/#why-the-double-braces for some examples and links to details. And then there's still the point that lists (literal lists and names) can only be properly distinguished at the grammar level. nocase
isn't appropriate there.
Bib.TXT needs those too for authors and casing, so that's dealt with in the mapping for the moment.
I don't know what Bib.TXT is BTW.
True, but I think this is well enough for my intended purposes.
Can't argue that of course, but then "complete" doesn't mean a whole lot. But at the least, footnote 5 has been fixed now, unless there's more diacritics I missed.
I can try to make a switch for it to return an AST, given the structure of the parser I think that should be very possible. Anyway, I updated the README to mention the AST capabilities.
Cool. BTW if name parsing and the meaning of braces (nocase or not) happens inside the parser, and the parser also converts markup (such as superscript, emph, etc), an AST may not be required. But I found it easier to do those by transforming the AST; that's actually what the BBT parser adds to the astrocite parser. The actual grammar is just that of astrocite, although I did add changes to the astrocite parser to be able to parse my test suite.
My parser also adds a simple form of error recovery BTW. The astrocite parser is an all-or-nothing parser. The BBT parser will parse entries one by one and give some info on entries that failed to parse.
I don't know what Bib.TXT is BTW.
Sorry, Bib.TXT is just a reskin of BibTeX. I don't think it has gotten much use, but the premise is that it supports Unicode and presents the key/value pairs in a different way, but the values, in theory, stay the same. I say that now, and that's how I implemented it, but I don't really have a way of knowing; that website is my only point of reference.
Anyway, if the values stay the same there are basically two ways of presenting the values, Bib(La)TeX and Bib.TXT. My parser only makes level ground for the two, the rest is in the mapping, i.e. the Bib(La)TeX/Bib.TXT to CSL mapping.
The BBT parser will parse entries one by one and give some info on entries that failed to parse.
I had something like that in my previous parser, I'll see how I can fit that in in this one. I guess braces still have to be paired for your one?
Sorry, Bib.TXT is just a reskin of BibTeX. I don't think it has gotten much use, but the premise is that it supports Unicode and presents the key/value pairs in a different way, but the values, in theory, stay the same. I say that now, and that's how I implemented it, but I don't really have a way of knowing; that website is my only point of reference.
I mean... if you're leaning that way, wouldn't TOML or YAML make more sense? At least the more naive parsers (which can sometimes be useful) become trivial.
Anyway, if the values stay the same there are basically two ways of presenting the values, Bib(La)TeX and Bib.TXT. My parser only makes level ground for the two, the rest is in the mapping, i.e. the Bib(La)TeX/Bib.TXT to CSL mapping.
It may be that we see the meaning of "values" differently. For a title, HTML markup will mostly do, as long as the actual intent (which is, as noted, non-trivial) comes through. But name-lists and literal-lists are not strings, they're lists of strings, and you can't safely deduce where they're to be broken into parts without passing on the structure.
I had something like that in my previous parser, I'll see how I can fit that in in this one. I guess braces still have to be paired for your one?
An unclosed open brace will consume all the input after it, yes, but all other errors (also unexpected closing braces) will skip ahead to the first @
it can find and attempt reparsing from that point on, repeatedly, until all input is parsed or consumed this way. So it will generally report and skip the smallest error it can, with the worst case being a single unpaired open brace.
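The skip-to-the-next-@ strategy described above can be sketched roughly as follows. This is an illustration only: `parseWithRecovery` and `toyParseEntry` are hypothetical names, and the toy strict parser merely checks for `@type{key,` plus balanced braces, so it stands in for a real all-or-nothing grammar without reproducing BBT's exact behaviour.

```javascript
// Toy stand-in for a strict parser: returns { entry, end } or throws.
function toyParseEntry(input, start) {
  const head = /@(\w+)\s*\{\s*([^,\s{}]+)\s*,/y;
  head.lastIndex = start;
  const m = head.exec(input);
  if (!m) throw new Error("expected @type{key,");
  let depth = 1;
  let i = head.lastIndex;
  for (; i < input.length && depth > 0; i++) {
    if (input[i] === "{") depth++;
    else if (input[i] === "}") depth--;
  }
  if (depth > 0) throw new Error("unclosed brace");
  return { entry: { type: m[1], key: m[2] }, end: i };
}

// Parse entry by entry; on failure, record the error and retry at the next "@".
function parseWithRecovery(input, parseEntry) {
  const entries = [];
  const errors = [];
  let pos = input.indexOf("@");
  while (pos !== -1) {
    try {
      const { entry, end } = parseEntry(input, pos);
      entries.push(entry);
      pos = input.indexOf("@", end);
    } catch (err) {
      errors.push({ offset: pos, message: err.message });
      pos = input.indexOf("@", pos + 1); // skip ahead and reparse from there
    }
  }
  return { entries, errors };
}

const sample = "@book{a, title={A}}\n@article{ b c }\n@misc{d, year={1}}";
const result = parseWithRecovery(sample, toyParseEntry);
console.log(result.entries.map((e) => e.key)); // [ 'a', 'd' ]
console.log(result.errors.length);             // 1
```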
This mixes tokens (lowercase) and rules (capitalized), but that could be changed as long as there are no naming conflicts.
@book{label, title = "{T}est" }
{ kind: 'Main', loc: { start: { offset: 0, line: 1, col: 1 }, end: { offset: 33, line: 3, col: 2 } }, children: [ { kind: 'Entry', loc: { start: { offset: 0, line: 1, col: 1 }, end: { offset: 33, line: 3, col: 2 } }, children: [ { kind: 'at', loc: { start: { offset: 0, line: 1, col: 1 }, end: { offset: 1, line: 1, col: 2 } }, value: '@' }, { kind: 'dataEntryType', loc: { start: { offset: 1, line: 1, col: 2 }, end: { offset: 5, line: 1, col: 6 } }, value: 'book' }, { kind: 'lbrace', loc: { start: { offset: 5, line: 1, col: 6 }, end: { offset: 6, line: 1, col: 7 } }, value: '{' }, { kind: 'label', loc: { start: { offset: 6, line: 1, col: 7 }, end: { offset: 11, line: 1, col: 12 } }, value: 'label' }, { kind: 'comma', loc: { start: { offset: 11, line: 1, col: 12 }, end: { offset: 12, line: 1, col: 13 } }, value: ',' }, { kind: '_', loc: { start: { offset: 12, line: 1, col: 13 }, end: { offset: 15, line: 2, col: 2 } }, children: [ { kind: 'whitespace', loc: { start: { offset: 12, line: 1, col: 13 }, end: { offset: 15, line: 2, col: 2 } }, value: '\n ' } ], value: undefined }, { kind: 'EntryBody', loc: { start: { offset: 15, line: 2, col: 3 }, end: { offset: 32, line: 3, col: 0 } }, children: [ { kind: 'Field', loc: { start: { offset: 15, line: 2, col: 3 }, end: { offset: 32, line: 3, col: 0 } }, children: [ { kind: 'identifier', loc: { start: { offset: 15, line: 2, col: 3 }, end: { offset: 20, line: 2, col: 8 } }, value: 'title' }, { kind: '_', loc: { start: { offset: 20, line: 2, col: 8 }, end: { offset: 21, line: 2, col: 9 } }, children: [ { kind: 'whitespace', loc: { start: { offset: 20, line: 2, col: 8 }, end: { offset: 21, line: 2, col: 9 } }, value: ' ' } ], value: undefined }, { kind: 'equals', loc: { start: { offset: 21, line: 2, col: 9 }, end: { offset: 22, line: 2, col: 10 } }, value: '=' }, { kind: '_', loc: { start: { offset: 22, line: 2, col: 10 }, end: { offset: 23, line: 2, col: 11 } }, children: [ { kind: 'whitespace', loc: { start: { offset: 22, line: 
2, col: 10 }, end: { offset: 23, line: 2, col: 11 } }, value: ' ' } ], value: undefined }, { kind: 'Expression', loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 32, line: 3, col: 0 } }, children: [ { kind: 'ExpressionPart', loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 31, line: 2, col: 19 } }, children: [ { kind: 'QuoteString', loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 31, line: 2, col: 19 } }, children: [ { kind: 'quote', loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 24, line: 2, col: 12 } }, value: '"' }, { kind: 'Text', loc: { start: { offset: 24, line: 2, col: 12 }, end: { offset: 27, line: 2, col: 15 } }, children: [ { kind: 'BracketString', loc: { start: { offset: 24, line: 2, col: 12 }, end: { offset: 27, line: 2, col: 15 } }, children: [ { kind: 'lbrace', loc: { start: { offset: 24, line: 2, col: 12 }, end: { offset: 25, line: 2, col: 13 } }, value: '{' }, { kind: 'Text', loc: { start: { offset: 25, line: 2, col: 13 }, end: { offset: 26, line: 2, col: 14 } }, children: [ { kind: 'text', loc: { start: { offset: 25, line: 2, col: 13 }, end: { offset: 26, line: 2, col: 14 } }, value: 'T' } ], value: 'T' }, { kind: 'rbrace', loc: { start: { offset: 26, line: 2, col: 14 }, end: { offset: 27, line: 2, col: 15 } }, value: '}' } ], value: 'T' } ], value: '{T}' }, { kind: 'Text', loc: { start: { offset: 27, line: 2, col: 15 }, end: { offset: 30, line: 2, col: 18 } }, children: [ { kind: 'text', loc: { start: { offset: 27, line: 2, col: 15 }, end: { offset: 30, line: 2, col: 18 } }, value: 'est' } ], value: 'est' }, { kind: 'quote', loc: { start: { offset: 30, line: 2, col: 18 }, end: { offset: 31, line: 2, col: 19 } }, value: '"' } ], value: '{T}est' } ], value: '{T}est' }, { kind: '_', loc: { start: { offset: 31, line: 2, col: 19 }, end: { offset: 32, line: 3, col: 0 } }, children: [ { kind: 'whitespace', loc: { start: { offset: 31, line: 2, col: 19 }, end: { offset: 32, line: 3, col: 0 } }, 
value: '\n' } ], value: undefined } ], value: '{T}est' } ], value: [ 'title', '{T}est' ] } ], value: { title: '{T}est' } }, { kind: 'rbrace', loc: { start: { offset: 32, line: 3, col: 1 }, end: { offset: 33, line: 3, col: 2 } }, value: '}' } ], value: { type: 'book', label: 'label', properties: { title: '{T}est' } } } ], value: [ { type: 'book', label: 'label', properties: { title: '{T}est' } } ] }
What is being mixed? I don't understand? This is the AST produced by the new idea parser?
This is the AST produced by the new idea parser?
Yes.
What is being mixed?
I'm using a tokenizer (moo) which splits up the text into parts like lbrace
and at
and text
based on where it's at in the file. Then, the rules are defined based on those tokens instead of individual characters, which helped a lot with the performance on abstracts for example. However, the AST has both rules (as branches) and tokens (as leaves) with no real distinction except their name and their position in the tree.
I see. But as far as I can tell, the tokens should be easy enough to filter out, and that should leave a fairly clean nested AST, which I could then inspect and transform.
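A rough sketch of that filtering step, assuming the node shape shown in the dump above (`kind`, `loc`, optional `children`, `value`) and the naming convention that rules are capitalized while tokens are lowercase. `pruneTokens` is a hypothetical helper; note that the whitespace rule `_` is dropped along with the tokens because its name is not capitalized.

```javascript
// Rules start with an uppercase letter; tokens (and the "_" rule) do not.
const isRule = (node) => /^[A-Z]/.test(node.kind);

// Return a copy of the tree that keeps only rule nodes.
function pruneTokens(node) {
  return {
    kind: node.kind,
    loc: node.loc,
    value: node.value,
    children: (node.children || []).filter(isRule).map(pruneTokens),
  };
}

// Hand-written fragment in the same shape as the dump (loc omitted for brevity).
const field = {
  kind: "Field",
  value: ["title", "est"],
  children: [
    { kind: "identifier", value: "title" },
    { kind: "_", value: undefined, children: [{ kind: "whitespace", value: " " }] },
    { kind: "equals", value: "=" },
    {
      kind: "Expression",
      value: "est",
      children: [{ kind: "Text", value: "est", children: [{ kind: "text", value: "est" }] }],
    },
  ],
};

const pruned = pruneTokens(field);
console.log(pruned.children.map((c) => c.kind)); // [ 'Expression' ]
```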
Can I play with this? I am curious what {Bausch and Lomb}
and {{Bausch and Lomb}}
would return. From what I see above I suspect I'd get something where I can see the difference between these two "and"s.
How would I add test cases to the idea parser? First thing is I'd be curious to see if my existing tests parse at all.
Error recovery is separate in my parser BTW. If it can be built into the idea parser it will almost certainly be faster, but if not, I could just keep my existing one; the error recovery works by chunking the input into individual entries/strings/comments, then parsing these individually with the astrocite parser, then reassembling the results (among other things, by replacing references to @strings with the AST of those @strings).
Can I play with this? I am curious what
{Bausch and Lomb}
and{{Bausch and Lomb}}
would return. From what I see above I suspect I'd get something where I can see the difference between these two "and"s.
I'll push the changes to ast
.
How would I add test cases to the idea parser? First thing is I'd be curious to see if my existing tests parse at all.
In principle just by adding files to the test/files/
directory. I updated the test suite (in the ast
branch) so it works for a single parser and numerous files instead of many parsers and a few files. You can run npm test
to run the parser on every file in test/files/
. You can also run
node test/ast.js test/files/single.bib
to get a single file's AST output. Note that those can be pretty long, longer than my terminal's scrollback anyway. For the sake of brevity, the updated test suite only prints success on success.
if I do
node test/ast.js test/files/syntax.bib
I get
[
{
"type": "book",
"label": "sweig42",
"properties": {
"author": "Stefan Swe{\\i}g and Xavier D\\'ecoret",
"title": " The {impossible} ℡—book ",
"publisher": " D\\\"ead Poₑeet Society",
"year": 1942,
"month": "03"
}
}
]
which isn't what I expected. Should this have been the AST?
Did you run npm run babel
to update lib/
first?
Right, now it gives me the AST.
It parses most of my test suite files, with these exceptions:
../bibtex/tests/better-bibtex/export/Really Big whopping library.bib
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
../bibtex/tests/better-bibtex/import/Async import, large library #720.bib
Error: invalid syntax at line 64197 col 1:
@inproceedings{Mills2012a,
^
../bibtex/tests/better-bibtex/import/Endnote should parse.bib
SyntaxError: expected "comma", got "label" at line 3 col 11:
author =
^ (Main->Entry)
../bibtex/tests/better-bibtex/import/Import Jabref fileDirectory, unexpected reference type #1058.bib
SyntaxError: expected "comma", got "label" at line 33 col 23:
@Comment{jabref-meta: databaseType:bibtex;}
^ (Main->Entry)
../bibtex/tests/better-bibtex/import/Jabref groups import does not work #717.3.8.bib
SyntaxError: expected "comma", got "label" at line 36 col 23:
@Comment{jabref-meta: databaseType:bibtex;}
^ (Main->Entry)
../bibtex/tests/better-bibtex/import/Maintain the JabRef group and subgroup structure when importing a BibTeX db #97.bib
Error: invalid syntax at line 9242 col 52:
results for Z~S/CUZO and 7.nO/Cu20 heterojunctions.},
^
../bibtex/tests/better-bibtex/import/Some bibtex entries quietly discarded on import from bib file #873.bib
SyntaxError: expected "lbrace", got "label" at line 1954 col 10:
@Comment Len
^ (Main->Entry)
Cleanup of the AST will be a bit of work; I'll take a look over the weekend.
I'll look at the test results this weekend as well.
One other thing that the chunker adds is optional async BTW. It's not really "background" async, but it will yield to the event loop after every chunk which allows other tasks to interleave with parsing.
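For illustration, yielding after every chunk could look something like the sketch below. `parseChunksAsync` and `parseChunk` are hypothetical names, and using `setTimeout(…, 0)` as the yield point is an assumption; the actual chunker may do this differently.

```javascript
// Resolve on a later turn of the event loop so pending callbacks can run.
const yieldToEventLoop = () => new Promise((resolve) => setTimeout(resolve, 0));

// Parse chunks one at a time, yielding between them.
// parseChunk is any synchronous per-chunk parser supplied by the caller.
async function parseChunksAsync(chunks, parseChunk) {
  const results = [];
  for (const chunk of chunks) {
    results.push(parseChunk(chunk));
    await yieldToEventLoop(); // not background work, just cooperative interleaving
  }
  return results;
}
```

The parsing itself still runs on the main thread; other tasks only get a turn between chunks, so one very large entry would still block for as long as it takes to parse.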
Oh and wrt verbatim fields, mendeley gets this wrong for eg file
fields so my parser has an option to choose whether file fields are verbatim or not.
At one time, endnote also exported items without citation keys. There's a ton of real-life crap in my test suite - just because it parses doesn't necessarily mean the meaning is extracted properly.
BTW, I've put together a quicky test runner based on benchmark.js and the numbers shift a little; some better, some worse: https://gist.github.com/retorquere/79fb0ad7062a85a1d83e4b004d40985e
Oh and wrt verbatim fields, mendeley gets this wrong for eg file fields so my parser has an option to choose whether file fields are verbatim or not.
Good idea, I'll make them configurable when I implement it.
BTW, I've put together a quicky test runner based on benchmark.js and the numbers shift a little; some better, some worse: https://gist.github.com/retorquere/79fb0ad7062a85a1d83e4b004d40985e
Cool! I'll add the figures (and/or the test suite) to the repo.
Another thing (just added to @retorquere/bibtex-parser
): only "engl-ish" (english
and some variants like usenglish
, american
, etc) should be sentence cased on import.
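A naive sketch of that gating, combined with the brace protection discussed earlier in the thread. Everything here is an assumption for illustration: `sentenceCase` is a hypothetical helper, the langid set contains only the three values named above, and every brace group is treated as protected, which (as noted earlier) is a simplification.

```javascript
// Assumed set; the thread names english, usenglish and american, plus "etc".
const ENGLISH_LANGIDS = new Set(["english", "usenglish", "american"]);

// Lowercase everything outside braces except the first letter of the title,
// and only for English-like langids. Proper nouns outside braces are not detected.
function sentenceCase(title, langid = "english") {
  if (!ENGLISH_LANGIDS.has(langid.toLowerCase())) return title;
  let depth = 0;
  let seenFirstLetter = false;
  let out = "";
  for (const ch of title) {
    if (ch === "{") depth++;
    else if (ch === "}") depth = Math.max(0, depth - 1);
    const isLetter = /\p{L}/u.test(ch);
    out += isLetter && depth === 0 && seenFirstLetter ? ch.toLowerCase() : ch;
    if (isLetter) seenFirstLetter = true;
  }
  return out;
}

console.log(sentenceCase("I Like {ISDN} Heaps Better than {dial-up}"));
// I like {ISDN} heaps better than {dial-up}
console.log(sentenceCase("Die Kunst der Fuge", "german"));
// Die Kunst der Fuge
```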
The BBT parser has been updated -- {\emph same}
wasn't recognized properly. I didn't think anyone would ever use this, but it's in my test suite. It behaves differently from {\it same}
BTW. {\it same}
italicizes same
, {\emph same}
italicizes just the s
.
On two of my test files, astrocite at least runs out of memory, whereas my parser will parse them correctly (if slowly; they're 8.2 MB and 11 MB respectively).
Does the citation-js parser handle verbatim fields (like url
and file
) and verbatim commands (\url
, \href
, probably others)?
A few things recently fixed in the BBT parser that citation-js may not yet be aware of:
$\frac n 2 + 5$
is valid, and equivalent to $\frac{n}{2} + 5$
<
and >
mean different things depending on whether you're parsing in math mode or text mode.
BBT has its own AST parser now, which is based on a version of astrocite grammar but has seen substantial (and incompatible) changes since.
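As a sketch of the unbraced-argument rule from the list above: a command argument is either a brace group, a control sequence, or a single character, with leading spaces skipped. `ARITY`, `readArg` and `braceArgs` are hypothetical names for this illustration, the arity table is an assumption, and escaped braces and optional arguments are not handled.

```javascript
// Hypothetical arity table; a real parser would need an entry per supported command.
const ARITY = new Map([["frac", 2]]);

// Read one undelimited argument at pos: a brace group, a control sequence,
// or a single character. Leading spaces are skipped.
function readArg(src, pos) {
  while (src[pos] === " ") pos++;
  if (pos >= src.length) throw new Error("missing argument");
  if (src[pos] === "{") {
    let depth = 0;
    let i = pos;
    while (i < src.length) {
      if (src[i] === "{") depth++;
      else if (src[i] === "}") depth--;
      i++;
      if (depth === 0) return { text: src.slice(pos + 1, i - 1), end: i };
    }
    throw new Error("unclosed brace");
  }
  if (src[pos] === "\\") {
    const cs = /\\(?:[A-Za-z]+|.)/y;
    cs.lastIndex = pos;
    const m = cs.exec(src);
    if (!m) throw new Error("dangling backslash");
    return { text: m[0], end: pos + m[0].length };
  }
  return { text: src[pos], end: pos + 1 };
}

// Rewrite known commands so every argument is an explicit brace group.
function braceArgs(src) {
  const command = /\\([A-Za-z]+)/y;
  let out = "";
  let pos = 0;
  while (pos < src.length) {
    command.lastIndex = pos;
    const m = command.exec(src);
    if (m && ARITY.has(m[1])) {
      out += m[0];
      pos += m[0].length;
      for (let n = 0; n < ARITY.get(m[1]); n++) {
        const arg = readArg(src, pos);
        out += "{" + braceArgs(arg.text) + "}";
        pos = arg.end;
      }
    } else {
      out += src[pos++];
    }
  }
  return out;
}

console.log(braceArgs("\\frac n 2 + 5"));     // \frac{n}{2} + 5
console.log(braceArgs("\\frac{n}{2} + 5"));   // \frac{n}{2} + 5
```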
It still seems strange to me to label parsers "complete" merely because they don't crash. Name parsing, verbatim fields, title-sentence casing, command-argument handling are all crucial parts of parsing bibtex. I'd wager that none of the "complete" parsers will parse {Bausch and Lomb}
vs {{Bausch and Lomb}}
correctly, or handle $\frac n 2 + 5$
properly.
Nice on the updated tests! BBT 3.1.20 fixes all non-gimmick tests and some gimmick tests.
What do you think the state of idea-reworked is now? Given how fast it is I may want to build on it, but I'd need to be able to pass my own test suite.
The main part missing from idea-reworked right now is the actual mapping to CSL or other output formats too. That includes field information as well, such as url
and verbatim
fields and automatic recognition of list
fields. And for that, I need some distinction between natbib and biblatex, as they have minor differences in syntax. Note: I am aware that a lot of this is minor edge cases (apart from the field information).
I have been working on mappings over at the aptly named bibtex-mappings, I don't remember if I linked it before. The repository contains some data text-mined from documentation (the biblatex docs are especially usable for this) to be combined with hand-crafted mappings.
I understand why you'd want mapping to other objects, but I just want the parsed object (pretty much what _intoFixtureOutput delivers), and I'll take it from there, as I'm targeting specifically conversion to Zotero objects.
The command concatenation gimmick would be pretty difficult to address in my parser, but to me that wouldn't be any kind of priority. It's interesting to see that your parser can deal with it successfully, but it's not something I expect to see in the wild.
Do you happen to have some documentation for the extended name format? I am working on name parsing now and I did find 3.6 Data Annotations in the BibLaTeX manual but that's slightly different from how the feature fixture you added works.
I don't have docs handy, no, and maybe I misunderstood it when I built it. What difference do you see?
Apparently, what you have works but I have not found it in the manual yet. I did find this, on page 82 in http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf:
@MISC{ann1,
AUTHOR = {Last1, First1 and Last2, First2 and Last3, First3},
AUTHOR+an = {1:family=student;2=corresponding}
}
But the name-parts are not overwritten by the annotation.
There's an example of what you implemented here: https://github.com/plk/biblatex/blob/dev/doc/latex/biblatex/examples/93-nameparts.tex
But the name-parts are not overwritten by the annotation.
Looking at the docs, I don't think they're meant to overwrite name-parts? They add annotations to the specific name-parts, and those annotations can be used in specialized styles; I've only seen it used in annotated bibliographies myself.
I updated the feature fixtures to include all the name parts instead of just the last name, and on one I encountered unexpected \u0004
characters in BBT's output. They seem to come from https://github.com/retorquere/bibtex-parser/blob/f41af75fd9350507279b42078d07de1187699455/index.ts#L63-L67; what is that used for specifically? Should it still be there in the output?
Looking at the docs, I don't think they're meant to overwrite name-parts? They add annotations to the specific name-parts, and those annotations can be used in specialized styles; I've only seen it used in annotated bibliographies myself.
I think you're right, still a bit confused about the annotation in the example though. Why would someone annotate specifically the family part of the name with "student"?
I can't say with certainty, but this looks to me like a synthetic sample meant to show what's possible with annotations, more than an actual sample from an actual annotated bibliography.
Those 0004 chars should not be in the output, I'll look into that.
If it helps, I saw it when there were braces in explicit name part values in the extended name format:
@article{test,
author = {family=Duchamp, given=Philippe, given-i={Ph}}
}
Thanks, that is fixed in the latest release.
I'm also tinkering with chevrotain to remove a pass from my parser.
Cool! I think I might have heard of chevrotain before but I do not recognize the website... the uppercase function names seem familiar though.
I've tried chevrotain, but if your test results are anything to go by, your parser is 2-3 times faster than my lexer alone. I can't replicate your results because npm install
fails for me, but clearly I should be looking to use your parser for speed. What's the current state of things? I see that moo
is only used "for now", do you intend to remove that dependency?
WRT the issues reported here on the BBT parser:
The sample below imports in BBT since 5.1.154
but the concat part of the title imported before that too.