citation-js / bibtex-parser-experiments

Experiments to determine a new BibTeX parser formula for Citation.js -- to be applied to other formats as well
https://travis-ci.com/citation-js/bibtex-parser-experiments/builds
MIT License

BBT parser #4

Open retorquere opened 4 years ago

retorquere commented 4 years ago

WRT the issues reported here on the BBT parser:

The sample below imports in BBT since 5.1.154

@string{j = {a space between this }}
@string{a = { string a}}
@string{b = { string b}}
@string{c = { string c}}
@article{key,
    author  = "Author",
    title   = "{\r a}Title" # a # b # c,
    year    = 1990,
    journal = j # "and this"
}

but the concat part of the title imported before that too.
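For readers following along, the chained concatenation in the sample title can be sketched roughly as follows. This is an illustration only, not code from BBT or Citation.js: `resolveConcat` and the `{ kind, value }` part shape are hypothetical names made up for the example.

```javascript
// Hypothetical sketch: resolve a `#`-concatenated value against @string macros.
// The part shape ({ kind: 'literal' | 'ref', value }) is an assumption for this example.
function resolveConcat (parts, strings) {
  return parts.map(part => {
    if (part.kind === 'literal') return part.value
    const key = part.value.toLowerCase() // macro names are matched case-insensitively here
    if (!strings.has(key)) throw new Error(`Undefined @string: ${part.value}`)
    return strings.get(key)
  }).join('')
}

// The @string definitions from the sample above (only a, b and c are needed for the title).
const strings = new Map([
  ['a', ' string a'],
  ['b', ' string b'],
  ['c', ' string c']
])

// title = "{\r a}Title" # a # b # c
const title = resolveConcat([
  { kind: 'literal', value: '{\\r a}Title' },
  { kind: 'ref', value: 'a' },
  { kind: 'ref', value: 'b' },
  { kind: 'ref', value: 'c' }
], strings)

console.log(title) // {\r a}Title string a string b string c
```

The diacritic command `{\r a}` is deliberately left untouched here; converting commands to Unicode would be a separate step.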

larsgw commented 4 years ago

"does not seem to support all diacritics (errors on {\r a})": this is fixed now.

I've updated the package and will update the surrounding documentation.

"does not seem to support chained concatenations (a # b # c)": I can't replicate this.

I think I removed a part of the test file which removed the concatenation but that also removed the real culprit: it throws for {\i} and {\'i} which both work for natbib (on my Overleaf anyway).

retorquere commented 4 years ago

I also cannot reproduce this. It imports both using the standalone BBT parser and in BBT as I'd expect. I've added this testcase but that passes for me.

retorquere commented 4 years ago

As an aside, it's a simple fix for me to add missing diacritics (or other constructs), but {\i} and {\'i} are in my mapping, so I don't currently know which diacritics "Misses some forms of diacritics" refers to.

retorquere commented 4 years ago

BTW, as far as completeness testing goes, I'd suggest testing at least

and optionally

retorquere commented 4 years ago

BTW the BBT parser builds on the astrocite parser, parts of which are by my hand, but the BBT parser will therefore necessarily be slower than astrocite. I'm open to looking at the idea parser in this test, but that would need it to either produce an AST which I can postprocess, or that the postprocessing happens during parsing (which I'd not recommend).

larsgw commented 4 years ago

I also cannot reproduce this. It imports both using the standalone BBT parser and in BBT as I'd expect.

It seems to be specifically when a user-defined string with the aforementioned specific forms of diacritics are used in a field (it works fine if they're not used, or if the diacritic is in the field itself like the test case you made). Here's proof I'm not crazy :)

larsgw commented 4 years ago

BTW, as far as completeness testing goes, I'd suggest testing at least

I wanted to leave value parsing to a different part of the parser (namely, the mapping) as the mapping is used for Bib.TXT as well, which I assume has names in both formats, dates in EDTF, verbatim fields and everything else basically. And this parser would be specifically for everything that is BibLaTeX/BibTeX except the values, if that makes sense. I get that that makes it a bit of an unfair comparison, especially performance-wise, but I didn't mean this repository as a way of calling people out, just to see if my results were somewhat adequate.

larsgw commented 4 years ago

I'm open to looking at the idea parser in this test, but that would need it to either produce an AST which I can postprocess, or that the postprocessing happens during parsing (which I'd not recommend).

I'm not trying to convince you to switch, either. I knew my old parser was bad and I wanted to see which method of parsing was best for my purposes (PEG.js, nearley-js, rolling my own parser, etc.). There isn't much postprocessing going on, it converts commands to Unicode, concatenates fields, and puts everything into an object. No conversion to CSL though, but it's not an AST and I understand if this is too much pre-processing.

{
  type: String,
  label: String,
  properties: {
    field: "value" // note: this is verbatim, except command -> Unicode
  }
}

retorquere commented 4 years ago

I wanted to leave value parsing to a different part of the parser (namely, the mapping) as the mapping is used for Bib.TXT as well, which I assume has names in both formats, dates in EDTF, verbatim fields and everything else basically. And this parser would be specifically for everything that is BibLaTeX/BibTeX except the values, if that makes sense.

Verbatim fields can't be parsed outside the grammar, because verbatim fields have a different parsing mode; the grammar has to know about them. EDTF dates can indeed be done later, but sentence casing and name parsing must be done at the AST level:

It's not just the speed difference. The BBT parser (and biblatex-csl-converter) keep the intended meaning structurally better than others in the list.

retorquere commented 4 years ago

It seems to be specifically when a user-defined string with the aforementioned specific forms of diacritics are used in a field (it works fine if they're not used, or if the diacritic is in the field itself like the test case you made). Here's proof I'm not crazy :)

This is now fixed.

retorquere commented 4 years ago

(I tried parsing syntax.bib, but it'd require changes to the astrocite parser, and it isn't valid bib(la)tex it seems; overleaf chokes on it at least)

larsgw commented 4 years ago

Verbatim fields can't be parsed outside the grammar, because verbatim fields have a different parsing mode; the grammar has to know about them.

Right, I mixed that up. It's in #3 (checkbox 3).

sentence casing and name parsing must be done at the AST level

Braces in values are kept for that reason (except around some diacritic commands, as D<span class="nocase">é</span>coret is a bit over the top for {\' e}, in my opinion). Bib.TXT needs those too for authors and casing, so that's dealt with in the mapping for the moment.

It's not just the speed difference. The BBT parser (and biblatex-csl-converter) keep the intended meaning structurally better than others in the list.

True, but I think this is well enough for my intended purposes. I can try to make a switch for it to return an AST, given the structure of the parser I think that should be very possible. Anyway, I updated the README to mention the AST capabilities.

retorquere commented 4 years ago

Braces in values are kept for that reason (except around some diacritic commands, as D<span class="nocase">é</span>coret is a bit over the top for {\' e}, in my opinion).

Sure. But it's more complicated than that; the braces usually, but not always, mean nocase. See https://retorque.re/zotero-better-bibtex/support/faq/#why-the-double-braces for some examples and links to details. And then there's still the point that lists (literal lists and names) can only be properly distinguished at the grammar level. nocase isn't appropriate there.

Bib.TXT needs those too for authors and casing, so that's dealt with in the mapping for the moment.

I don't know what Bib.TXT is BTW.

True, but I think this is well enough for my intended purposes.

Can't argue that of course, but then "complete" doesn't mean a whole lot. But at the least, footnote 5 has been fixed now, unless there's more diacritics I missed.

I can try to make a switch for it to return an AST, given the structure of the parser I think that should be very possible. Anyway, I updated the README to mention the AST capabilities.

Cool. BTW if name parsing and the meaning of braces (nocase or not) happens inside the parser, and the parser also converts markup (such as superscript, emph, etc), an AST may not be required. But I found it easier to do those by transforming the AST; that's actually what the BBT parser adds to the astrocite parser. The actual grammar is just that of astrocite, although I did add changes to the astrocite parser to be able to parse my test suite.

My parser also adds a simple form of error recovery BTW. The astrocite parser is an all-or-nothing parser. The BBT parser will parse entries one by one and give some info on entries that failed to parse.

larsgw commented 4 years ago

I don't know what Bib.TXT is BTW.

Sorry, Bib.TXT is just a reskin of BibTeX. I don't think it has gotten much use, but the premise is that it supports Unicode and presents the key/value pairs in a different way, but the values, in theory, stay the same. I say that now, and that's how I implemented it, but I don't really have a way of knowing; that website is my only point of reference.

Anyway, if the values stay the same there are basically two ways of presenting the values, Bib(La)TeX and Bib.TXT. My parser only makes level ground for the two, the rest is in the mapping, i.e. the Bib(La)TeX/Bib.TXT to CSL mapping.

The BBT parser will parse entries one by one and give some info on entries that failed to parse.

I had something like that in my previous parser, I'll see how I can fit that in in this one. I guess braces still have to be paired for your one?

retorquere commented 4 years ago

Sorry, Bib.TXT is just a reskin of BibTeX. I don't think it has gotten much use, but the premise is that it supports Unicode and presents the key/value pairs in a different way, but the values, in theory, stay the same. I say that now, and that's how I implemented it, but I don't really have a way of knowing; that website is my only point of reference.

I mean... if you're leaning that way, wouldn't TOML or YAML make more sense? At least the more naive parsers (which can sometimes be useful) become trivial.

Anyway, if the values stay the same there are basically two ways of presenting the values, Bib(La)TeX and Bib.TXT. My parser only makes level ground for the two, the rest is in the mapping, i.e. the Bib(La)TeX/Bib.TXT to CSL mapping.

It may be that we see the meaning of "values" differently. For a title, HTML markup will mostly do, as long as the actual intent (which is, as noted, non-trivial) comes through. But name-lists and literal-lists are not strings, they're lists of strings, and you can't safely deduce where they're to be broken into parts without passing on the structure.
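A minimal sketch of why the brace structure matters for lists, assuming the field's outer delimiters have already been stripped: the separator `and` only counts at brace depth zero. `splitList` is a hypothetical helper written for this example, not the BBT implementation, and it ignores escaped braces and case variants of `and`.

```javascript
// Hypothetical helper: split a list field on the word "and", but only outside braces.
// Assumes `value` is the field content with its outer delimiters already removed.
function splitList (value) {
  const separator = /\s+and\s+/y // sticky: only matches exactly at lastIndex
  const items = []
  let depth = 0
  let current = ''
  let i = 0
  while (i < value.length) {
    const ch = value[i]
    if (ch === '{') depth++
    else if (ch === '}') depth--
    separator.lastIndex = i
    const match = depth === 0 ? separator.exec(value) : null
    if (match) {
      // A top-level " and " ends the current item.
      items.push(current)
      current = ''
      i += match[0].length
    } else {
      current += ch
      i++
    }
  }
  items.push(current)
  return items
}

console.log(splitList('Bausch and Lomb'))   // two items: 'Bausch' and 'Lomb'
console.log(splitList('{Bausch and Lomb}')) // one item: '{Bausch and Lomb}'
```

Once the value has been flattened to a plain string with the braces dropped, the two inputs become indistinguishable, which is the problem described above.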

I had something like that in my previous parser, I'll see how I can fit that in in this one. I guess braces still have to be paired for your one?

An unclosed open brace will consume all the input after it, yes, but all other errors (also unexpected closing braces) will skip ahead to the first @ it can find and attempt reparsing from that point on, repeatedly, until all input is parsed or consumed this way. So it will generally report and skip the smallest error it can, with the worst case being a single unpaired open brace.
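The skip-ahead strategy can be sketched as below. This is a simplified illustration under stated assumptions, not the BBT code: `parseWithRecovery` and `toyParseEntry` are hypothetical names, and the toy entry parser only checks for `@type{...}` with balanced braces. It also does not reproduce the unclosed-brace behaviour described above, since it simply retries at the next `@` after any failure.

```javascript
// Sketch: on a parse error, record it, skip to the next "@" and try again.
// `parseEntry(input, start)` must return { entry, end } or throw.
function parseWithRecovery (input, parseEntry) {
  const entries = []
  const errors = []
  let pos = input.indexOf('@')
  while (pos !== -1) {
    try {
      const { entry, end } = parseEntry(input, pos)
      entries.push(entry)
      pos = input.indexOf('@', end)
    } catch (err) {
      errors.push({ offset: pos, message: err.message })
      pos = input.indexOf('@', pos + 1) // resume at the next candidate entry
    }
  }
  return { entries, errors }
}

// Toy parser for the demo only: accepts `@type{...}` with balanced braces.
function toyParseEntry (input, start) {
  const headPattern = /@(\w+)\s*\{/y
  headPattern.lastIndex = start
  const head = headPattern.exec(input)
  if (!head) throw new Error('expected "@type{"')
  let depth = 1
  let i = headPattern.lastIndex
  for (; i < input.length && depth > 0; i++) {
    if (input[i] === '{') depth++
    else if (input[i] === '}') depth--
  }
  if (depth > 0) throw new Error('unclosed brace')
  return { entry: { type: head[1], raw: input.slice(start, i) }, end: i }
}

const input = '@book{a, title = {A}}\n@article b}\n@misc{c, title = {C}}'
const result = parseWithRecovery(input, toyParseEntry)
console.log(result.entries.map(e => e.type)) // [ 'book', 'misc' ]
console.log(result.errors.length) // 1
```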

larsgw commented 4 years ago

This mixes tokens (lowercase) and rules (capitalized), but that could be changed as long as there are no naming conflicts.

@book{label,
  title = "{T}est"
}

{
  kind: 'Main',
  loc: {
    start: { offset: 0, line: 1, col: 1 },
    end: { offset: 33, line: 3, col: 2 }
  },
  children: [
    {
      kind: 'Entry',
      loc: {
        start: { offset: 0, line: 1, col: 1 },
        end: { offset: 33, line: 3, col: 2 }
      },
      children: [
        {
          kind: 'at',
          loc: {
            start: { offset: 0, line: 1, col: 1 },
            end: { offset: 1, line: 1, col: 2 }
          },
          value: '@'
        },
        {
          kind: 'dataEntryType',
          loc: {
            start: { offset: 1, line: 1, col: 2 },
            end: { offset: 5, line: 1, col: 6 }
          },
          value: 'book'
        },
        {
          kind: 'lbrace',
          loc: {
            start: { offset: 5, line: 1, col: 6 },
            end: { offset: 6, line: 1, col: 7 }
          },
          value: '{'
        },
        {
          kind: 'label',
          loc: {
            start: { offset: 6, line: 1, col: 7 },
            end: { offset: 11, line: 1, col: 12 }
          },
          value: 'label'
        },
        {
          kind: 'comma',
          loc: {
            start: { offset: 11, line: 1, col: 12 },
            end: { offset: 12, line: 1, col: 13 }
          },
          value: ','
        },
        {
          kind: '_',
          loc: {
            start: { offset: 12, line: 1, col: 13 },
            end: { offset: 15, line: 2, col: 2 }
          },
          children: [
            {
              kind: 'whitespace',
              loc: {
                start: { offset: 12, line: 1, col: 13 },
                end: { offset: 15, line: 2, col: 2 }
              },
              value: '\n  '
            }
          ],
          value: undefined
        },
        {
          kind: 'EntryBody',
          loc: {
            start: { offset: 15, line: 2, col: 3 },
            end: { offset: 32, line: 3, col: 0 }
          },
          children: [
            {
              kind: 'Field',
              loc: {
                start: { offset: 15, line: 2, col: 3 },
                end: { offset: 32, line: 3, col: 0 }
              },
              children: [
                {
                  kind: 'identifier',
                  loc: {
                    start: { offset: 15, line: 2, col: 3 },
                    end: { offset: 20, line: 2, col: 8 }
                  },
                  value: 'title'
                },
                {
                  kind: '_',
                  loc: {
                    start: { offset: 20, line: 2, col: 8 },
                    end: { offset: 21, line: 2, col: 9 }
                  },
                  children: [
                    {
                      kind: 'whitespace',
                      loc: {
                        start: { offset: 20, line: 2, col: 8 },
                        end: { offset: 21, line: 2, col: 9 }
                      },
                      value: ' '
                    }
                  ],
                  value: undefined
                },
                {
                  kind: 'equals',
                  loc: {
                    start: { offset: 21, line: 2, col: 9 },
                    end: { offset: 22, line: 2, col: 10 }
                  },
                  value: '='
                },
                {
                  kind: '_',
                  loc: {
                    start: { offset: 22, line: 2, col: 10 },
                    end: { offset: 23, line: 2, col: 11 }
                  },
                  children: [
                    {
                      kind: 'whitespace',
                      loc: {
                        start: { offset: 22, line: 2, col: 10 },
                        end: { offset: 23, line: 2, col: 11 }
                      },
                      value: ' '
                    }
                  ],
                  value: undefined
                },
                {
                  kind: 'Expression',
                  loc: {
                    start: { offset: 23, line: 2, col: 11 },
                    end: { offset: 32, line: 3, col: 0 }
                  },
                  children: [
                    {
                      kind: 'ExpressionPart',
                      loc: {
                        start: { offset: 23, line: 2, col: 11 },
                        end: { offset: 31, line: 2, col: 19 }
                      },
                      children: [
                        {
                          kind: 'QuoteString',
                          loc: {
                            start: { offset: 23, line: 2, col: 11 },
                            end: { offset: 31, line: 2, col: 19 }
                          },
                          children: [
                            {
                              kind: 'quote',
                              loc: {
                                start: { offset: 23, line: 2, col: 11 },
                                end: { offset: 24, line: 2, col: 12 }
                              },
                              value: '"'
                            },
                            {
                              kind: 'Text',
                              loc: {
                                start: { offset: 24, line: 2, col: 12 },
                                end: { offset: 27, line: 2, col: 15 }
                              },
                              children: [
                                {
                                  kind: 'BracketString',
                                  loc: {
                                    start: { offset: 24, line: 2, col: 12 },
                                    end: { offset: 27, line: 2, col: 15 }
                                  },
                                  children: [
                                    {
                                      kind: 'lbrace',
                                      loc: {
                                        start: {
                                          offset: 24,
                                          line: 2,
                                          col: 12
                                        },
                                        end: {
                                          offset: 25,
                                          line: 2,
                                          col: 13
                                        }
                                      },
                                      value: '{'
                                    },
                                    {
                                      kind: 'Text',
                                      loc: {
                                        start: {
                                          offset: 25,
                                          line: 2,
                                          col: 13
                                        },
                                        end: {
                                          offset: 26,
                                          line: 2,
                                          col: 14
                                        }
                                      },
                                      children: [
                                        {
                                          kind: 'text',
                                          loc: {
                                            start: {
                                              offset: 25,
                                              line: 2,
                                              col: 13
                                            },
                                            end: {
                                              offset: 26,
                                              line: 2,
                                              col: 14
                                            }
                                          },
                                          value: 'T'
                                        }
                                      ],
                                      value: 'T'
                                    },
                                    {
                                      kind: 'rbrace',
                                      loc: {
                                        start: {
                                          offset: 26,
                                          line: 2,
                                          col: 14
                                        },
                                        end: {
                                          offset: 27,
                                          line: 2,
                                          col: 15
                                        }
                                      },
                                      value: '}'
                                    }
                                  ],
                                  value: 'T'
                                }
                              ],
                              value: '{T}'
                            },
                            {
                              kind: 'Text',
                              loc: {
                                start: { offset: 27, line: 2, col: 15 },
                                end: { offset: 30, line: 2, col: 18 }
                              },
                              children: [
                                {
                                  kind: 'text',
                                  loc: {
                                    start: { offset: 27, line: 2, col: 15 },
                                    end: { offset: 30, line: 2, col: 18 }
                                  },
                                  value: 'est'
                                }
                              ],
                              value: 'est'
                            },
                            {
                              kind: 'quote',
                              loc: {
                                start: { offset: 30, line: 2, col: 18 },
                                end: { offset: 31, line: 2, col: 19 }
                              },
                              value: '"'
                            }
                          ],
                          value: '{T}est'
                        }
                      ],
                      value: '{T}est'
                    },
                    {
                      kind: '_',
                      loc: {
                        start: { offset: 31, line: 2, col: 19 },
                        end: { offset: 32, line: 3, col: 0 }
                      },
                      children: [
                        {
                          kind: 'whitespace',
                          loc: {
                            start: { offset: 31, line: 2, col: 19 },
                            end: { offset: 32, line: 3, col: 0 }
                          },
                          value: '\n'
                        }
                      ],
                      value: undefined
                    }
                  ],
                  value: '{T}est'
                }
              ],
              value: [ 'title', '{T}est' ]
            }
          ],
          value: { title: '{T}est' }
        },
        {
          kind: 'rbrace',
          loc: {
            start: { offset: 32, line: 3, col: 1 },
            end: { offset: 33, line: 3, col: 2 }
          },
          value: '}'
        }
      ],
      value: { type: 'book', label: 'label', properties: { title: '{T}est' } }
    }
  ],
  value: [ { type: 'book', label: 'label', properties: { title: '{T}est' } } ]
}

retorquere commented 4 years ago

What is being mixed? I don't understand? This is the AST produced by the new idea parser?

larsgw commented 4 years ago

This is the AST produced by the new idea parser?

Yes.

What is being mixed?

I'm using a tokenizer (moo) which splits up the text into parts like lbrace and at and text based on where it's at in the file. Then, the rules are defined based on those tokens instead of individual characters, which helped a lot with the performance on abstracts for example. However, the AST has both rules (as branches) and tokens (as leaves) with no real distinction except their name and their position in the tree.

retorquere commented 4 years ago

I see. But as far as I can tell, the tokens should be easy enough to filter out, and that should leave a fairly clean nested AST, which I could then inspect and transform.
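A rough sketch of that filtering, assuming the node shape shown in the AST dump earlier in this thread (`kind`, `value`, optional `children`) and assuming that rule names are capitalized while token names (and `_`) are not. `isRule` and `pruneTokens` are hypothetical names, and `loc` is left out of the demo input for brevity.

```javascript
// Assumption: rule nodes have capitalized `kind` names; tokens and `_` do not.
function isRule (node) {
  return /^[A-Z]/.test(node.kind)
}

// Return a copy of the tree that keeps only rule nodes; each rule keeps its `value`.
function pruneTokens (node) {
  const children = (node.children || []).filter(isRule).map(pruneTokens)
  return { kind: node.kind, value: node.value, children }
}

// The QuoteString subtree for "{T}est", abbreviated from the dump above.
const ast = {
  kind: 'QuoteString',
  value: '{T}est',
  children: [
    { kind: 'quote', value: '"' },
    {
      kind: 'Text',
      value: '{T}',
      children: [
        {
          kind: 'BracketString',
          value: 'T',
          children: [
            { kind: 'lbrace', value: '{' },
            { kind: 'Text', value: 'T', children: [{ kind: 'text', value: 'T' }] },
            { kind: 'rbrace', value: '}' }
          ]
        }
      ]
    },
    { kind: 'Text', value: 'est', children: [{ kind: 'text', value: 'est' }] },
    { kind: 'quote', value: '"' }
  ]
}

const pruned = pruneTokens(ast)
console.log(JSON.stringify(pruned, null, 2))
```

After pruning, the `BracketString` node is still visible as a child of the first `Text` node, so a later transform can still tell braced from unbraced text.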

Can I play with this? I am curious what {Bausch and Lomb} and {{Bausch and Lomb}} would return. From what I see above I suspect I'd get something where I can see the difference between these two ands.

How would I add test cases to the idea parser? First thing is I'd be curious to see if my existing tests parse at all.

Error recovery is separate in my parser BTW. If it can be built into the idea parser it will almost certainly be faster, but if not, I could just keep my existing one; the error recovery works by chunking the input into individual entries/strings/comments, then parsing these individually with the astrocite parser, then reassembling the results (among other things by replacing references to @strings with the AST of those @strings).

larsgw commented 4 years ago

Can I play with this? I am curious what {Bausch and Lomb} and {{Bausch and Lomb}} would return. From what I see above I suspect I'd get something where I can see the difference between these two ands.

I'll push the changes to ast.

How would I add test cases to the idea parser? First thing is I'd be curious to see if my existing tests parse at all.

In principle just by adding files to the test/files/ directory. I updated the test suite (in the ast branch) so it works for a single parser and numerous files instead of many parsers and a few files. You can run npm test to run the parser on every file in test/files/. You can also run

node test/ast.js test/files/single.bib

to get a single file's AST output. Note that those can be pretty long, longer than my terminal's scrollback anyway. For the sake of brevity, the updated test suite only prints success on success.

retorquere commented 4 years ago

if I do

node test/ast.js test/files/syntax.bib 

I get

[
  {
    "type": "book",
    "label": "sweig42",
    "properties": {
      "author": "Stefan Swe{\\i}g and Xavier D\\'ecoret",
      "title": " The {impossible} ℡—book ",
      "publisher": " D\\\"ead Poₑeet Society",
      "year": 1942,
      "month": "03"
    }
  }
]

which isn't what I expected. Should this have been the AST?

larsgw commented 4 years ago

Did you run npm run babel to update lib/ first?

retorquere commented 4 years ago

Right, now it gives me the AST.

retorquere commented 4 years ago

It parses most of my test suite files, with these exceptions:

../bibtex/tests/better-bibtex/export/Really Big whopping library.bib
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

../bibtex/tests/better-bibtex/import/Async import, large library #720.bib
Error: invalid syntax at line 64197 col 1:

  @inproceedings{Mills2012a,
  ^

../bibtex/tests/better-bibtex/import/Endnote should parse.bib
SyntaxError: expected "comma", got "label" at line 3 col 11:

      author =
            ^ (Main->Entry)

../bibtex/tests/better-bibtex/import/Import Jabref fileDirectory, unexpected reference type #1058.bib
SyntaxError: expected "comma", got "label" at line 33 col 23:

  @Comment{jabref-meta: databaseType:bibtex;}
                        ^ (Main->Entry)

../bibtex/tests/better-bibtex/import/Jabref groups import does not work #717.3.8.bib
SyntaxError: expected "comma", got "label" at line 36 col 23:

  @Comment{jabref-meta: databaseType:bibtex;}
                        ^ (Main->Entry)

../bibtex/tests/better-bibtex/import/Maintain the JabRef group and subgroup structure when importing a BibTeX db #97.bib
Error: invalid syntax at line 9242 col 52:

  results for Z~S/CUZO and 7.nO/Cu20 heterojunctions.},
                                                     ^
../bibtex/tests/better-bibtex/import/Some bibtex entries quietly discarded on import from bib file #873.bib
SyntaxError: expected "lbrace", got "label" at line 1954 col 10:

  @Comment Len
           ^ (Main->Entry)

Cleanup of the AST will be a bit of work; I'll take a look over the weekend.

larsgw commented 4 years ago

I'll look at the test results this weekend as well.

retorquere commented 4 years ago

One other thing that the chunker adds is optional async BTW. It's not really "background" async, but it will yield to the event loop after every chunk which allows other tasks to interleave with parsing.
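A minimal sketch of that kind of cooperative async, assuming the input has already been split into chunks. `parseChunksAsync` and `parseChunk` are hypothetical names, and `setTimeout(..., 0)` is just one way to hand control back to the event loop.

```javascript
// Sketch: parse chunk by chunk and yield to the event loop after each one.
// `parseChunk` is a hypothetical synchronous parser for one entry/string/comment.
async function parseChunksAsync (chunks, parseChunk) {
  const results = []
  for (const chunk of chunks) {
    results.push(parseChunk(chunk))
    // Let timers, I/O callbacks and other queued work run before the next chunk.
    await new Promise(resolve => setTimeout(resolve, 0))
  }
  return results
}

// Demo with a stand-in "parser" that only measures each chunk.
parseChunksAsync(['@a{x}', '@b{yy}'], chunk => chunk.length)
  .then(lengths => console.log(lengths)) // [ 5, 6 ]
```

The total work is unchanged; the parse just no longer blocks everything else for its full duration.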

retorquere commented 4 years ago

Oh and wrt verbatim fields, mendeley gets this wrong for eg file fields so my parser has an option to choose whether file fields are verbatim or not.

At one time, endnote also exported items without citation keys. There's a ton of real-life crap in my test suite - just because it parses doesn't necessarily mean the meaning is extracted properly.

retorquere commented 4 years ago

BTW, I've put together a quicky test runner based on benchmark.js and the numbers shift a little; some better, some worse: https://gist.github.com/retorquere/79fb0ad7062a85a1d83e4b004d40985e

larsgw commented 4 years ago

Oh and wrt verbatim fields, mendeley gets this wrong for eg file fields so my parser has an option to choose whether file fields are verbatim or not.

Good idea, I'll make them configurable when I implement it.

BTW, I've put together a quicky test runner based on benchmark.js and the numbers shift a little; some better, some worse: https://gist.github.com/retorquere/79fb0ad7062a85a1d83e4b004d40985e

Cool! I'll add the figures (and/or the test suite) to the repo.

retorquere commented 4 years ago

Another thing (just added to @retorquere/bibtex-parser): only "engl-ish" (english and some variants like usenglish, american, etc) should be sentence cased on import.

retorquere commented 4 years ago

The BBT parser has been updated -- {\emph same} wasn't recognized properly. I didn't think anyone would ever use this, but it's in my test suite. It behaves differently from {\it same} BTW. {\it same} italicizes same, {\emph same} italicizes just the s.

On two of my test files, astrocite at least runs out of memory, whereas my parser parses them correctly (if slowly; they're 8.2 MB and 11 MB respectively).

retorquere commented 4 years ago

Does the citation-js parser handle verbatim fields (like url and file) and verbatim commands (\url, \href, probably others)?

A few things recently fixed in the BBT parser that citation-js may not yet be aware of:

  1. $\frac n 2 + 5$ is valid, and equivalent to $\frac{n}{2} + 5$
  2. < and > mean different things depending on whether you're parsing in math mode or text mode.
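The first point can be sketched as follows: an argument is either a braced group or, failing that, the next single token. This is a simplified illustration, not the BBT grammar; `readArgument` and `readFrac` are hypothetical helpers, and control sequences as arguments (such as `\alpha`) are not handled.

```javascript
// Sketch: read one TeX-style argument starting at `pos`.
// Returns the argument text and the index just past it.
function readArgument (input, pos) {
  while (input[pos] === ' ') pos++ // spaces before an argument are skipped
  if (input[pos] === '{') {
    let depth = 0
    let end = pos
    do {
      if (input[end] === '{') depth++
      else if (input[end] === '}') depth--
      end++
    } while (depth > 0 && end < input.length)
    return { arg: input.slice(pos + 1, end - 1), end }
  }
  return { arg: input[pos], end: pos + 1 } // unbraced: a single character
}

// Read the two arguments of the first \frac in `input`.
function readFrac (input) {
  const index = input.indexOf('\\frac')
  if (index === -1) throw new Error('no \\frac found')
  const numerator = readArgument(input, index + '\\frac'.length)
  const denominator = readArgument(input, numerator.end)
  return [numerator.arg, denominator.arg]
}

console.log(readFrac('\\frac n 2 + 5'))   // [ 'n', '2' ]
console.log(readFrac('\\frac{n}{2} + 5')) // [ 'n', '2' ]
```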

BBT has its own AST parser now, which is based on a version of astrocite grammar but has seen substantial (and incompatible) changes since.

It still seems strange to me to label parsers "complete" merely because they don't crash. Name parsing, verbatim fields, title-sentence casing, command-argument handling are all crucial parts of parsing bibtex. I'd wager that none of the "complete" parsers will parse {Bausch and Lomb} vs {{Bausch and Lomb}} correctly, or handle $\frac n 2 + 5$ properly.

retorquere commented 4 years ago

Nice on the updated tests! BBT 3.1.20 fixes all non-gimmick tests and some gimmick tests.

What do you think the state of idea-reworked is now? Given how fast it is I may want to build on it, but I'd need to be able to pass my own test suite.

larsgw commented 4 years ago

The main part missing from idea-reworked right now is the actual mapping to CSL or other output formats too. That includes field information as well, such as url and verbatim fields and automatic recognition of list fields. And for that, I need some distinction between natbib and biblatex, as they have minor differences in syntax. Note: I am aware that a lot of this is minor edge cases (apart from the field information).

I have been working on mappings over at the aptly named bibtex-mappings, I don't remember if I linked it before. The repository contains some data text-mined from documentation (the biblatex docs are especially usable for this) to be combined with hand-crafted mappings.

#3 is still pretty up-to-date, I have been mainly focused on fixing the test suites and README, and a workaround for the command concatenation gimmick. I'm trying to fully get back into it and sift through the issues and comments soon.

retorquere commented 4 years ago

I understand why you'd want mapping to other objects, but I just want the parsed object (pretty much what _intoFixtureOutput delivers), and I'll take it from there, as I'm targeting specifically conversion to Zotero objects.

The command concatenation gimmick would be pretty difficult to address in my parser, but to me that wouldn't be any kind of priority. It's interesting to see that your parser can deal with it successfully, but it's not something I expect to see in the wild.

#3 still has a long list of stuff I absolutely need in the todo list, so I'd have to wait on that. I'm subscribed to the issue, but I won't be notified of edits, just new comments.

larsgw commented 3 years ago

Do you happen to have some documentation for the extended name format? I am working on name parsing now and I did find 3.6 Data Annotations in the BibLaTeX manual but that's slightly different from how the feature fixture you added works.

retorquere commented 3 years ago

I don't have docs handy, no, and maybe I misunderstood it when I built it. What difference do you see?

larsgw commented 3 years ago

Apparently, what you have works but I have not found it in the manual yet. I did find this, on page 82 in http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf:

@MISC{ann1,
    AUTHOR = {Last1, First1 and Last2, First2 and Last3, First3},
    AUTHOR+an = {1:family=student;2=corresponding}
}

But the name-parts are not overwritten by the annotation.

larsgw commented 3 years ago

There's an example of what you implemented here: https://github.com/plk/biblatex/blob/dev/doc/latex/biblatex/examples/93-nameparts.tex

retorquere commented 3 years ago

But the name-parts are not overwritten by the annotation.

Looking at the docs, I don't think they're meant to overwrite name-parts? They add annotations to the specific name-parts, and those annotations can be used in specialized styles; I've only seen it used in annotated bibliographies myself.

larsgw commented 3 years ago

I updated the feature fixtures to include all the name parts instead of just the last name, and on one I encountered unexpected \u0004 characters in BBT's output. They seem to come from https://github.com/retorquere/bibtex-parser/blob/f41af75fd9350507279b42078d07de1187699455/index.ts#L63-L67; what is that used for specifically? Should it still be there in the output?

larsgw commented 3 years ago

Looking at the docs, I don't think they're meant to overwrite name-parts? They add annotations to the specific name-parts, and those annotations can be used in specialized styles; I've only seen it used in annotated bibliographies myself.

I think you're right, still a bit confused about the annotation in the example though. Why would someone annotate specifically the family part of the name with "student"?

retorquere commented 3 years ago

I can't say with certainty, but this looks to me like a synthetic sample meant to show what's possible with annotations, more than an actual sample from an actual annotated bibliography.

retorquere commented 3 years ago

Those 0004 chars should not be in the output, I'll look into that.

larsgw commented 3 years ago

If it helps, I saw it when there were braces in explicit name part values in the extended name format:

@article{test,
  author = {family=Duchamp, given=Philippe, given-i={Ph}}
}

retorquere commented 3 years ago

Thanks, that is fixed in the latest release.

retorquere commented 3 years ago

I'm also tinkering with chevrotain to remove a pass from my parser.

larsgw commented 3 years ago

Cool! I think I might have heard of chevrotain before but I do not recognize the website... the uppercase function names seem familiar though.

retorquere commented 3 years ago

https://github.com/SAP/chevrotain

retorquere commented 3 years ago

I've tried chevrotain, but if your test results are anything to go by, your parser is 2-3 times faster than my lexer alone. I can't replicate your results because npm install fails for me, but clearly I should be looking to use your parser for speed. What's the current state of things? I see that moo is only used "for now", do you intend to remove that dependency?