citation-js / bibtex-parser-experiments

Experiments to determine a new BibTeX parser formula for Citation.js -- to be applied to other formats as well
https://travis-ci.com/citation-js/bibtex-parser-experiments/builds
MIT License

Argument commands #17

Open larsgw opened 4 years ago

larsgw commented 4 years ago

The Citation.js parser needs to support more kinds of commands, mostly commands that take arguments. Arguments seem to always be treated the same way: a command takes either a braced block or the first character of the following text. Math blocks are the exception: \url takes the dollar sign verbatim while \emph does not.
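The "braced block or first character" rule can be sketched as follows. This is a minimal illustration, not the Citation.js implementation: `readArgument` is a hypothetical helper, and it ignores the case where an unbraced argument is itself a command.

```javascript
// Hypothetical helper, not part of Citation.js: reads one command argument
// starting at `pos` and returns its text plus the index just past it.
function readArgument (input, pos) {
  // Skip whitespace between the command and its argument
  while (pos < input.length && /\s/.test(input[pos])) pos++

  if (input[pos] !== '{') {
    // No braces: the argument is the next single character
    return { value: input[pos] ?? '', end: Math.min(pos + 1, input.length) }
  }

  // Braced block: scan to the matching closing brace, tracking nesting
  let depth = 0
  for (let i = pos; i < input.length; i++) {
    if (input[i] === '\\') { i++; continue } // skip escaped characters such as \}
    if (input[i] === '{') depth++
    if (input[i] === '}' && --depth === 0) {
      return { value: input.slice(pos + 1, i), end: i + 1 }
    }
  }
  throw new SyntaxError('Unbalanced braces in argument')
}
```

With this sketch, `readArgument('{abc} def', 0)` yields the whole braced block, while `readArgument(' abc', 0)` yields only the first character after the whitespace.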

retorquere commented 4 years ago

That's more a matter of whether a command parses its argument in verbatim mode: \url expects one argument and parses it in verbatim mode; \href expects two arguments, parsing the first verbatim and the second normally. \begin{verbatim} ... \end{verbatim} parses everything in that environment verbatim. \verb parses everything until the end of the block it's in verbatim.

There's simply no math in verbatim environments, because the $ is just a character there.
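One way to picture the per-argument distinction is a table that records a mode for each argument of a command. The sketch below is an assumption for illustration only: the `argumentModes` table, the function names, and the `<math>` marker used as a stand-in for real math handling are all hypothetical.

```javascript
// Hypothetical table: one parsing mode per argument of each command
const argumentModes = {
  url: ['verbatim'],
  href: ['verbatim', 'normal'],
  emph: ['normal']
}

// Toy rendering of a single raw argument
function renderArgument (raw, mode) {
  // In verbatim mode `$` is just a character, so the text passes through
  if (mode === 'verbatim') return raw
  // In normal mode `$...$` delimits math; wrapping it in a marker is a placeholder
  return raw.replace(/\$([^$]*)\$/g, (_, math) => `<math>${math}</math>`)
}

function renderArguments (command, rawArgs) {
  const modes = argumentModes[command] ?? []
  // Arguments without a listed mode fall back to normal mode
  return rawArgs.map((raw, i) => renderArgument(raw, modes[i] ?? 'normal'))
}
```

Here `renderArguments('href', ['a$b$', '$c$'])` leaves the first argument untouched and only treats the dollar signs in the second one as math delimiters.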

larsgw commented 4 years ago

That's a bit annoying; I was planning to do something like the following:

// constants.js
export const argumentCommands = {
  href (url, text) { return text === url ? text : `${text} (${url})` }
}

// value.js (grammar)
const grammar = new Grammar({
  // ...

  Command () {
    const command = this.consumeToken('command').value

    if (command in constants.argumentCommands) {
      const func = constants.argumentCommands[command]
      const args = []
      let arity = func.length // fun thing: Function.length is the number of declared parameters

      while (arity-- > 0) {
        this.consumeToken('whitespace', /* optional: */ true)
        args.push(this.consumeRule('Argument'))
      }

      return func(...args)
    } // else...
  },

  // ...
})

retorquere commented 3 years ago

If you keep the original input text attached to the tokens while tokenizing, it's possible to decide during this phase how you want to handle the input. Basically, in normal mode you process the tokens according to their semantic meaning, and in verbatim mode you take the original text attached to the tokens and string it together.

Don't forget that commands can have arguments in square brackets. I simply ignore them, but for that I do have to parse them.
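A rough sketch of that idea, assuming a hypothetical token shape of `{ type, raw }` where `raw` is the exact source text: verbatim mode ignores the token types and joins the `raw` fields, and optional square-bracket arguments are parsed only so they can be skipped. None of these names come from either parser.

```javascript
// Every character of the input ends up in exactly one token, so joining
// the `raw` fields reproduces the source text.
const tokenPattern = /\\[a-zA-Z]+|\\[^a-zA-Z]?|[{}[\]$]|\s+|[^\\{}[\]$\s]+/g

function classify (raw) {
  if (raw[0] === '\\') return 'command'
  if (raw === '$') return 'mathShift'
  if (raw === '{' || raw === '}') return 'brace'
  if (raw === '[' || raw === ']') return 'bracket'
  return /^\s+$/.test(raw) ? 'whitespace' : 'text'
}

function tokenize (input) {
  return (input.match(tokenPattern) ?? []).map(raw => ({ type: classify(raw), raw }))
}

// Verbatim mode: ignore token types and string the original text together
function verbatim (tokens) {
  return tokens.map(token => token.raw).join('')
}

// Optional argument: if a `[` starts at index i, return the index just past
// the next `]` (nested brackets are not handled here); otherwise return i
function skipOptionalArgument (tokens, i) {
  if (tokens[i]?.raw !== '[') return i
  const close = tokens.findIndex((token, j) => j > i && token.raw === ']')
  if (close === -1) throw new SyntaxError('Unclosed optional argument')
  return close + 1
}
```

For example, tokenizing `\url{a_$b}` produces a `mathShift` token for the dollar sign, but `verbatim` on the tokens between the braces still gives back `a_$b` unchanged.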

larsgw commented 3 years ago

I think I might just let the command functions be called as if they're rules in the grammar, i.e. they can decide themselves how to parse their arguments. Perhaps a bit similar to what you're doing, based on what I saw. It feels a bit weird to make it that customisable but I don't think it can lead to code injection or the like.
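Calling command functions as if they were grammar rules could look roughly like the sketch below. `MiniGrammar` is a hypothetical stand-in written for this example; only the method names `consumeToken` and `Command` echo the snippet earlier in the thread, and the hand-built token list with `value` and `raw` fields is an assumption.

```javascript
class MiniGrammar {
  constructor (tokens, commands) {
    this.tokens = tokens
    this.index = 0
    this.commands = commands
  }

  consumeToken (type, optional = false) {
    const token = this.tokens[this.index]
    if (token && token.type === type) {
      this.index++
      return token
    }
    if (optional) return undefined
    throw new SyntaxError(`Expected ${type}`)
  }

  // Braced argument: parsed values in normal mode, original text in verbatim mode
  Argument ({ verbatim = false } = {}) {
    this.consumeToken('lbrace')
    let out = ''
    while (this.tokens[this.index] && this.tokens[this.index].type !== 'rbrace') {
      const token = this.tokens[this.index++]
      out += verbatim ? token.raw : token.value
    }
    this.consumeToken('rbrace')
    return out
  }

  Command () {
    const name = this.consumeToken('command').value
    const command = this.commands[name]
    if (!command) throw new SyntaxError(`Unknown command ${name}`)
    // The command runs like a rule: `this` is the grammar, so it decides
    // itself how each of its arguments is parsed
    return command.call(this)
  }
}

const commands = {
  href () {
    const url = this.Argument({ verbatim: true })
    this.consumeToken('whitespace', /* optional: */ true)
    const text = this.Argument()
    return text === url ? text : `${text} (${url})`
  }
}

// Hand-built tokens for \href{https://example.org/~user}{home~page};
// `~` is a non-breaking space in normal mode and a plain tilde in verbatim mode
const tie = { type: 'text', value: '\u00A0', raw: '~' }
const text = raw => ({ type: 'text', value: raw, raw })
const tokens = [
  { type: 'command', value: 'href', raw: '\\href' },
  { type: 'lbrace', value: '', raw: '{' },
  text('https://example.org/'), tie, text('user'),
  { type: 'rbrace', value: '', raw: '}' },
  { type: 'lbrace', value: '', raw: '{' },
  text('home'), tie, text('page'),
  { type: 'rbrace', value: '', raw: '}' }
]

const result = new MiniGrammar(tokens, commands).Command()
```

Because command functions only receive the grammar object and return a string, this design stays within what the grammar already exposes, which matches the expectation that the extra flexibility does not open the door to code injection.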

By the way, I am working on a prototype plugin for @citation-js/plugin-bibtex that extends unicode support with your unicode2latex tables. I don't really want to put an additional 400KB in the default browser bundle so I think an optional plugin to the plugin could work well. I am still working out how to add things like {\\'{}I} but that might be helped by the changes mentioned above.

retorquere commented 3 years ago

From my pov you're making astounding progress.