std/jsonc seems to strip comments

bjesuiter commented 1 year ago

Describe the bug

esm.sh is currently unable to update an existing deno.jsonc file. See: https://github.com/esm-dev/esm.sh/issues/686. While implementing, it came to light, that the jsonc parser in the deno std lib seems to discard comments. (see this issue, comment by farsightsoftware: https://github.com/esm-dev/esm.sh/issues/687)

I've not checked it myself yet, but I asked farsightsoftware to add an example to this issue.

The main usecase for this std lib is currently updating the deno.jsonc file, so it would make it a lot easier for all tools dealing with the deno.jsonc file if it would preserve comments automatically.

Steps to Reproduce

TODO

Expected behavior

The jsonc parser output is able to be stringified again and produce identical output to the original jsonc input.

In diagram:

jsonc file
=> jsonc parser
=> some intermediate representation
=> jsonc stringifier
=> a file identical to the original jsonc file (especially with comments)

Environment

TODO

farsightsoftware commented 1 year ago

@bjesuiter I'm not sure what you mean by an example?

From https://github.com/esm-dev/esm.sh/issues/687#issuecomment-1647591989:

does the parser of the std jsonc module discard comments or the stringifier?

The parser. The comment on the *#tokenize method is "Split the JSONC string into token units. Whitespace and comments are skipped." (my emphasis). In fact, as far as I can tell, the standard parser just parses, it doesn't have a stringifier. The parser returns a JavaScript value, not an AST, so there's nothing that could represent a comment node. I haven't gotten deeper into it than that (I'm also short on time, particularly this week). I stopped once I'd found a module that would let us insert things into the JSONC file while preserving comments (though not perhaps as conveniently as one might like).

bjesuiter commented 1 year ago

Thanks @farsightsoftware for the explanation. I wasn't aware that the jsonc package in deno std does not contain a stringifier, but only a parser. I simply assumed, it would also have a stringifier.

After thinking about it, I also realized the problem with returning a JS Object vs. an AST. Maybe this jsonc package was only intended to be used as a means for reading jsonc config files easily.

However, I find it a big roadblock for developers when this lib is not able to reconstruct the file correctly with comments.

I have a conference this week, so I'd add this topic to my list of things to tackle the weeks after. The original issue with esm.sh should be solveable without this.

But I think, the std lib should provide a correct implementation for stringifying jsonc in general, so I'd like to keep this issue open.

mmuddasani commented 1 year ago

Has there been any status update or progress on this bug lately? I find it to be quite an annoying developer experience with ESM not yet doing the right thing for updating my deno.jsonc automagically.

So much so that I would love to design/prototype/contribute a PR if nobody is actively working on this now!

I am crudely thinking of something like a ParseOption to slightly modify the return type of parse (or else to make a new function called e.g. parseTree), such that the returned type includes all details necessary to faithfully stringify the JSON value while reintroducing the comments. (In addition, the returned type could serve as an API for subsequent changes and/or stringification.)

bjesuiter commented 1 year ago

@mmuddasani I like this idea! I don't know though whether deno core team would accept the pr. but we should try!

Step 1: we should look for a standard format for representing parsed jsonc as a concrete syntax tree, so that we avoid creating a new proprietary format.

Step 2: adding this parseTree function you proposed (I find it more clear than adding a parameter to the parse function, since you'd have to use a union return type for the parse function otherwise)

I cannot write it this week, but if you're interested, I'd do a review of the PR! :)

mmuddasani commented 1 year ago

Microsoft's jsonc module that @farsightsoftware found appears incredibly comprehensive. Someone has even ported it as a third-party Deno module: https://deno.land/x/jsonc

It has an MIT license, so I'm wondering if Deno's standard library can leverage the existing implementation, with appropriate copyright attribution.

Alternatively: does parsing JSONC have any reason to belong in the standard library, compared to folks reusing Microsoft's implementation?

If backwards compatibility is important, and if there is reason for Deno to keep support, then merging the two APIs via method overloading seems feasible, despite the existing divergent parse signatures:

Deno: parse(text: string, unnamed 1?: ParseOptions): JsonValue
Microsoft: parse(text: string, errors?: {error: ParseErrorCode;}[], options?: ParseOptions): any;

Aside: Deno's ParseOptions is currently a strict subset of Microsoft's ParseOptions

kt3k commented 1 year ago

@mmuddasani @bjesuiter Thanks for considering jsonc support with retained comments.

Microsoft's jsonc module that @farsightsoftware found appears incredibly comprehensive. Someone has even ported it as a third-party Deno module: https://deno.land/x/jsonc

Microsoft's jsonc module looks too complex to me. It looks exposing too much details about internals, such as scanner, parser, visiter, etc. I don't think the users need to access such lower level primitives. Also that is not aligned to other std modules such as yaml, toml, etc.

If the goal here is to keep the comments, then how about following npm:comment-json? The API looks much simpler than Microsoft's npm:jsonc-parser, and it also seems having enough download counts.

farsightsoftware commented 1 year ago

@kt3k Nice! The lib I found was just the first one I found for the limited purpose I was looking. comment-json looks way better in the general case (and also for that limited purpose). Very clever to use Symbols for keeping track of comments, since JSON doesn't have them.

bjesuiter commented 1 year ago

I agree that the api of comment-json is very nice & simple to understand. (As long as you don't need to dig deeper into the representation). But I agree, using Symbols seems smart for me in this case, since we avoid a complete parseTree.

@kt3k do you agree with "importing" this lib into deno std or do you simply propose usage of this lib?

mmuddasani commented 1 year ago

I imagine the only downside with incorporating the API of npm:comment-json as-is into deno std is the pollution of the global Symbol registry. Symbols are a very neat idea, and I imagine they can still be effectively used without resorting to Symbol.for.

I also imagine it would be re-implemented, to avoid its external dependencies like npm:esprima.

kt3k commented 1 year ago

do you agree with "importing" this lib into deno std or do you simply propose usage of this lib?

Because we have npm: specifier now, we recommend using npm:comment-json of course, but we are also open to including npm:comment-json-like enhancement of std/jsonc module.

If we include features like npm:comment-json in std, we probably would like to extend existing std/jsonc implementation instead of replacing it with ported version of npm:comment-json

cc @ayame113 Do you have any opinion or advise to this issue?

ayame113 commented 1 year ago

I'm in favor of including features like npm:comment-json to std.

I think npm:comment-json's parse and stringify are consistent with the design of the built-in JSON.parse/stringfy, are easy to use, and are worth including in std. I think the majority of use cases where we want to preserve and parse comments is to make changes to a configuration file and write them back, so I think APIs like assign and CommentArray should also be included.

If you decide to add these to std...

If possible, it is better to avoid the parsing results listed in the special cases section of the documentation. When you convert a primitive value to an object, the return value of the typeof operator becomes "object". This can cause bugs when narrowing types in TypeScript.

console.log(typeof "hello world"); //=> string
console.log(typeof new String("hello world")); //=> object

I think there is room for consideration as to how to save comment information using symbols. The following format may make the internal implementation of assign and CommentArray easier.

// object is parsed like...
{
  key: "value",
  [Symbol.for("comments")]: {
    key: {
      before: {...},
      after-prop: {...},
      after-colon: {...},
      after-value: {...},
      after: {...},
    },
  },
};
// array is parsed like...
[
  "foo",
  "bar",
  "baz",
  [Symbol.for("comments")]: [
    {
      before: {...},
      after-value: {...},
      after: {...},
    },
    ...
  ]
];

mmuddasani commented 1 year ago

@ayame113 I am in agreement with both of your suggestions. I would also propose (crudely atm) something as follows:

export const comments_api: unique symbol = Symbol("comments api")

export interface CommentsApi {
  // methods to read/modify/write JsoncValue and consistently maintain comments
  // including order, before/after, maybe whitespace indentation, etc.
}

export type JsoncValue =
  | string
  | number
  | boolean
  | null
  | (
    (
      | JsoncValue[]
      | { [key: string]: JsoncValue | undefined }
    ) & { [comments_api]: CommentsApi }
  )

For fun, I'll be prototyping an API. I'll keep an open eye for additional feedback / suggestions along the way, and for anyone else who wants to make it happen!

If I can't come up with anything more friendly than the assign API, then it may be reasonable enough to model.

mmuddasani commented 1 year ago

I discovered that deno fmt also loses some JSONC comments! I made minimal reproducible examples here. For instance, compare unformatted object with formatted object. This bug is probably for a different Deno codebase; I'm not familiar enough to know which one.

In other news, I've so far been examining the types of comments that npm:comment-json annotates. My intuition tells me that there is room to simplify the variety. The above repo snapshots my current line of thought.

mmuddasani commented 1 year ago

@kt3k @ayame113 Is it possible and/or worthwhile to implement the core business logic in Rust as part of the deno CLI, while somehow having Deno standard library expose a convenient Javascript API?

Pros

Formatting is closely related to the read/modify/write use case
- IIUC, deno fmt uses dprint/dprint-plugin-json for JSONC files
- The latter can benefit from sharing AST logic with the former
- Reduces duplication of business logic (and thereby code maintenance cost)
- The latter can benefit from potential feature additions to the former
- e.g. allow_loose_object_property_names
Consistency enhances developer experience
- For ambiguously written comments, the formatter would either fail, or else make an opinionated choice (and optionally inform/warn user)
- The read/modify/write use case can follow suit
- Example ambiguity (e.g. as an inlined value of an object/array):
- ```
{ /* description of whole object */ "k0": "v0", "k1": "v1" }
```
- ```
{ /* description of upcoming key */ "k0": "v0", "k1": "v1" }
```
- Although a bit contrived, probably not too rare to encounter similar situations

Cons

Does this require permissions like deno run --allow-ffi?
- For security, is it better to avoid Deno standard library from assuming it has a default permission on FFI?
Probably not browser-compatible?

kt3k commented 1 year ago

Core team is not in favor of exposing jsonc related features as Deno CLI feature. I think using Rust is not an option for this issue.

mmuddasani commented 1 year ago

Is the unicodeSets feature (v-mode RegExp) allowed in Deno standard library? I don't see any current usages from a rudimentary code search.

At the very least, I've developed an appreciation for parsing complexity and can follow along better with the current tokenizer logic.

Since last week, I've discovered how to tokenize text input using modern Unicode-aware regexes. For example, it takes less than 400 characters of code to parse JSONC files when allowTrailingComma = false! Short (non-production) example:

const get_jsonc_tokenizer = () =>
  /([\p{ID_Continue}.+\-]+|[[\p{Pattern_Syntax}]--[\/\x22.+\-]]|[\r\n \t]+|\x22[^\x22\\]*(?:\\(?:$|[^])[^\x22\\]*)*(?:$|\x22))|\/(?:\/[^\r\n]*(?:$|\n|\r\n?)|\*[^*]*(?:$|\*+(?:$|\/|[^*\/][^*]*))+)?/gvy

function parse(text: string, { allowTrailingComma = false as const } = {}) {
  return JSON.parse(text.replaceAll(get_jsonc_tokenizer(), "$1"))
}

(Caveat: When parsing fails due to SyntaxError, positions inside error message do not correspond to positions of the original text.)

And an example for tokenizing input:

function* tokenize(text: string) {
  let match
  for (match of text.matchAll(get_jsonc_tokenizer())) yield match[0]
  const position = match ? match.index! + match[0].length : 0
  if (position !== text.length) {
    const token = text[position]
    const error = `Unexpected token ${token} in JSONC at position ${position}`
    throw new SyntaxError(error)
  }
}

How It Works

The regexp loosely considers a JSONC token to be the largest non-empty prefix of one of these:

ID-ish: [\p{ID_Continue}.+\-]+ in Unicode-aware mode (either 'u' or 'v' flag)
- e.g.: -3.14e+1 , null , gibberish , 127.0.0.1 , etc.
- (update: plus sign can appear in numbers)
Punctuation: [[\p{Pattern_Syntax}]--[\/\x22.+\-]] with UnicodeSet notation (only 'v' flag)
- e.g.: comma , bracket , ~~plus sign~~
- (update: plus sign can appear in numbers)
- not e.g.: double quote , minus sign , plus sign
Whitespace: [\r\n \t]+
String: \x22[^\x22\\]*(?:\\(?:$|[^])[^\x22\\]*)*(?:$|\x22)
Comment: / followed by either:
- rest of line comment: \/[^\r\n]*(?:$|\n|\r\n?)
- rest of block comment: \*[^*]*(?:$|\*+(?:$|\/|[^*\/][^*]*))+

When combined with sticky mode, matchAll will tokenize the input and stop either if no input remains or if the next character is not vetted by the regexp.

The example regexp was designed to vet more identifier-like tokens, numbers, punctuation, and string escapes than JSON allows. Relying on JSON.parse will further vet the illegal syntax.

The example regexp was also designed with a single capture group for all non-comment tokens, allowing replaceAll to strip away all comments.

(Side note: Using \x22 indirectly for " prevents VSCode syntax highlighting from looking inside v-mode.)

mmuddasani commented 1 year ago

To avoid comment loss by default, perhaps the below scenarios provide a sensible default mode. In order to assess for any unforeseen ambiguity/unknowns, I will try to prototype an API and implementation.

Perhaps the MVP should not care about preserving whitespace formatting beyond what is consistent with stringify indentation, especially if it is infeasible to do so. The default priority would be to preserve relative order as well as original comment content.

If a client program cares enough about whitespace formatting, then the client program could either:

follow up with an existing formatting API on behalf of the config owner
let the config owner manually curate the result and/or run a formatting tool such as deno fmt

Scenario: Allow deleting/replacing hereditarily uncommented values without a trace.

Example before:

{ "k.keep": { "k.old": "v.old", }, "k.delete": "v.delete" }

Example after modifications (whether done via a replace/update/rename):

{ "k.keep": { "k.new": "v.new", } }

Scenario: Allow replacing/updating/renaming comment-containing values by preserving relevant portions as documentation.

Example before:

{"k.keep":/*keep.start*/{/*a*/"k.old"/*b*/:/*c*/"v.old"/*d*/,/*e*/},//keep.end
/*f*/"k.delete"/*g*/:/*h*/"v.delete"/*i*/,/*j*/}

Example after a kept key is updated via value replacement:

{
  "k.keep":
  // /*keep.start*/{/*a*/"k.old"/*b*/:/*c*/"v.old"/*d*/,/*e*/},//keep.end
  { "k.new": "v.new", },
  // /*f*/"k.delete"/*g*/:/*h*/"v.delete"/*i*/,/*j*/
}

Example after a kept value is updated: individual key-value pairs are deleted, new individual key-value pairs are introduced:

{
  "k.keep":/*keep.start*/{
    // /*a*/"k.old"/*b*/:/*c*/"v.old"/*d*/,/*e*/
    "k.new": "v.new",
  },//keep.end
  // /*f*/"k.delete"/*g*/:/*h*/"v.delete"/*i*/,/*j*/
}

Example after a kept value has its keys and/or values renamed (i.e. to preserve semantic placement of associated comments):

{
  "k.keep":/*keep.start*/{/*a*/"k.renamed"/*b*/:/*c*/"v.renamed"/*d*/,/*e*/},//keep.end
  // /*f*/"k.delete"/*g*/:/*h*/"v.delete"/*i*/,/*j*/
}

denoland / std