Unibeautify / sparser

A framework of various language parsers

Whitespace is breaking token detection #105

Open · zxpectre opened this issue 2 years ago

zxpectre commented 2 years ago

Hi, I'm afraid whitespace is breaking proper token detection; without the whitespace this works. Am I missing some option setup here?

Works:

jsonToObj(replaceAll('{"hello": {"FOO": {"world": 1234}}}',"FOO",cache.foo));

Fails:

jsonToObj(replaceAll('
    {"hello": {
        "FOO": {
            "world": 1234
            }
        }
    }',"FOO",cache.foo));

This produces a final compact token of ',"FOO",cache.foo));

Options:

    global.sparser.options={
        ...(global.sparser.options||{}),
        source:str,
        language:"javascript",    
        lexer:"script",    
    }
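
For completeness, here is a minimal sketch of how the parse would then be run against these options. The global.sparser.parser() entry point is an assumption based on Sparser's documented API, so verify it against the version you have installed:

// Minimal sketch; assumes global.sparser is already loaded and that
// sparser.parser() returns the parallel-array parse table shown further
// down this thread (begin, ender, lexer, lines, stack, token, types).
const str = `jsonToObj(replaceAll('{"hello": {"FOO": {"world": 1234}}}',"FOO",cache.foo));`;

global.sparser.options = {
    ...(global.sparser.options || {}),
    source: str,
    language: "javascript",
    lexer: "script"
};

const data = global.sparser.parser();
console.log(data.token); // one string per token
console.log(data.types); // matching token types
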
panoply commented 2 years ago

Hey, so you are parsing JSON, and internally Sparser will overwrite some global options when dealing with such a language (for example, the wrap limit will be reset to 0), so be aware of this.

Looking at your example, you are inserting newlines into a string value, which is not going to work. Simply use a template literal, eg:

jsonToObj(replaceAll(`
    {"hello": {
        "FOO": {
            "world": 1234
            }
        }
    }`,"FOO",cache.foo));

Lastly, Sparser is no longer maintained. I don't know your exact use cases, but if you don't require diffing and just want the data structures, then maybe take a peek at my hard-forked variation Prettify, which leverages the powerful Sparser under the hood. It's still a WIP but might help you.

zxpectre commented 2 years ago

Ty for the reply @panoply. I'm parsing JS-like scripts like the one I shared, not just JSON. Wrapping text à la template literals is handled by my code using the ' (single quote) token.

So I'm expecting to find js mixed with JSON on my inputs.

Prettify looks promising! I will check it out once you officially release it :)

panoply commented 2 years ago

No problem @zxpectre, happy to help!

Can you submit a detailed issue to Prettify for me (with a detailed code sample/example)? I will be doing some work on the script lexer this week, and it would be nice to find out what is causing the issue in order to prevent it from occurring in other use cases and, with some luck, bring it up to a stable enough level that you can use it in your project.

panoply commented 2 years ago

@zxpectre I must have read your issue incorrectly; I see now that you are parsing the entirety of:

jsonToObj(replaceAll('
    {"hello": {
        "FOO": {
            "world": 1234
            }
        }
    }',"FOO",cache.foo));

I assumed you were only parsing the contents of replaceAll. This should not be too difficult to fix and is likely occurring in the wrap logic. Definitely forward it through to Prettify and I'll ensure a patch is applied.

zxpectre commented 2 years ago

I would really appreciate it if you could cover my use case, as I'm sure this can help everybody; these are very generic needs btw.

I'm in a hurry and using sparser right now, but I could migrate if Prettify does a nice job for us!

Can I ask you to share the output of your method prettify.parse(source: string): ParseTree on a script like the one I shared?

I will try to make a detailed issue if the output is handy for me. I like the idea of returning a tree; sparser has some limitations that complicate things when trying to nest nodes correctly (it sometimes mixes global and local scopes on end tokens, so it is hard to nest recursively).

panoply commented 2 years ago

Prettify will return an almost identical structure, as it's using Sparser under the hood (but with various bug fixes and some improved handling across the board). Don't get too married to the ParseTree naming convention; the data structures are still identical. Here is the structure returned for the code sample:

{
  begin: [
    -1, -1, 1, 1, 3,  3,  5,
     5,  5, 8, 8, 8, 11, 11,
    11, 11, 8, 5, 3
  ],
  ender: [
    -1, -1, -1, -1, -1, 17, 17,
    17, 16, 16, 16, 15, 15, 15,
    15, 15, 16, 17, -1
  ],
  lexer: [
    'script', 'script', 'script',
    'script', 'script', 'script',
    'script', 'script', 'script',
    'script', 'script', 'script',
    'script', 'script', 'script',
    'script', 'script', 'script',
    'script'
  ],
  lines: [
    0, 0, 0, 0, 0, 1, 0,
    0, 1, 2, 0, 1, 2, 0,
    1, 2, 2, 2, 0
  ],
  stack: [
    'global', 'global', 'method',
    'method', 'method', 'method',
    'object', 'object', 'object',
    'object', 'object', 'object',
    'object', 'object', 'object',
    'object', 'object', 'object',
    'method'
  ],
  token: [
    'jsonToObj',
    '(',
    'replaceAll',
    '(',
    "'\n",
    '{',
    '"hello"',
    ':',
    '{',
    '"FOO"',
    ':',
    '{',
    '"world"',
    ':',
    '1234',
    '}',
    '}',
    '}',
    `',"FOO",cache.foo));\n`
  ],
  types: [
    'word',   'start',    'word',
    'start',  'string',   'start',
    'string', 'operator', 'start',
    'string', 'operator', 'start',
    'string', 'operator', 'number',
    'end',    'end',      'end',
    'string'
  ]
}
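
For anyone unfamiliar with this shape: the output is a set of parallel arrays indexed by token, where begin[i] holds the index of the token that opened the structure containing token i (or -1 at the global scope). The following is a rough sketch of folding that flat table into nested nodes; it is an assumption about how one might consume the structure, not code from Sparser or Prettify:

// Rough sketch (not from Sparser or Prettify): build a tree by attaching
// each token to the token referenced by its begin[] index.
function toTree(data) {
  const nodes = data.token.map((token, i) => ({
    token,
    type: data.types[i],
    stack: data.stack[i],
    children: []
  }));
  const roots = [];
  nodes.forEach((node, i) => {
    const parent = data.begin[i];
    if (parent >= 0 && parent !== i) {
      nodes[parent].children.push(node); // child of the opening token
    } else {
      roots.push(node); // -1 means global scope
    }
  });
  return roots;
}
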

The defect occurs at the '," character walk, which happens because Sparser assumes an unterminated string, likely because a newline character follows the initial single quotation character. The problem here is that Sparser is behaving correctly: newlines cannot be contained within JavaScript single or double quotation characters, so any parser will fail on it. For example, see this flems
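
To illustrate that point outside of Sparser (this snippet is not from the thread or the linked flems), plain JavaScript draws the same line: an unescaped newline inside single or double quotes is a SyntaxError, while a template literal accepts literal newlines:

// A raw newline after the opening single quote makes the source invalid,
// so evaluating it throws a SyntaxError:
try {
  eval("const bad = '\n  {\"hello\": 1}\n';");
} catch (e) {
  console.log(e.name); // SyntaxError
}

// Template literals may span multiple lines, which is why the backtick
// version of the snippet parses:
const ok = `
  {"hello": 1}
`;
console.log(JSON.parse(ok).hello); // 1
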

I could introduce a rule for this, but I'd personally rather not let invalid syntax pass through.