Handling of Invalid sequences

CodeMedic42 commented 7 years ago

Hey,

I'm writing a new tool and your library is a main part of it. Unfortunately how your library handles invalid sequences is causing me issues. I am curious as to how open you are to a small change.

I am creating a plug-in for Atom which will lint a JSON file. The thing is I need to know where in the file an issue is happening at. Since this library tokenizes the white space this becomes very easy to calculate the row/colmun - start/end coords. But in some cases your library throws an error. This happens with something like this

{
   "A": "Bad
}

or

{
   "A": Bad"
}

What I would like to see in these cases is a token with a new type, say for example 'invalid'. The library can stop at this point just like it does now. It just would not throw an error. This would mean that if there is an invalid sequence the last item in the array will always be of type 'invalid'.

As an example the two items above create the following 'invalid' tokens

{ "type": "invalid", "value":"B", "raw":"B" }

{ "type": "invalid", "value":"Bad", "raw":"\"Bad" }

This allows me to mark this spot in the file in Atom as an error. I have already done a small p-o-c to prove this out the above two situations. It was an easy change. I can work to do the rest and create a PR. But I would hate to do that work without checking to see if you are open to changes or suggestions.

Also I can see a config option which will allow someone to indicate if they want an error throw when these items are encountered or not. If true then the library will act as it does today. If false then it will act as I am looking for it to act.

Thoughts?

finnp commented 7 years ago

Hey!

I am actually quite open for this change. I think it would be nice though to have an option if the error should be thrown or not. This way we could also make it a non-breaking change.

Best, Finn

CodeMedic42 commented 7 years ago

Ok I will carve out some time to complete this and write unit tests. It should not take me more than a few days at the most to complete. But with other responsibilities I cannot give dedicated time to this, though I do need to get it completed. I will give you an update when I have a something to show off.

Aside from the design above, I will add a second optional param when executing the lexer. Something like this

const lexer = require('json-lexer');

lexer('{"hello": 1.0}', {
   throwOnInvalid: true || false
});

Do you have a preference for which is the default value? I would suggest 'true'.

finnp commented 7 years ago

Awesome. Yes, I think it would be nice to have the default work as it is.

I think it would be nice if the abort tokens will also include the error message. Haven't really looked in to it, but I guess this would only require a change to the abort function, right?

CodeMedic42 commented 7 years ago

We are both thinking along the same lines. I was moving forward relatively well, I had the abort function modified, until I hit the escape sequence errors. These errors happen inside of a token and basically invalidate PART of that token rather than the whole thing. For the tool I am writing I need to to know the exact location of the error so I can highlight it in the atom editor. What I am going to probably do is add an additional object to each token which fails. This object will have the error message normally thrown as well as a value indicating how my characters in and the length the error. This way as long as you know where the token is in the JSON string then you should know where the error occurs.

I already have code written to figure this out in my other tool. I also see you have asked for help to add this to your library. It might improve my performance if I add this logic your library. But I think to I might take another PR later to add this in. But for now I am just adding in the error handling.

Here is what I am estimating to be the schema of a token instance after I am completed. I am going to use as an example a string which has an invalid sequence "\x".

var example = {
   type: 'string',
   value: null,
   raw: '"This is a bad sequence (\x)"',
   issue: {
      message: 'Invalid escape sequence.',
      start: 25,
      length: 2
   }
}

If the case of the unicode sequence issue I still have to determine what I can do with this. I would like to highlight the sequence AND the code, but I have to see if I can do this. I am trying to not modify too much and use what you have already created as much as I can.

I also will need to figure out what to do if the string has an invalid sequence AND also is missing it's end double quote. I probably will not handle that and just deal with one error at a time. I have a fault trying to handle TOO much. :P

If the lexer cannot even determine the token type then the token which will be created will look like

var example = {
   type: 'unknown',
   value: null,
   raw: '', // Imagine the raw property has an invalid character
   issue: {
      message: 'Unrecognized token.',
      start: 0,
      length: 1
   }
}

If an issue encompass the whole token then we should expect the start to be zero and the length value to equal the length of the characters in the "raw" property.

If a token is valid the issue property will be null and will be exactly how it is today.

I guess that is it. Let me know if you have any questions or issues.

finnp / json-lexer

Handling of Invalid sequences #18