creationix / jsonparse

A streaming JSON parser written in pure JavaScript for node.js
MIT License

Invalid JSON (Invalid UTF-8 character at position 0 in state STRING1) #41

Open richardscarrott opened 4 years ago

richardscarrott commented 4 years ago

We're indirectly using jsonparse via JSONStream to stream in JSON data stored in Google Cloud Storage and we're intermittently seeing the following error:

Invalid JSON (Invalid UTF-8 character at position 0 in state STRING1)

99% of the time the data is parsed successfully, so I'm guessing it's related to where the chunks of data are split over HTTP. I believe it could be related to emoji or Japanese characters, as both exist in our JSON, but I'm struggling to pinpoint exactly where it's failing.

Is there perhaps a way to log more information re: the string value it failed on?
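
To make the chunk-split hypothesis above concrete, here is a minimal Node.js sketch. It is not taken from jsonparse or JSONStream; the payload and the split point are made up for illustration. It shows that a multi-byte UTF-8 character cut in half by a chunk boundary cannot be decoded chunk by chunk unless the decoder carries the partial bytes over.

```javascript
// Illustrative payload containing Japanese characters (3 bytes each in UTF-8).
const payload = Buffer.from('{"name":"日本"}', 'utf8');

// Hypothetical chunk boundary: one byte into the first Japanese character.
const splitAt = payload.indexOf(Buffer.from('日', 'utf8')) + 1;
const chunkA = payload.subarray(0, splitAt);
const chunkB = payload.subarray(splitAt);

// Decoding each chunk independently turns the broken bytes into U+FFFD.
const naive = chunkA.toString('utf8') + chunkB.toString('utf8');

// A streaming decoder holds the incomplete sequence until the rest arrives.
const streamingDecoder = new TextDecoder('utf-8');
const streamed =
  streamingDecoder.decode(chunkA, { stream: true }) +
  streamingDecoder.decode(chunkB, { stream: true }) +
  streamingDecoder.decode(); // flush

console.log('naive decode is corrupted:', naive.includes('\uFFFD'));
console.log('streamed decode matches:', streamed === payload.toString('utf8'));
```

Whether this is what actually happens inside the parser is not established at this point in the thread; the sketch only shows why a split character is a plausible trigger.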

richardscarrott commented 4 years ago

Okay, I've found the problematic character by logging the buffer before the error is thrown. It's a right double quote (”), and the error only occurs when it is at position 0 of a given chunk:

[Screenshot 2020-07-06 at 19 16 57]

Any idea why this character would be misinterpreted? Is it an issue on our side or a bug here?
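
One way to check whether a chunk boundary is involved is to test whether a chunk starts in the middle of a character. The right double quote (U+201D) is three bytes in UTF-8 (e2 80 9d), so a chunk that begins with 80 or 9d starts mid-character. The helper below is a hypothetical diagnostic, not part of jsonparse, and the two chunks are constructed by hand for the demonstration.

```javascript
// Hypothetical diagnostic: true if the chunk's first byte is a UTF-8
// continuation byte (bit pattern 10xxxxxx), i.e. the chunk starts mid-character.
function startsMidCharacter(chunk) {
  return chunk.length > 0 && (chunk[0] & 0xc0) === 0x80;
}

// Right double quote, encoded as three bytes: e2 80 9d.
const quote = Buffer.from('\u201D', 'utf8');

// Hand-made chunks that split the quote after its first byte.
const firstChunk = Buffer.concat([Buffer.from('"a'), quote.subarray(0, 1)]);
const secondChunk = Buffer.concat([quote.subarray(1), Buffer.from('b"')]);

console.log(startsMidCharacter(firstChunk));
console.log(startsMidCharacter(secondChunk));
```

Calling such a check from a `data` listener on the source stream would show whether the failing chunks consistently begin with a continuation byte.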

richardscarrott commented 4 years ago

It looks like this library (and JSONStream) is no longer actively maintained, so for anybody else who is unfortunate enough to run into this issue: I ended up using stream-json, which hasn't presented the same problem. For example:

Before:

import _ from 'highland';
import JSONStream from 'JSONStream';

// { data: [{}, {}, {}] }
_(readableStream)
    .through(JSONStream.parse('data.*'))
    .toArray((result) => console.log('DONE', result));

After:

import _ from 'highland';
import { parser } from 'stream-json';
import { pick } from 'stream-json/filters/Pick';
import { streamArray } from 'stream-json/streamers/StreamArray';

// { data: [{}, {}, {}] }
_(readableStream)
    .through(parser())
    .through(pick({ filter: 'data' }))
    .through(streamArray())
    .map(({ value }) => value)
    .toArray((result) => console.log('DONE', result));

cldellow commented 2 years ago

For others who stumble across this but still want to use this library, I think changing https://github.com/creationix/jsonparse/blob/b2d8bc6db4f6be3f276752b3b9f882b1945afede/jsonparse.js#L166-L171 can fix this.

Only emit the new character if the buffer contains at least as many bytes as are remaining in the sequence:

        // Copy only as many bytes as this chunk can actually supply.
        var toConsume = Math.min(this.bytes_remaining, buffer.length);
        for (var j = 0; j < toConsume; j++) {
          this.temp_buffs[this.bytes_in_sequence][this.bytes_in_sequence - this.bytes_remaining + j] = buffer[j];
        }
        this.bytes_remaining -= toConsume;

        // Emit the character only once the whole sequence has arrived.
        if (this.bytes_remaining === 0) {
          this.appendStringBuf(this.temp_buffs[this.bytes_in_sequence]);
          this.bytes_in_sequence = 0;
        }

My fork is pretty far removed from this one, otherwise I'd publish this in a more useful format. Still, hope it helps someone!
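
For readers who want to see the carry-over idea in isolation, here is a standalone sketch. It is not the jsonparse code: the class `CarryOverDecoder` and its fields are hypothetical names, and it assumes well-formed UTF-8 input with no validation. It follows the same rule as the snippet above: copy at most as many bytes as the chunk holds, and emit the character only when nothing remains outstanding.

```javascript
// Hypothetical helper, not part of jsonparse. Assumes well-formed UTF-8.
class CarryOverDecoder {
  constructor() {
    this.pending = Buffer.alloc(4); // room for the longest UTF-8 sequence
    this.bytesInSequence = 0;       // total length of the sequence in progress
    this.bytesRemaining = 0;        // bytes of that sequence not yet received
    this.out = '';
  }

  write(chunk) {
    let i = 0;

    if (this.bytesRemaining > 0) {
      // Take only what this chunk can supply; it may still fall short.
      const toConsume = Math.min(this.bytesRemaining, chunk.length);
      const offset = this.bytesInSequence - this.bytesRemaining;
      chunk.copy(this.pending, offset, 0, toConsume);
      this.bytesRemaining -= toConsume;
      i = toConsume;

      // Emit only when the sequence is complete.
      if (this.bytesRemaining === 0) {
        this.out += this.pending.toString('utf8', 0, this.bytesInSequence);
        this.bytesInSequence = 0;
      }
    }

    while (i < chunk.length) {
      const lead = chunk[i];
      // Sequence length implied by the lead byte.
      const len = lead < 0x80 ? 1 : lead >= 0xf0 ? 4 : lead >= 0xe0 ? 3 : 2;
      const available = Math.min(len, chunk.length - i);

      if (available === len) {
        this.out += chunk.toString('utf8', i, i + len);
      } else {
        // Character runs past the end of the chunk: stash what we have.
        chunk.copy(this.pending, 0, i, i + available);
        this.bytesInSequence = len;
        this.bytesRemaining = len - available;
      }
      i += available;
    }
    return this;
  }
}

// Worst case: every byte arrives in its own chunk.
const carry = new CarryOverDecoder();
for (const byte of Buffer.from('a\u201Db', 'utf8')) {
  carry.write(Buffer.from([byte]));
}
console.log(carry.out === 'a\u201Db');
```

The one-byte-per-chunk loop exercises the case where a single chunk does not hold all the remaining bytes of a sequence, which is the situation the `Math.min` in the snippet above guards against.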