Partial parsing - Githubissues

canatella commented 7 years ago

Here is an attempt at providing partial parsing support. It's orthogonal to the event parsing branch I proposed before so those changes are not included here.

There are still problems for me with this branch:

7ed92d18a4404d12119b3ace116d3df7f19d48f1: I'm coming from a C programming background, so I'm still not used to the C++ way of doing things. I'd like to modify the stacked value in place but for some reason it's not working so I pop them and push them back again, I suppose there is a better way.
Testing: I did some test locally but I did not modify the test.cpp, on purpose. It's a bit of a mess and I wanted to know how and what you'd want me to add as tests.

I also suppose you'll have other remarks and they'd be welcomed.

smarx commented 7 years ago

Automated message from Dropbox CLA bot

@canatella, thanks for the pull request! It looks like you haven't yet signed the Dropbox CLA. Please sign it here.

canatella commented 7 years ago

This PR is to go along with PR https://github.com/dropbox/json11/pull/83.

We develop mobile applications. We use json11 and djinni to connect to REST apis. We have a lot of data to transfert at first application startup on possibly bad connection (mobile data), and we want to be able to display the data ASAP. This PR allows parsing the json body without having to buffer the entire HTTP response body. As soon as there is data, you can feed the parser with it by calling consume and the parser will continue where it left. Used in conjunction with PR PR https://github.com/dropbox/json11/pull/83, it will trigger event as soon as any JSON value is available. With these two PR, we could then also make the parser drop already parsed code to limit memory usage when parsing large json strings, which is part of my use case. They could also allow parsing an io directly without having to buffer the entire string in memory.

From an architectural point of view, we need to be able to restart parsing from where we left at any point. For that,

we need to keep the full parser state up to date and self contained
we need to make the parsing loop use the state information.

At the api level, this implies that we expose a parser object with the aforementioned state information that the user can manage by itself, and a function to feed data to the parser.

The first commit deals with this api change, it creates a public parser class which hold the current parser implementation as private member and has a consume method.

The next commits are mostly changes needed to store the full parsing state, besides commit https://github.com/dropbox/json11/pull/88/commits/eb177d0053b0b69a3a510d1958b27768de430283 which uses the introduced state information to provide the depth limit check.

Finally, the last commit is the one making the parsing loop explicit.

I'll add some test that feed the parser with one character at a time and some documentation and examples in the header file.

artwyman commented 7 years ago

Okay, I understand your intent, and how this fits together with your other PR. I'm a bit iffy on merging this feature on it's own since it feels somewhat half-baked without the other pieces you describe to make a useful system. I'd be okay with it as a half-way step to a final good state, though, barring any concerns about safety or performance I might have on deeper review.

I'll wait for your next update with docs and examples before doing the full review.

canatella commented 7 years ago

I can continue the patch series with the freeing of already parsed string and the event parsing and the doc and example for everything if you prefer.

artwyman commented 7 years ago

I could go either way. I see the value of bite-sized pieces for me to review, and to built/tested separately. But I also prefer the final public result is something worth using. Possible compromises might be to do multiple stages of review in the same PR (I'm not sure how well GItHub supports this flow) or to merge multiple PRs onto a public branch then merge into master when all the planned work is done. The branch might be the easiest option for us to collaborate, if you want to make this a multi-part project.

My overall philosophy on PR merges/reviews (and on the internal code-reviews I do at Dropbox): Strictly, each public merge should not be a step backwards. No merging code which is broken, messy, unperformant, etc. with promises to fix it later. Less strictly, I prefer if each public merge is a step forward, meaning no half-baked features which aren't really ready for use yet, but there can be exceptions to this. The point so ensure that users who pull for the first time don't get a broken/confusing intermediate state, and that the code isn't left in the bad state if an author gets busy and doesn't finish the next step.

canatella commented 7 years ago

I added some documentation in the README.md

I added two tests where I used the already existing test results and compare them to the result of parsing the stream one char at a time. I also feeded a big (+/- 300) corpus of various valid json object to it from https://github.com/jdorfman/awesome-json-datasets without problems.

I added some other fixes detected by testing parsing json with comments enabled.

This patch set can be applied on his own has it does not change anything for current users: api and behaviour are the same except for the chunked parsing addition. So you could either apply it to master where it would get a bit of exposure, or apply it to another branch. I should be able to provide fixes in cases of trouble in the next weeks but I think I won't have the time to update the event parsing PR and the freeing of already parsed data quickly as I have more urgent stuff now.

Speaking of which, what would be the best way to free up the already parsed strings, and at which point would it be best to execute ? I could see the use for a circular buffer here but there are no implementation in std, and it might add too much complexity. Other solution I see is using a list of string, that way we can easily drop parsed strings, or replace the string at some point with a substring, but that would entail a lot of memory operations I think.

artwyman commented 7 years ago

Some high-level feedback after the deeper (but still not complete) code-review. It seems like your long-term goal is a good one, but the usability of this current step on its own isn't clear yet. It also seems that you've already diverged quite far from the original json11 code (it's not a simple 1:1 translation which is obviously correct, for instance), and will have gone far further down that path by the time you're done. In particular you're moving away from the "tiny library" descriptor which the README starts with.

By the time you're done I feel the library you'll be left with will be a newly-written library inspired by json11, so I wonder what the value is of calling it json11 at that point, vs. forking off and publishing your own thing under another name. I don't feel like I'll give much value as gatekeeper on your efforts, given I/Dropbox isn't motivated to use them. I want to encourage you to think about just going your own way and building the advanced parsing engine you want without worrying about what fits in with our needs.

One possible more limited integration would be if you might want to provide a more advanced parser which produces json11 objects as output, rather than replacing the original parser outright. In that case your parser could be a separate add-on library of its own, or maybe eventually a separate optional class/file in the json11 repo. If there are aspects of the json11 data which hold you back from doing so (like a missing interface for building the inner value types) I would be open to PRs which make json11 more open to replaceable parsers. I.e. make small changes to json11 to enable you and others to build on top of it freely, rather than making large changes to json11 to do exactly what you want.

canatella commented 7 years ago

Thank you for the review. I have a bit more time now but before going further, I want to say that if you don't want to pursue on this, that's no problem for me, the code is there and it's working for my use case. I wanted to share back the code but if you feel it's a burden for you, I can just keep it in my repository. I just don't want to end up going through all the review process to finally have the PR rejected.

So to make things clear, I'm willing to go forward with the review process if it's going to be accepted and from your previous comment it seems that it's more for you to say that.

Oh, and I do understand your point of view of course, I agree it's a lot of changes you mostly don't need, so no problem for me if you say we stop here ;)

artwyman commented 7 years ago

I'd be interested to hear from the original author @j4cbo on his interest in such a submission. From my point of view, Dropbox doesn't have a need for this, so I'd have to take on the CR and validation of this as a side project, and if this is going beyond Dropbox territory into community/side-project territory I feel like @j4cbo has more ownership here than I do.

dropbox / json11

Partial parsing #88