chadaustin / sajson

Lightweight, extremely high-performance JSON parser for C++11
MIT License
562 stars 42 forks source link

Parsing failure #7

Closed miloyip closed 9 years ago

miloyip commented 9 years ago

I recently try to add sajson to nativejson-benchmark in this branch.

It was able to parse two test json data, but failed for twitter.json. The error message generated is

Verifying sajson (C++) ... Error (1:1): illegal unprintable codepoint in string

with this code:

class SajsonParseResult : public ParseResultBase {
public:
    SajsonParseResult(document&& d_) : d(std::move(d_)) {}
    document d;
};

// ...
    virtual ParseResultBase* Parse(const char* json, size_t length) const {
        (void)length;
        SajsonParseResult* pr = new SajsonParseResult(parse(literal(json)));
        if (!pr->d.is_valid()) {
            std::cout << "Error (" << pr->d.get_error_line() << ":" << pr->d.get_error_column() << "): " << pr->d.get_error_message() << std::endl;
        }
        return pr;
    }

I am pretty sure the json is valid, as it is obtained from twitter's API, and also passed more than a dozens of parsers.

BTW, I am the author of RapidJSON :)

chadaustin commented 9 years ago

Hi! Thanks for the bug report. :) Indeed, it was valid. The latest sajson fixes the bug.

Thanks for RapidJSON, by the way. We use it at IMVU, and while I have a few things I don't particularly like about the API, you set a new high bar for everyone.

miloyip commented 9 years ago

I have verified the fix in the benchmark. Thank you. If you have suggestions on RapidJSON, please feel free to drop an issue.

miloyip commented 9 years ago

I just run with the both latest versions:

Benchmarking RapidJSON (C++)
          Parse canada.json          ...  9.763 ms  219.888 MB/s
          Parse citm_catalog.json    ...  5.638 ms  292.159 MB/s
          Parse twitter.json         ...  4.225 ms  142.546 MB/s

Benchmarking sajson (C++)
          Parse canada.json          ... 12.538 ms  171.221 MB/s
          Parse citm_catalog.json    ...  8.717 ms  188.963 MB/s
          Parse twitter.json         ...  2.661 ms  226.328 MB/s

It seems I need to figure out how to improve the performance for twitter.json. :smile:

chadaustin commented 9 years ago

Hah! On my machine / benchmark, (an older version of) rapidjson outperformed sajson with twitter.json. :) I'd post numbers but I don't have access to that VM at the moment.

I've found they tend to go back and forth depending on the CPU and workload... which probably means there's room to improve both parsers. :)

miloyip commented 9 years ago

One of my intention for making the benchmark is to learn from others and make improvements together. I just learn about __builtin_expect from your code. Is it really beneficial?

chadaustin commented 9 years ago

It depends on the CPU (branch predictor and whatever), compiler, and data set, but sometimes it can be very beneficial!

In particular, in this particular context in a different project ( https://github.com/chadaustin/buffer-builder/blob/master/cbits/buffer.cpp#L82 ) it was a substantial win.

I suspect it won because the compiler could then optimize the register usage / spillage for the hot path. You could clearly see in the generated assembly that clang was trying to make the hot path a single sequence of instructions, with only forward jumps for the cold paths. (I believe, by default, forward jumps are predicted not taken, and backward jumps are predicted taken.)