Add VelocyPack to the native JSON benchmark

jsteemann commented 8 years ago

The repository https://github.com/miloyip/nativejson-benchmark contains a benchmark suite for various C/C++-based JSON parsers and generators. It would be nice to get VelocyPack into that list so its performance can be compared to other parsers/generators easily.

dsonet commented 8 years ago

@jsteemann Also, has anyone taken a look at http://rapidjson.org? It's extremely fast, memory-friendly, small, implements JSON Schema and JSON Pointer, and is available under the MIT license.

ColinH commented 6 years ago

After hacking around with velocypack, the nativejson benchmark, and our taocpp/json library for a few hours, here are a few preliminary results. Note that, for now, I always chose the easiest way; in particular, the Builder and Parser classes are always default constructed without options.

Besides pure velocypack and pure taocpp/json there are also results for a combination of the two, the taocpp/json parser with a velocypack Builder (similar to the combination of our taocpp/json parser with the nlohmann/json value class that is already part of the nativejson benchmarks).

In the conformance section, taocpp/json achieves a 100% overall score, velocypack 94%, and the combination 97%. The improvement from velocypack to the combination is due to taocpp/json's better parsing of doubles. The 3% missing for the combination is due to some failures in the roundtrip tests, which might or might not be real issues.

In the performance section, taocpp/json, velocypack, and the combination are all equally fast in the stringify tests.

And now the numbers you've probably been waiting for: on my laptop the overall parsing benchmark results are 35ms for taocpp/json, 52ms for velocypack, and 26ms for the combination. taocpp/json uses std::map and std::vector for objects and arrays, respectively, which makes it unsurprising that the combination is faster than taocpp/json on its own.

jsteemann commented 6 years ago

@ColinH : thanks for your work on this. We have not yet found the time to add velocypack to the nativejson-benchmark ourselves. Is there a fork of any of the repositories (nativejson-benchmark or velocypack) that contains your current state? I would be interested in looking at the conformance test failures and fixing them, and also in running some of the performance benchmarks myself so we can try to optimize based on that.

ColinH commented 6 years ago

@jsteemann It's still work in progress but if you send an email to github@colin-hirsch.net I can share some more details.

ColinH commented 6 years ago

One thing that is ready to be shown is the adapter from the taocpp/json events API to the velocypack Builder. That's all the code you need to connect our parser to your in-memory representation.

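#include <cstdint>
#include <string>
#include <utility>

#include "velocypack/vpack.h"

// Consumes taocpp/json events and forwards them to a velocypack Builder.
struct to_velocypack_events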
{
   arangodb::velocypack::Builder builder;

   void add( const arangodb::velocypack::Value & v )
   {
      if ( m_member ) {
         builder.add( m_key, v );
         m_member = false;
      }
      else {
         builder.add( v );
      }
   }

   void null()
   {
      add(arangodb::velocypack::Value(arangodb::velocypack::ValueType::Null));
   }

   void boolean( const bool v )
   {
      add(arangodb::velocypack::Value(v));
   }

   void number( const std::int64_t v )
   {
      add(arangodb::velocypack::Value(v));
   }

   void number( const std::uint64_t v )
   {
      add(arangodb::velocypack::Value(v));
   }

   void number( const double v )
   {
      add(arangodb::velocypack::Value(v));
   }

   void string( const std::string& v )
   {
      add(arangodb::velocypack::Value(v));
   }

   void begin_array()
   {
      add(arangodb::velocypack::Value(arangodb::velocypack::ValueType::Array));
   }

   void element()
   {
   }

   void end_array()
   {
      builder.close();
   }

   void begin_object()
   {
      add(arangodb::velocypack::Value(arangodb::velocypack::ValueType::Object));
   }

   // Pending object key: set by key(), consumed by the next call to add().
   std::string m_key;
   bool m_member = false;

   void key( const std::string& v )
   {
      m_key = v;
      m_member = true;
   }

   void key( std::string&& v )
   {
      m_key = std::move( v );
      m_member = true;
   }

   void member()
   {
   }

   void end_object()
   {
      builder.close();
   }
};
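To make the call order that the adapter relies on easier to see, here is a minimal, self-contained sketch. It does not use velocypack or taocpp/json at all: `trace_events` and `produce_sample` are hypothetical names invented for this illustration, and the event order (key before value, `member()` after an object value, `element()` after an array value) is an assumption inferred from the adapter above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical consumer: records each event as text instead of filling a Builder.
struct trace_events
{
   std::string trace;

   void null() { trace += "null "; }
   void boolean(const bool v) { trace += v ? "true " : "false "; }
   void number(const std::int64_t v) { trace += std::to_string(v) + " "; }
   void string(const std::string& v) { trace += "\"" + v + "\" "; }
   void begin_array() { trace += "[ "; }
   void element() { trace += "element "; }
   void end_array() { trace += "] "; }
   void begin_object() { trace += "{ "; }
   void key(std::string&& v) { trace += "key:" + v + " "; }
   void member() { trace += "member "; }
   void end_object() { trace += "} "; }
};

// Hypothetical producer: emits the events for {"a":[1,true]} by hand,
// in the order the adapter above appears to rely on.
template <typename Consumer>
void produce_sample(Consumer& consumer)
{
   consumer.begin_object();
   consumer.key("a");                  // the key arrives before its value
   consumer.begin_array();
   consumer.number(std::int64_t(1));
   consumer.element();                 // follows each array value
   consumer.boolean(true);
   consumer.element();
   consumer.end_array();
   consumer.member();                  // follows each object value
   consumer.end_object();
}
```

Because the producer is a template, the same `produce_sample` call could drive any consumer that offers these member functions, which is why a key only needs to be buffered until the next value arrives.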

jsteemann commented 6 years ago

@ColinH : thanks for the update. If it's still work in progress, then I'd prefer to wait until you declare it finished or stable and then try your "official" fork. If in the meantime you find any conformance errors in velocypack that block you from making progress, just let us know so we can fix them. Thanks again for your work on this.

ColinH commented 6 years ago

Here are the roundtrip tests that both velocypack and the combination of the taocpp/json parser with velocypack fail. The third one should, in my opinion - see also my comments in the nativejson benchmark issue linked above - not really be considered a failure since it is an equivalent representation.

## 4. Roundtrip

* Fail:
~~~js
[0.0]
~~~

~~~js
[0]
~~~

* Fail:
~~~js
[-0.0]
~~~

~~~js
[-0]
~~~

* Fail:
~~~js
[1.7976931348623157e308]
~~~

~~~js
[1.7976931348623157e+308]
~~~

Summary: 24 of 27 are correct.

Keep in mind that most libraries do not achieve 100% in the conformance tests; currently only RapidJSON in full precision mode, and taocpp/json do.

jsteemann commented 6 years ago

We can potentially add options to the velocypack Dumper for these cases so it can optionally produce the same results. Am I correct in assuming that in the first two cases the "bug" in velocypack is that it does not emit a ".0" for real values that happen to be safely representable as integers, and that the latter "bug" is about velocypack emitting the "+" in scientific notation whereas the benchmark test considers it optional? Is there anything else that would need adjustment in order to achieve 100% conformance in that test?
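As a rough illustration of the two adjustments described here, the following sketch formats a finite double so that integral values keep a trailing ".0" and the exponent carries no "+". The function name `format_double_roundtrip` is hypothetical and not part of velocypack or its Dumper; the approach (printf with increasing precision until the text parses back to the same value) is just one common way to do this and assumes a finite input and the default "C" locale.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of velocypack: formats a finite double so that
// integral values keep a trailing ".0" and exponents carry no '+' sign.
std::string format_double_roundtrip(const double v)
{
   char buffer[32];
   // Try increasing precision until the text parses back to the same value.
   for (int precision = 15; precision <= 17; ++precision) {
      std::snprintf(buffer, sizeof(buffer), "%.*g", precision, v);
      if (std::strtod(buffer, nullptr) == v) {
         break;
      }
   }
   std::string result(buffer);
   // Drop the optional '+' that printf emits after the exponent marker.
   const std::string::size_type e = result.find('e');
   if (e != std::string::npos && e + 1 < result.size() && result[e + 1] == '+') {
      result.erase(e + 1, 1);
   }
   // Mark integral values as doubles, e.g. "0" becomes "0.0".
   if (result.find_first_of(".e") == std::string::npos) {
      result += ".0";
   }
   return result;
}
```

With this helper the three inputs from the roundtrip report, 0.0, -0.0 and 1.7976931348623157e308, come out in the form the benchmark expects.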

ColinH commented 6 years ago

Correct, these three small things are the only issues in the roundtrip conformance. There are more issues with double conformance; if you drop me a line I can tell you how we fixed them. As is often the case, as soon as floating point is involved things get rather complicated (unless you don't care about precision).

## 2. Parse Double

* `[0.017976931348623157e+310]`
  * expect: `1.7976931348623157e+308 (0x0167FEFFFFFFFFFFFFF)`
  * actual: `0 (0x0160)`

* `[10141204801825834086073718800384]`
  * expect: `1.0141204801825834e+31 (0x016465FFFFFFFFFFFFF)`
  * actual: `1.0141204801825833e+31 (0x016465FFFFFFFFFFFFE)`

* `[10141204801825835211973625643008]`
  * expect: `1.0141204801825835e+31 (0x0164660000000000000)`
  * actual: `1.0141204801825833e+31 (0x016465FFFFFFFFFFFFE)`

* `[10141204801825834649023672221696]`
  * expect: `1.0141204801825835e+31 (0x0164660000000000000)`
  * actual: `1.0141204801825833e+31 (0x016465FFFFFFFFFFFFE)`

* `[5708990770823838890407843763683279797179383808]`
  * expect: `5.7089907708238389e+45 (0x016496FFFFFFFFFFFFF)`
  * actual: `5.7089907708238433e+45 (0x0164970000000000003)`

* `[5708990770823839524233143877797980545530986496]`
  * expect: `5.7089907708238395e+45 (0x0164970000000000000)`
  * actual: `5.7089907708238433e+45 (0x0164970000000000003)`

* `[5708990770823839207320493820740630171355185152]`
  * expect: `5.7089907708238395e+45 (0x0164970000000000000)`
  * actual: `5.7089907708238433e+45 (0x0164970000000000003)`

* `[100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]`
  * expect: `1e+308 (0x0167FE1CCF385EBC8A0)`
  * actual: `9.9999999999999981e+307 (0x0167FE1CCF385EBC89F)`

Summary: 58 of 66 are correct.
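For anyone who wants to reproduce the "expect" values independently, here is a small sketch that parses the same inputs with the C library and returns the raw bit pattern. It assumes a `std::strtod` that rounds correctly (as glibc's does) and IEEE 754 doubles; `double_bits` and `parse_bits` are hypothetical helper names, and this is not how velocypack or taocpp/json parse numbers.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns the raw IEEE 754 bit pattern of a double.
std::uint64_t double_bits(const double d)
{
   std::uint64_t bits;
   std::memcpy(&bits, &d, sizeof(bits));
   return bits;
}

// Parses decimal text with the C library, used here only as a reference.
std::uint64_t parse_bits(const char* text)
{
   return double_bits(std::strtod(text, nullptr));
}
```

Comparing `parse_bits("10141204801825834086073718800384")` against `0x465FFFFFFFFFFFFF`, the last 16 hex digits of the corresponding "expect" line, is a quick way to check a parser without the full benchmark. The inputs in this list sit at or next to the midpoint between two adjacent doubles, which is why parsers that accumulate digits in floating point tend to be off by a few units in the last place.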

jsteemann commented 6 years ago

Oh yes, if you fixed these ones in your library already and have some hints, that should definitely save us some work! Can you point me at your floating point parser/builder in your library (I guess there is one)? I think I can use that as a starting point and check what it does differently than ours.

ColinH commented 6 years ago

We use a modified version of the V8 double-conversion library that was adapted to interface more directly with the PEGTL, our C++11 parser library in taocpp that we use for taocpp/json.

I just forked velocypack and replaced the included JSON parser with taocpp/json; the result can be seen here. This fixes the floating-point issues and, as mentioned above, doubles the performance of parsing JSON to a velocypack Builder as measured by the nativejson benchmarks!

For completeness it would be necessary to use taocpp/json for the serialisation to json, too, and to fix the one "TODO" in Parser.h, the error position (which should also be easy, except for some positions in the test cases possibly needing an update).

ColinH commented 6 years ago

Btw, the toBuilderEvents class in my fork, as well as the yet-to-be-written fromBuilder, give many more possibilities; the unified Events interface in the taocpp/json library would allow direct conversion of velocypack from/to our in-memory representation, tao::json::value, and all file formats supported by taocpp/json like UBJSON and CBOR.

(Our in-memory representation is based on the standard containers, which makes inspection and manipulation very easy, as such it is complementary to velocypack.)

jsteemann commented 6 years ago

Sounds good! I will definitely look into the fork in detail, though I probably won't be able to do it this week. I need to work on some other things first so that we can ship the next release. But after that I will have a close look. Thanks!

ColinH commented 6 years ago

FYI, the glue code to produce taocpp/json Events from a Builder has now been added. With the taocpp/json based Parser, and the builderToJson() function in the new header velocypack/Json.h, my fork of velocypack now achieves a 100% conformance score in the nativejson benchmark.

In addition, it makes velocypack compatible with all other taocpp/json Events producers and consumers: you can convert velocypack to/from several other binary formats and several JSON in-memory representations, apply some simple transformations, and of course easily add some more.

ColinH commented 6 years ago

#include "../test.h"

#include <cstdint>
#include <cstring>
#include <memory>
#include <string>

#include "velocypack/vpack.h"

// Nativejson-benchmark integration of ColinH/velocypack,
// an experimental arangodb/velocypack plus taocpp/json.

class StatHandler
{
public:
    StatHandler(Stat& stat) : stat_(stat) {}

    void null() { stat_.nullCount++; }

    void boolean(const bool v) { v ? stat_.trueCount++ : stat_.falseCount++; }

    void number(const std::int64_t) { stat_.numberCount++; }
    void number(const std::uint64_t) { stat_.numberCount++; }
    void number(const double) { stat_.numberCount++; }

    void string(const tao::string_view& v ) { stat_.stringCount++; stat_.stringLength += v.size(); }
    void binary(const tao::byte_view& ) {}
    void begin_array(const std::size_t = 0) {}
    void element() { stat_.elementCount++; }
    void end_array(const std::size_t = 0) { stat_.arrayCount++; }

    void begin_object(const std::size_t = 0) {}
    void key(const tao::string_view& v) { stat_.stringCount++; stat_.stringLength += v.size(); }
    void member() { stat_.memberCount++; }
    void end_object(const std::size_t = 0) { stat_.objectCount++; }

private:
    StatHandler& operator=(const StatHandler&) = delete;

    Stat& stat_;
};

// Walks the Builder contents and counts values via the StatHandler events.
static void GenStat(Stat& stat, const arangodb::velocypack::Builder& builder) {
   StatHandler statHandler(stat);
   arangodb::velocypack::builderToEvents(statHandler, builder);
}

struct velocypack_options
{
   velocypack_options()
   {
      options.validateUtf8Strings = true;
      options.checkAttributeUniqueness = true;
   }

   arangodb::velocypack::Options options;
};

struct velocypack_parser
{
   velocypack_parser()
      : options(),
        parser()
   {}

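   // Note: options is constructed here but not passed to the parser,
   // which is default constructed in the initializer list above.
   velocypack_options options;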
   arangodb::velocypack::Parser parser;
};

class VELOCYPACKParseResult : public ParseResultBase {
public:
   std::shared_ptr< arangodb::velocypack::Builder > root;
};

class VELOCYPACKStringResult : public StringResultBase {
public:
   virtual const char* c_str() const { return s.c_str(); }

   std::string s;
};

class VELOCYPACKTest : public TestBase {
public:
#if TEST_INFO
   virtual const char* GetName() const { return "velocypack (C++11)"; }
   virtual const char* GetFilename() const { return __FILE__; }
#endif

#if TEST_PARSE
   virtual ParseResultBase* Parse(const char* json, size_t length) const {
      VELOCYPACKParseResult* pr = new VELOCYPACKParseResult;
      try {
         velocypack_parser parser;
         parser.parser.parse(reinterpret_cast<const uint8_t *>(json), length);
         pr->root = parser.parser.steal();
      }
      catch (...) {
         delete pr;
         return nullptr;
      }
      return pr;
   }
#endif

#if TEST_STRINGIFY
   virtual StringResultBase* Stringify(const ParseResultBase* parseResult) const {
      const VELOCYPACKParseResult* pr = static_cast<const VELOCYPACKParseResult*>(parseResult);
      VELOCYPACKStringResult* sr = new VELOCYPACKStringResult;
      sr->s = arangodb::velocypack::builderToJsonString( *pr->root );
      return sr;
   }
#endif

#if TEST_PRETTIFY
   virtual StringResultBase* Prettify(const ParseResultBase* parseResult) const {
      const VELOCYPACKParseResult* pr = static_cast<const VELOCYPACKParseResult*>(parseResult);
      VELOCYPACKStringResult* sr = new VELOCYPACKStringResult;
      sr->s = arangodb::velocypack::builderToPrettyJsonString( *pr->root );
      return sr;
   }
#endif

#if TEST_STATISTICS
   virtual bool Statistics(const ParseResultBase* parseResult, Stat* stat) const {
      const VELOCYPACKParseResult* pr = static_cast<const VELOCYPACKParseResult*>(parseResult);
      ::memset(stat, 0, sizeof(Stat));
      GenStat(*stat, *pr->root);
      return true;
   }
#endif

// TEST_SAXROUNDTRIP does not involve velocypack (only taocpp/json).

// TEST_SAXSTATISTICS does not involve velocypack (only taocpp/json).

#if TEST_CONFORMANCE
   virtual bool ParseDouble(const char* json, double* d) const {
      try {
         velocypack_parser parser;
         parser.parser.parse( std::string( json ) );
         const auto builder = parser.parser.steal();
         arangodb::velocypack::Slice slice( builder->start() );
         if ( slice.type() == arangodb::velocypack::ValueType::Array ) {
            slice = slice.at( 0 );
            if ( slice.type() == arangodb::velocypack::ValueType::Double ) {
               *d = slice.getDouble();
               return true;
            }
         }
      }
      catch (...) {
      }
      return false;
   }

   virtual bool ParseString(const char* json, std::string& s) const {
      try {
         velocypack_parser parser;
         parser.parser.parse( std::string( json ) );
         const auto builder = parser.parser.steal();
         arangodb::velocypack::Slice slice( builder->start() );
         if ( slice.type() == arangodb::velocypack::ValueType::Array ) {
            slice = slice.at( 0 );
            if ( slice.type() == arangodb::velocypack::ValueType::String ) {
               arangodb::velocypack::ValueLength length;
               const char * string = slice.getString( length );
               s = std::string( string, length );
               return true;
            }
         }
      }
      catch (...) {
      }
      return false;
   }
#endif

};

REGISTER_TEST(VELOCYPACKTest);

Simran-B commented 5 years ago

@jsteemann The Performance section in the docs is still to be written. Do we have some numbers, maybe on some 3rd party site? https://github.com/arangodb/velocypack/blob/master/Performance.md#performance

ColinH commented 5 years ago

@kvahed @jsteemann FYI, just moved everything into the taojson branch in my fork, i.e. https://github.com/ColinH/velocypack/tree/taojson .
