dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[DISCUSSION] Adopting JSON-like format as next-generation model format #3916

Closed trivialfis closed 5 years ago

trivialfis commented 5 years ago

As discussed in #3878 and #3886 , we might want a more extendable format for saving XGBoost model.

For now my plan is to use the JSONReader and JSONWriter implemented in dmlc-core to add experimental support for saving/loading the model to a JSON file. Since the related area of code is quite messy and dangerous to change, I want to share my plan and possibly an early PR as soon as possible, so that someone can point out my mistakes early (there will be mistakes) and we don't duplicate work. :)

@hcho3

hcho3 commented 5 years ago

@trivialfis Are we thinking of something like dump_model(dump_format='json'), except that we'd be able to both read and write the new model format? (JSON dump is currently write-only)

trivialfis commented 5 years ago

@hcho3 Save() and Load(), with dmlc::JSONWriter* instead of dmlc::Stream* as the parameter.

hcho3 commented 5 years ago

@trivialfis Awesome! We can add it as a plug-in.

trivialfis commented 5 years ago

@hcho3 My plan is to replace the old binary Save and Load over a few releases, so currently I'm trying to patch the code directly. When it's done, XGBoost will save to JSON by default; as long as we keep the old binary Load function in the code, new releases of XGBoost can still read the old binary format. WDYT?

hcho3 commented 5 years ago

Since JSON support is going to be experimental for a while, I think we should make it a plug-in. That way, you can merge the experimental JSON parsing into master right away, so that people can start experimenting with it. (I'd like to use it too!)

Here's my idea:

* Add a plug-in option to the CMake build (e.g. `PLUGIN_JSON_MODEL_PARSER`):
```cmake
if(PLUGIN_JSON_MODEL_PARSER)
  list(APPEND SOURCES plugin/json_model_parser/json_model_parser.cc)
  add_definitions(-DXGBOOST_USE_JSON_MODEL_PARSER)
endif()
```


* Intercept `Load()` and `Save()` methods in `src/learner.cc` using the `XGBOOST_USE_JSON_MODEL_PARSER` macro:
```cpp
void Load(dmlc::Stream* fi) override {
#ifdef XGBOOST_USE_JSON_MODEL_PARSER
  dmlc::istream is(fi);
  std::unique_ptr<dmlc::JSONReader> reader(new dmlc::JSONReader(&is));
  JSONModelParser::Load(this, reader.get());
  return;
#endif
  // existing code...
}

void Save(dmlc::Stream* fo) const override {
#ifdef XGBOOST_USE_JSON_MODEL_PARSER
  dmlc::ostream os(fo);
  std::unique_ptr<dmlc::JSONWriter> writer(new dmlc::JSONWriter(&os));
  JSONModelParser::Save(this, writer.get());
  return;
#endif
  // existing code...
}
```

For this to work, `JSONModelParser` needs to be a friend of the `LearnerImpl` class.
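
For illustration, the friend relationship could look roughly like this (a sketch with placeholder members, not the actual `LearnerImpl` definition in `src/learner.cc`):

```cpp
#include <vector>

// Sketch only: placeholder class showing how the plug-in parser gains
// access to the learner's private model state via a friend declaration.
class JSONModelParser;  // forward declaration of the plug-in parser

class LearnerImpl {
 private:
  std::vector<float> weights_;  // stand-in for private model state

  // Allow the JSON model parser to read and write private members.
  friend class JSONModelParser;
};
```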

EDIT. updated to fix typo

trivialfis commented 5 years ago

@hcho3 Good idea. This way we don't even need to change the interface.

hcho3 commented 5 years ago

@trivialfis Cool. I'll look forward to your pull request.

KOLANICH commented 5 years ago

I don't think that a text-based format is a good choice. Even if we want to store models in git, it makes no sense to store them in a text format, because they change completely on each training run, so there is no benefit from storing deltas and we would still have to use git-lfs.

So a binary format is preferable as the main format, being more efficient and more easily parseable.

thvasilo commented 5 years ago

I'd argue for text models, because binary models introduce a long-term compatibility and maintenance burden. Case in point: recently there have been requests to add additional parameters to the model that cannot be satisfied easily because of backward compatibility (see [1]), and I remember users asking for the number of samples in each leaf.

A JSON model would be much easier to extend; having a binary specification with explicit versions would also help greatly, see e.g. DataSketches. Other options like exporting/importing to JPMML or ONNX (not sure if they support decision trees) could be left for other repositories.

KOLANICH commented 5 years ago

because binary models introduce a long-term compatibility and maintenance burden

It depends on the design. Usually developers make the format extensible enough by using a tag-size-contents pattern or simply checking a version field.
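
As a rough illustration of the tag-size-contents idea (a sketch only, not a proposed XGBoost format): each record carries a tag and a byte length, so an old reader can keep or skip tags it does not understand.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal tag-size-contents (TLV) reader sketch. Unknown tags are kept as
// opaque payloads (or could be skipped), which is what makes this layout
// extensible without breaking old parsers.
struct Record {
  uint32_t tag;
  std::vector<uint8_t> payload;
};

std::vector<Record> ReadRecords(const uint8_t* buf, size_t len) {
  std::vector<Record> records;
  size_t pos = 0;
  while (pos + 8 <= len) {
    uint32_t tag, size;
    std::memcpy(&tag, buf + pos, 4);       // 4-byte tag
    std::memcpy(&size, buf + pos + 4, 4);  // 4-byte payload size
    pos += 8;
    if (pos + size > len) break;           // truncated record, stop
    records.push_back({tag, std::vector<uint8_t>(buf + pos, buf + pos + size)});
    pos += size;
  }
  return records;
}
```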

cannot be satisfied easily because of backward compatibility,

Then I guess we should design a new binary format with extensibility in mind and deprecate the old one. We will have to have some schema anyway, so there is no need to use general-purpose formats like BSON/msgpack.

Other options like exporting/importing to JPMML or ONNX (not sure if they support decision trees) could be left for other repositories.

ONNX supports decision trees, but it is a general-purpose format optimized for arbitrary computation graphs.

hcho3 commented 5 years ago

I had someone tell me that the XGBoost model has an endianness compatibility issue: models trained on a little-endian machine cannot be loaded on a big-endian machine. Endian compatibility is one reason to favor a text-based format (as long as we stick to UTF-8).

hcho3 commented 5 years ago

Also note that LightGBM uses an INI-like format for model serialization.

KOLANICH commented 5 years ago

Endian compatibility is one reason to favor a text-based format (as long as we stick to UTF-8)

Endian compatibility is not a big problem: we can just specify that little-endian should be used and convert the byte order on serialization. Or we can store the endianness in the format and convert when it doesn't match the machine's endianness. We can still lay raw structures over memory-mapped files; we just need a conversion routine.
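
A rough sketch of such a conversion routine (assuming the on-disk format is fixed to little-endian and 32-bit words are byte-swapped on big-endian machines when serializing):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Detect whether this machine is little-endian.
inline bool IsLittleEndian() {
  const uint32_t probe = 1;
  uint8_t first_byte = 0;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 1;
}

// Swap the byte order of a 32-bit word.
inline uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Convert a buffer of 32-bit words (e.g. floats reinterpreted as words) to
// the canonical little-endian layout before writing; the same routine can
// be applied after reading on a big-endian machine.
void ToLittleEndian(std::vector<uint32_t>* words) {
  if (IsLittleEndian()) return;  // already in the canonical byte order
  for (uint32_t& w : *words) w = ByteSwap32(w);
}
```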

hcho3 commented 5 years ago

@KOLANICH That's a fair point. I think a good way to proceed is to have multiple parser plugins implemented and then hold a beauty contest to choose the best alternative for the serialization format. The criteria are:

  1. It should be performant.
  2. It should not introduce dependencies that make compilation too difficult.
  3. It should be easy to extend without breaking backward compatibility (models produced by an old XGBoost should load in a new XGBoost).
  4. It should be independent of endianness and machine architecture.
  5. It should be compact (in terms of byte size of the serialized model).

hcho3 commented 5 years ago

Personally, I'd eliminate Protocol Buffers from consideration. I've tried it in another project (dmlc/treelite) and it takes a long time to compile from source.

trivialfis commented 5 years ago

I ended up re-implementing a JSON library because it's too hard to get dmlc::JSONReader working without introducing breaking changes (the Writer is fine).

As for other formats, the simple facts are:

  1. We have different models, and I want to retain the possibility of adding new ones.
  2. Only the model itself knows how to save/load its internal data.
  3. There is an unlimited number of possible formats.

So a JSON representation can be both an intermediate and an end format; one can transform its in-memory representation however they like. This can also be viewed as a reason for re-implementing a JSON library, since the one in dmlc doesn't have a concrete data structure (it's a reader and a writer over a stream, not exactly a data structure).
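
For illustration only, such a concrete JSON value type could look roughly like the following; this is a simplified sketch, not the actual Json class from #3930 (which uses an RTTI-based design):

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch of a concrete, in-memory JSON-like value: a tagged union over the
// usual JSON kinds, which any end format (text JSON, BSON, ...) could be
// generated from by walking the tree.
struct JsonValue {
  enum class Kind { kNull, kBoolean, kNumber, kString, kArray, kObject };

  Kind kind{Kind::kNull};
  bool boolean{false};
  double number{0.0};
  std::string str;
  std::vector<JsonValue> array;                           // for kArray
  std::vector<std::pair<std::string, JsonValue>> object;  // for kObject
};

// Example: building a nested document in memory before choosing how to
// write it out.
JsonValue MakeExample() {
  JsonValue num_trees;
  num_trees.kind = JsonValue::Kind::kNumber;
  num_trees.number = 10;

  JsonValue model;
  model.kind = JsonValue::Kind::kObject;
  model.object.emplace_back("num_trees", num_trees);
  return model;
}
```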

hcho3 commented 5 years ago

@trivialfis You may be interested in reading this discussion: https://github.com/Microsoft/LightGBM/issues/372

@Laurae2 Any opinion on model format (binary vs text)?

trivialfis commented 5 years ago

@hcho3 I kept the binary model and it's still the default. I thought about the performance issue, and it turns out there are third-party specifications for binary forms of JSON; just searching for "binary JSON" should return a lot.

hcho3 commented 5 years ago

@trivialfis Thanks for the pointer. BSON looks like a good choice for binary encoding of JSON, given its permissive license (Apache 2.0) and support from MongoDB.

Also, Microsoft/LightGBM#1083 uses multi-threading to speed up serialization of large models. I'll leave more detailed comments in #3930 about the potential use of OpenMP for serializing trees in parallel.
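
The basic idea is something like the following sketch (assuming trees can be serialized independently; `SerializeOneTree` is a hypothetical stand-in): each thread serializes its own subset of trees into a string, and the results are concatenated in order so the output stays deterministic.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-tree serializer; the real code would write one tree of
// the ensemble in the chosen model format.
std::string SerializeOneTree(int tree_id) {
  std::ostringstream os;
  os << "{\"tree_id\": " << tree_id << "}";
  return os.str();
}

// Serialize trees in parallel, then join the per-tree strings in order.
std::string SerializeForest(int num_trees) {
  std::vector<std::string> parts(num_trees);
#pragma omp parallel for schedule(static)
  for (int i = 0; i < num_trees; ++i) {
    parts[i] = SerializeOneTree(i);
  }
  std::string out = "[";
  for (int i = 0; i < num_trees; ++i) {
    out += parts[i];
    if (i + 1 < num_trees) out += ",";
  }
  out += "]";
  return out;
}
```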

hcho3 commented 5 years ago

@trivialfis MessagePack looks like a good option as well.

hcho3 commented 5 years ago

I am now fully on board with using JSON-like format for serialization. It's extensible and (with proper binary encoding) can be made performant.

KOLANICH commented 5 years ago

I don't think that it is necessary.

So there are 4 options:

  1. use a custom format entirely.

    • pros:
      • better efficiency
    • cons:
      • getting the design of a custom format and its serializer and deserializer right (the deserializer can be written in Kaitai, so porting is not a problem; the problem is the serializer)
  2. use a custom format with embedded msgpack for metadata.

    • pros:
      • ability to insert some useful arbitrary metadata without much effort
    • cons:
      • people will tend to insert tons of irrelevant data into the model
      • additional dependency
      • arbitrariness: we'll need additional logic to walk the array and check tags
      • attack surface: BSON and msgpack allow references, including self-references.
  3. use bson/msgpack with raw extensions

    • pros: the same
    • cons
      • we still have to write and port a parser, but a level higher
  4. use bson/msgpack entirely:

    • pros:
      • compatibility with other programming languages: no need to rewrite the binary parser and serializer
    • cons:
      • efficiency: there will be overhead for most integer and float storage. The overhead for msgpack is one byte per value; BSON has larger overhead, since it stores field names and stores arrays as objects whose keys are numbers in text form.

hcho3 commented 5 years ago

@KOLANICH We can go with option 4. Note that I said "JSON-like format" in my earlier post; BSON/MessagePack are binary encodings of JSON.

efficiency: there will be overhead for most integer and float storage

I'm not that concerned with storage efficiency. With respect to serialization performance, BSON/MSGPACK seems to be decent. See https://github.com/ludocode/schemaless-benchmarks.

trivialfis commented 5 years ago

Okay, I will focus the discussion in this issue rather than in #3930. My proposal is:

  1. Implement a format-independent data structure that stores our models in memory during serialization as the intermediate form. Copied from #3930:

During serialization, we will have an intermediate form of our models. In this PR, it's the Json class, which can easily be translated into our actual models with all parameters and model-specific data preserved. Adding new parameters/models/data won't cause trouble, since the class that makes the changes knows how to deal with them by itself. The old binary dump doesn't have such a representation: to add a new field, one has to hack the learner model parameter very carefully, and there's no way to propagate loaded data (except for string parameters) to other classes.

How the Json class is stored on disk doesn't affect the Json class itself; you can traverse Json's node tree and obtain a completely different output format, say BSON. What I mean, specifically, is this Json class.

#3930 is already a working example (albeit Linux-only currently, with MSVC not building); you can use the patch from it to generate a JSON file to see how it looks and where there is room for improvement.

  2. Let's first introduce plain UTF-8 text JSON as the end format, then consider other formats later. From this, we can see how it works, whether it's worth the effort, whether new trouble appears, and whether extensibility is indeed improved. Once we have an intermediate-form representation, implementing other formats is just an item on the checklist; anyone who feels the need can implement one.

hcho3 commented 5 years ago

Implement a format-independent data structure that stores our models in memory during serialization as the intermediate form

In other words, we want nested key-value stores as intermediate representation. We don't need full JSON for this purpose. Let me try to come up with a simpler alternative by tomorrow.

The idea is to 1) have all components generate nested key-value stores and 2) have the serializer convert them into the end format, such as JSON. This design ensures that we can easily switch the end format later (JSON -> MSGPACK). For 2), we can use a 3rd-party library to generate fully compliant JSON files.
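
A minimal sketch of that two-step flow (the names `KVNode`, `SaveBoosterParams` and `ToJSON` are hypothetical stand-ins, not a proposed API):

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Step 1: components fill a nested key-value store. Leaf values are kept
// as already-printed JSON literals to keep the sketch short.
struct KVNode {
  std::vector<std::pair<std::string, std::string>> fields;  // scalar fields
  std::vector<std::pair<std::string, KVNode>> children;     // nested stores
};

void SaveBoosterParams(KVNode* node) {
  node->fields.emplace_back("num_feature", "128");
  node->fields.emplace_back("name", "\"gbtree\"");
}

// Step 2: a serializer walks the store and emits one end format (text JSON
// here); switching to MsgPack/BSON later only means replacing this walker.
std::string ToJSON(const KVNode& node) {
  std::ostringstream os;
  os << "{";
  bool first = true;
  for (const auto& kv : node.fields) {
    if (!first) os << ",";
    os << "\"" << kv.first << "\":" << kv.second;
    first = false;
  }
  for (const auto& kv : node.children) {
    if (!first) os << ",";
    os << "\"" << kv.first << "\":" << ToJSON(kv.second);
    first = false;
  }
  os << "}";
  return os.str();
}
```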

trivialfis commented 5 years ago

@hcho3 Nice.

hcho3 commented 5 years ago

@trivialfis BTW, good job with RTTI. I find it educational and edifying. This will come in handy for the nested key-value store I'm designing.

KOLANICH commented 5 years ago

In other words, we want nested key-value stores as intermediate representation.

Do we really need them nested? Nested stores can be used for general-purpose schemaless serialization, but our data has a schema, and it depends on native compiled code.

I mean that we know in advance that a tree has certain parameters, like the feature index, split margin and subtree ids; we don't need to put arbitrary data into a subtree, and even if we do, the code will be unable to use it unless it is explicitly coded to. This means the data is strongly typed, and introducing weakly-typed stuff like JSON would mean an additional burden of type checking and error handling.

hcho3 commented 5 years ago

additional burden of type checking and error handling

@KOLANICH To me this is a feature, not a bug. Right now, the parser will happily read whatever the binary stream provides. We do have some sanity checks (num_node should not be negative, etc.), but there are many holes. With the new proposal, we're forced to handle errors. If we use a 3rd-party library such as RapidJSON, error handling should not be too difficult.

general-purpose schemaless serialization, but our data has a schema, and it depends on native compiled code

But we want to be able to add new fields in the future without breaking backward compatibility. I am choosing JSON (or a binary encoding of it) because I don't want to deal with the hassle of custom-designing a binary format.

trivialfis commented 5 years ago

@hcho3 Nice, and thanks. If you carry out the implementation, I will try to see if it's possible to merge some parts of #3930 into the whole restructuring; it's already a working parser/dumper with a full-blown key-value structure. Please keep me posted. :)

@KOLANICH It all comes down to less fuss in the future. Personally, I want to spend more time on algorithms instead of this restructuring, bug-fixing and backward-compatibility drama, so I will prefer any idea that brings fewer headaches in the future.

KOLANICH commented 5 years ago

With the new proposal, we're forced to handle errors.

But the set of errors you will have to check for would be much wider. Right now the checks you have to perform are range checks, but with a general-purpose format you will have to check types and names (assuming that the format is designed to put the burden of range checking on the serialization library), that types match names, and that it is not possible to mess with their order to maliciously interfere with the program (for a binary format this comes for free: you just parse what you expect).

For me, range checking looks like an easier task than that.

trivialfis commented 5 years ago

But with a general-purpose format you will have to check for types, names (assuming that the format is designed to put the burden of range checking on the serialization lib), that types match names and their order (for a binary format this comes for free, you just parse what you expect).

That's the good thing about having an algorithmic structure; please see #3930, where all the checks are seamlessly built into the structure. Anything wrong will be caught. For example, if you have the wrong type, you won't be able to convert it into the actual value, since the type information is encoded into the structure by the RTTI implementation.

hcho3 commented 5 years ago

But the set of errors you will have to check for would be much wider.

Not really. Each class will look for a list of names (raising an error if a name is not found), and then check whether the found items have the right types (raising an error if a name has the wrong type). And the order of fields doesn't matter, as JSON objects are unordered.
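
A self-contained sketch of that load-side checking (field names and helper names here are hypothetical): look up each expected key, fail if it is missing, then fail again if the stored kind is not the expected one.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Minimal stand-in for a parsed JSON object: each field carries its kind.
struct Field {
  enum class Kind { kNumber, kString } kind;
  double number{0.0};
  std::string str;
};

using JsonObject = std::map<std::string, Field>;

// Look up a required numeric field, raising a descriptive error if it is
// missing or has the wrong type.
double RequireNumber(const JsonObject& obj, const std::string& name) {
  auto it = obj.find(name);
  if (it == obj.end()) {
    throw std::runtime_error("missing required field: " + name);
  }
  if (it->second.kind != Field::Kind::kNumber) {
    throw std::runtime_error("field has wrong type: " + name);
  }
  return it->second.number;
}

// Usage: a component loads its parameters and gets a clear error instead of
// silently reading garbage, plus the usual range checks on top.
void LoadTreeParam(const JsonObject& obj) {
  double num_nodes = RequireNumber(obj, "num_nodes");
  if (num_nodes < 0) {
    throw std::runtime_error("num_nodes must not be negative");
  }
}
```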

KOLANICH commented 5 years ago

And the order of fields doesn't matter, as JSON objects are unordered.

That's true if we parse the format entirely first and then use it as key-value storage, but it has some overhead related to search. There is another approach, when parsing is done on the fly and our function is called on each record; then the arguments would likely be processed in the order they are enumerated in the file. The argument about ordering is accepted.

trivialfis commented 5 years ago

@KOLANICH

but it has some overhead related to search

Most implementations of std::map are red-black trees, so searching is O(log n). In XGBoost, each nesting level should contain fewer than 10 entries; with this known bound of n <= 10, it's effectively O(1). The overhead is a non-issue.

when parsing is done on the fly and our function is called on each record

That would be a whole rewrite of XGBoost. For dealing with a large data blob, the better approach is to use a B-tree-like data format, for example HDF5. We don't need to concern ourselves with that.

hcho3 commented 5 years ago

@trivialfis And we can use std::unordered_map for greater efficiency: http://supercomputingblog.com/windows/ordered-map-vs-unordered-map-a-performance-study/

trivialfis commented 5 years ago

@hcho3 Actually, I doubt you could tell the difference between a naive linked list and a sophisticated hash table in our case. As commented above, it's a very small O(1) operation. The real concern I have while implementing it is the memory usage: copying models into another structure is a real pain, especially with everything serialized.

hcho3 commented 5 years ago

The real concern I have while implementing it is the memory usage: copying models into another structure is a real pain, especially with everything serialized.

Let's discuss this once I have a concrete proposal for nested key-value stores. I think it'll be something like what you have, but a bit simplified.

trivialfis commented 5 years ago

@hcho3 Great. I can try to drop Json::Dump, Json::Load and Json::Null tomorrow after seeing your proposal and see if it fits. If not, then I will close that PR. It all seems pretty good.

tqchen commented 5 years ago

Given that we already have a binary format, I think it is fine to just go ahead and support a JSON version. The two versions are likely going to be sufficient.

tqchen commented 5 years ago

For storing trees, my recommendation would be not to store them as a nested data structure, but instead as an adjacency list, just like the nnvm graph format. It should not be hard to convert back and forth.

hcho3 commented 5 years ago

@tqchen A prototype is available at #3930.

store them as adjacency list

I'd like to hear your reasoning for using an adjacency list instead of nested dictionaries.

trivialfis commented 5 years ago

@tqchen Can you elaborate a little bit on using a graph instead of a tree?

tqchen commented 5 years ago

The main reason for using an adjacency list instead of a nested tree is ease of parsing and editing: as the tree gets deeper (say, depth 20), it becomes really hard to visually see through the tree and edit it. Many visualization libraries also depend on the adjacency-list structure.

The general idea of an adjacency list is that we mark nodes with ids and record the child ids of each node. So we will have something like the following (the index is not strictly necessary):

```
[
  {"index": 0, "children" : [1, 2]},
  {"index": 1, "children" : [4, 5]},
]
```

This kind of structure is easy to edit, and there is no problem of nested parsing, which can cause trouble with deep nesting. Another advantage is that it avoids introducing 3rd-party modules, which we usually need to be very careful about, and we can directly make use of dmlc/json.h to do the parsing.
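
For illustration, converting between the two representations is a small routine like the following sketch (node fields are simplified stand-ins, not the actual RegTree node layout):

```cpp
#include <vector>

// Simplified nested representation, for illustration only.
struct NestedNode {
  float split_value;
  NestedNode* left;
  NestedNode* right;
};

// Flat, adjacency-list style node: children are ids, -1 means "no child".
// This mirrors the integer cleft/cright fields of the in-memory trees.
struct FlatNode {
  float split_value;
  int left;
  int right;
};

// Assign ids in pre-order and record child ids instead of pointers.
int Flatten(const NestedNode* node, std::vector<FlatNode>* out) {
  int id = static_cast<int>(out->size());
  out->push_back({node->split_value, -1, -1});
  if (node->left != nullptr) {
    int left_id = Flatten(node->left, out);
    (*out)[id].left = left_id;
  }
  if (node->right != nullptr) {
    int right_id = Flatten(node->right, out);
    (*out)[id].right = right_id;
  }
  return id;
}
```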

tqchen commented 5 years ago

As you may have noticed, the current in-memory data structure is already an adjacency-list structure (with its cleft and cright fields as integers), so it is actually easy to adopt this structure in JSON as well.

KOLANICH commented 5 years ago

How about using arrays instead of objects to store nodes?

I mean

[1, 2, 1, 0.5]

instead of

{"children" : [1, 2], "defaultLeft":true, "margin": 0.5}

We trade clarity for a more compact blob.

And one more note: it may make sense to store large integers in hex.

tqchen commented 5 years ago

Since the binary format is already compact, the JSON format is mainly aimed at readability, so maybe we do not need this specialization. Notably, having clear field names makes parsing from another language easy and readable (e.g. JS). If we want a compact blob with JSON, just zip it or use the binary format.

trivialfis commented 5 years ago

@tqchen In the current prototype, the nodes of decision trees are stored in an array, which means it's already flat.

The main reason for introducing the JSON format is to make saving parameters easier. The current binary form disallows changes to how parameters are stored. And according to @hcho3, endianness is also an issue with the binary format.

tqchen commented 5 years ago

The endian issue should already be solved if we store vectors of floats, see https://github.com/dmlc/dmlc-core/pull/406

I totally support using the JSON format, though I guess we might not need third-party JSON libraries; dmlc/json.h might work just fine. See https://github.com/dmlc/tvm/blob/master/src/lang/reflection.cc#L398

KOLANICH commented 5 years ago

The main reason for introducing JSON format

IMHO it is compatibility: JSON libraries exist for almost every language, so we get an almost free (in terms of programmer effort) parser and serializer, whereas Kaitai is currently limited to some of the most popular languages and doesn't support serialization at all.