@trivialfis Are we thinking of something like `dump_model(dump_format='json')`, except that we'd be able to both read and write the new model format? (JSON dump is currently write-only)
@hcho3 `Save()` and `Load()`, with `dmlc::JSONWriter*` instead of `dmlc::Stream*` as the parameter.
@trivialfis Awesome! We can add it as a plug-in.
@hcho3 My plan is to replace the old binary `Save` and `Load` in a few releases, so currently I'm trying to patch the code directly. When it's done, XGBoost will save to JSON by default. As long as we keep the old binary `Load` function in the code, new releases of XGBoost can still read the old binary format. WDYT?
Since JSON support is going to be experimental for a while, I think we should make it a plug-in. That way, you can merge the experimental JSON parsing into master right away, so that people can start experimenting with it. (I'd like to use it too!)
Here's my idea:
* Add a plugin option in CMake:
```cmake
option(PLUGIN_JSON_MODEL_PARSER "Use JSON for model input and output" OFF)
if(PLUGIN_JSON_MODEL_PARSER)
  list(APPEND SOURCES plugin/json_model_parser/json_model_parser.cc)
  add_definitions(-DXGBOOST_USE_JSON_MODEL_PARSER)
endif()
```
* Intercept `Load()` and `Save()` methods in `src/learner.cc` using the `XGBOOST_USE_JSON_MODEL_PARSER` macro:
```cpp
void Load(dmlc::Stream* fi) override {
#ifdef XGBOOST_USE_JSON_MODEL_PARSER
  dmlc::istream is(fi);
  std::unique_ptr<dmlc::JSONReader> reader(new dmlc::JSONReader(&is));
  JSONModelParser::Load(this, reader.get());
  return;
#endif
  // existing code...
}

void Save(dmlc::Stream* fo) const override {
#ifdef XGBOOST_USE_JSON_MODEL_PARSER
  dmlc::ostream os(fo);
  std::unique_ptr<dmlc::JSONWriter> writer(new dmlc::JSONWriter(&os));
  JSONModelParser::Save(this, writer.get());
  return;
#endif
  // existing code...
}
```
For this to work, `JSONModelParser` needs to be a friend of the `LearnerImpl` class.
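A minimal sketch of where that friend declaration would go (a stripped-down stand-in for the class in `src/learner.cc`, assuming the plugin class is named `JSONModelParser` as above):

```cpp
// Hypothetical, stripped-down version of the class in src/learner.cc.
class JSONModelParser;  // the plugin declared above

class LearnerImpl {
 private:
  // Private model state that the plugin needs to read and write.
  int num_feature_ = 0;

#ifdef XGBOOST_USE_JSON_MODEL_PARSER
  // Grant the plugin access to LearnerImpl's private members.
  friend class JSONModelParser;
#endif
};
```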
EDIT. updated to fix typo
@hcho3 Good idea. This way we don't even need to change the interface.
@trivialfis Cool. I'll look forward to your pull request.
I don't think that a text-based format is a good choice. Even if we want to store models in git, it makes no sense to store them in a text format, because they change completely on each training run, so there is no benefit from storing deltas and we would still have to use git-lfs.
So a binary format is preferable as the main format: it is more efficient and more easily parseable.
I'd argue for text models, because binary models introduce a long-term compatibility and maintenance burden. Case in point: there have recently been requests for adding parameters to the model that cannot be satisfied easily because of backward compatibility (see [1]), and I remember users asking for the number of samples in each leaf.
A JSON model would make it much easier to extend. Having a binary specification with explicit versions would also help greatly, see e.g. DataSketches. Other options like exporting/importing to JPMML or ONNX (not sure if they support decision trees) could be left for other repositories.
because binary models introduce a long-term compatibility and maintenance burden
That depends on the design. Developers usually make the format extensible enough using a tag-size-contents pattern or just by checking a version field.
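For illustration, a minimal sketch (not XGBoost code) of the tag-size-contents idea; the record layout and the `WriteRecord` helper are hypothetical, just to show how readers can skip unknown tags, which is what makes such a format extensible:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tag-size-contents record: a reader that does not recognize
// a tag can simply skip `size` bytes and keep going, so new fields can be
// added without breaking old readers.
void WriteRecord(std::vector<uint8_t>* out, uint32_t tag,
                 const void* data, uint32_t size) {
  auto append = [&](const void* p, size_t n) {
    const auto* bytes = static_cast<const uint8_t*>(p);
    out->insert(out->end(), bytes, bytes + n);
  };
  append(&tag, sizeof(tag));    // tag: identifies the field
  append(&size, sizeof(size));  // size: payload length in bytes
  append(data, size);           // contents: the raw payload
}
```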
cannot be satisfied easily because of backward compatibility,
Then I guess we should design a new binary format with extensibility in mind and deprecate the old one. We will have to have some schema anyway, so there is no need to use general-purpose formats like BSON/msgpack.
Other options like exporting/importing to JPMML or ONNX (not sure if they support decision trees) could be left for other repositories.
ONNX supports decision trees, but it is a general-purpose format optimized for arbitrary computation graphs.
I had someone tell me that the XGBoost model has an endian compatibility issue: models trained on a little-endian machine cannot be loaded on a big-endian machine. Endian compatibility is one reason to favor a text-based format (as long as we stick to UTF-8).
Also note that LightGBM uses an INI-like format for model serialization.
Endian compatibility is one reason to favor a text-based format (as long as we stick to UTF-8)
Endian compatibility is not a big problem. We can just say that little-endian should be used and convert the byte order on serialization, or we can save the endianness in the format and convert when it doesn't match the machine's. We can still lay raw structures over memory-mapped files; we just need a conversion routine.
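As a rough sketch of such a conversion routine (hypothetical helper names, not XGBoost code), byte-swapping on load only when the stored endianness doesn't match the machine's:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Swap the byte order of a single value in place.
template <typename T>
void ByteSwap(T* value) {
  auto* bytes = reinterpret_cast<uint8_t*>(value);
  std::reverse(bytes, bytes + sizeof(T));
}

// Convert an array loaded from disk into native byte order, assuming the
// file recorded whether it was written as little-endian.
template <typename T>
void ToNativeOrder(T* data, size_t n, bool file_is_little_endian) {
  const uint16_t probe = 1;
  const bool machine_is_little_endian =
      *reinterpret_cast<const uint8_t*>(&probe) == 1;
  if (file_is_little_endian != machine_is_little_endian) {
    for (size_t i = 0; i < n; ++i) ByteSwap(&data[i]);
  }
}
```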
@KOLANICH That's a fair point. I think a good way to proceed is to implement multiple parser plugins and then hold a beauty contest to choose the best serialization format. The criteria are:
Personally, I'd eliminate Protocol Buffers from consideration. I've tried it in another project (dmlc/treelite) and it takes a long time to compile from source.
I ended up re-implementing a JSON library because it's too hard to get `dmlc::JSONReader` working without introducing breaking changes (the `Writer` is fine).
As for other formats, the simple facts are:
So a JSON representation can be both a middle and an end format. One can transform its in-memory representation however they like. This can also be viewed as a reason for re-implementing a JSON library, since the one in dmlc doesn't have a concrete data structure (it's a reader and a writer over a stream, not exactly a data structure).
@trivialfis You may be interested in reading this discussion: https://github.com/Microsoft/LightGBM/issues/372
@Laurae2 Any opinion on model format (binary vs text)?
@hcho3 I kept the binary model and it's still the default. I thought about the performance issue, and it turns out there are 3rd-party specifications for binary forms of JSON. Just searching for "binary JSON" should return a lot of them.
@trivialfis Thanks for the pointer. BSON looks like a good choice for binary encoding of JSON, given its permissive license (Apache 2.0) and support from MongoDB.
Also, Microsoft/LightGBM#1083 uses multi-threading to speed up serialization of large models. I'll leave more detailed comments in #3930 about the potential use of OpenMP for serializing trees in parallel.
@trivialfis MessagePack looks like a good one as well.
I am now fully on board with using a JSON-like format for serialization. It's extensible and (with proper binary encoding) can be made performant.
I don't think that it is necessary.
So there are 4 options:
1. use a custom format entirely;
2. use a custom format with embedded msgpack for metadata;
3. use BSON/msgpack with raw extensions;
4. use BSON/msgpack entirely.
@KOLANICH We can go with option 4. Note that I said "JSON-like format" in my earlier post. BSON/MessagePack is a binary encoding of JSON.
efficiency. There will be overhead for most of integer and float storage
I'm not that concerned with storage efficiency. With respect to serialization performance, BSON/MSGPACK seems to be decent. See https://github.com/ludocode/schemaless-benchmarks.
Okay, I will focus the discussion in this issue rather than #3930. My proposal is:
During serialization, we will have a middle form of our models. In this PR, it's the Json class, which can easily be translated into our actual models with all parameters and model-specific data preserved. Adding new parameters/models/data won't cause trouble, since the class that makes the changes knows how to deal with them by itself. The old binary dump doesn't have such a representation: to add a new field, one has to hack the learner model parameter very carefully, and there's no way to propagate loaded data (except string parameters) to other classes.
How the Json class is stored on disk doesn't affect the Json class itself; you can traverse the Json node tree and produce a completely different output format, say BSON. What I meant, specifically, is this Json class.
Implement a format-independent data structure that stores our models in memory during serialization, as the middle form.
In other words, we want nested key-value stores as intermediate representation. We don't need full JSON for this purpose. Let me try to come up with a simpler alternative by tomorrow.
The idea is to 1) have all components generate nested key-value stores and 2) have the serializer convert them into the end format, such as JSON. This design ensures that we can easily switch the end format later (JSON -> MSGPACK). For 2), we can use a 3rd-party library to generate fully compliant JSON files.
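A minimal sketch of what such an intermediate representation could look like (the `KVNode` name and layout are hypothetical, not the actual design):

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical nested key-value node: every component fills one of these,
// and a separate serializer walks the tree to emit JSON, MessagePack, or
// any other end format. Swapping the end format later only means writing
// a new walker, not touching the models.
struct KVNode {
  enum class Kind { kString, kNumber, kArray, kObject };
  Kind kind = Kind::kObject;
  std::string str_value;                                  // used when kind == kString
  double num_value = 0.0;                                 // used when kind == kNumber
  std::vector<std::unique_ptr<KVNode>> items;             // used when kind == kArray
  std::map<std::string, std::unique_ptr<KVNode>> fields;  // used when kind == kObject
};
```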
@hcho3 Nice.
@trivialfis BTW, good job with RTTI. I find it educational and edifying. This will come in handy for the nested key-value store I'm designing.
In other words, we want nested key-value stores as intermediate representation.
Do we really need them nested? Nested stores can be used for general-purpose schemaless serialization, but our data has a schema, and it depends on native compiled code.
I mean that we know in advance that a tree has certain integer parameters like the feature number, the split margin and the subtree ids. We don't need to put arbitrary data into a subtree, and even if we do, the code will be unable to use it unless it is explicitly written to do so. This means the data is strongly typed, and introducing weakly-typed stuff like JSON would mean the additional burden of type checking and error handling.
additional burden of type checking and error handling
@KOLANICH To me this is a feature, not a bug. Right now, the parser will happily read whatever the binary stream provides. We do have some sanity checks (`num_node` should not be negative, etc.) but there are many holes. With the new proposal, we're forced to handle errors. If we use a 3rd-party library such as RapidJSON, error handling should not be too difficult.
general-purpose schemaless serialization, but our data has a schema, and it depends on native compiled code
But we want to be able to add new fields in the future without breaking backward compatibility. I am choosing JSON (or its binary encoding) because I don't want to deal with the hassle of custom-designing a binary format.
@hcho3 Nice, and thanks. If you are going to carry out the implementation, I will try to see whether it's possible to merge some parts of #3930 into the whole restructuring; it's already a working parser/dumper with a full-blown key-value structure. Please keep me posted. :)
@KOLANICH It all comes down to less fuss in the future. Personally I want to spend more time on algorithms instead of this restructuring, bug-fixing and backward-compatibility drama. So I will prefer any idea that brings fewer headaches in the future.
With the new proposal, we're forced to handle errors.
But the set of errors you will have to check for would be much wider. Right now the checks you have to perform are range checks, but with a general-purpose format you will also have to check types and names (assuming that the format is designed so that the burden of range checking falls on the serialization library), that types match names, and that it is not possible to mess with their order to maliciously interfere with the program (with a binary format this comes for free: you just parse what you expect).
To me, range checking looks like an easier task than that.
But with a general-purpose format you will have to check types and names (assuming that the format is designed so that the burden of range checking falls on the serialization library), that types match names and their order (with a binary format this comes for free: you just parse what you expect).
That's the good thing about having an algorithmic structure, please see #3930: all the checks are seamlessly built into the structure, and anything wrong will be caught. For example, if you have the wrong type, you won't be able to convert it into an actual value, since the type information is encoded into the structure by the RTTI implementation.
But the set of errors you will have to check for would be much wider.
Not really. Each class will look for a list of names (raising an error if a name is not found), and then check whether the found items have the right types (raising an error if a name has the wrong type). And the order of fields doesn't matter, as JSON objects are unordered.
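As a rough illustration of that flow, here is a sketch using RapidJSON (mentioned above) purely as an example library; the `GetRequiredInt` helper is hypothetical:

```cpp
#include <stdexcept>
#include <string>
#include "rapidjson/document.h"

// Look up an expected name and check its type: raise an error if the
// name is missing or the value has the wrong type, otherwise return it.
int GetRequiredInt(const rapidjson::Document& doc, const std::string& name) {
  if (!doc.HasMember(name.c_str())) {
    throw std::runtime_error("missing field: " + name);
  }
  if (!doc[name.c_str()].IsInt()) {
    throw std::runtime_error("wrong type for field: " + name);
  }
  return doc[name.c_str()].GetInt();
}
```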
And the order of fields doesn't matter, as JSON objects are unordered.
That's true if we parse the format entirely first and then use it as a key-value store, but it has some overhead related to search. There is another approach, where parsing is done on the fly and our function is called on each record; then the arguments would likely be processed in the order they appear in the file. The argument about ordering is accepted.
@KOLANICH
but it has some overhead related to search
Most implementations of `std::map` are red-black trees, so lookup is O(log n). In XGBoost each nested level should contain fewer than 10 entries; with n bounded by 10, it's effectively O(1). The overhead is a non-issue.
where parsing is done on the fly and our function is called on each record
That would be a whole rewrite of XGBoost. For dealing with large data blobs, the better approach is to use a B-tree-like data format, for example HDF5. We don't need to concern ourselves with that.
@trivialfis And we can use `std::unordered_map` for greater efficiency: http://supercomputingblog.com/windows/ordered-map-vs-unordered-map-a-performance-study/
@hcho3 Actually, I doubt you could tell the difference between a naive linked list and a sophisticated hash table in our case; as commented above, it's a very small O(1) operation. The real concern I have while implementing it is memory usage. Copying models into another structure is a real pain, especially with everything serialized.
The real concern I have while implementing it is memory usage. Copying models into another structure is a real pain, especially with everything serialized.
Let's discuss this once I have a concrete proposal for nested key-value stores. I think it'll be something like what you have, but a bit simplified.
@hcho3 Great. I can try to drop `Json::Dump`, `Json::Load` and `Json::Null` tomorrow after seeing your proposal, and see whether it fits. If not, then I will close that PR. All seems pretty good.
Given that we already have a binary format, I think it is fine to just go ahead and support a JSON version. The two versions are likely going to be sufficient.
For storing trees, my recommendation would be not to store them as a nested data structure, but instead as an adjacency list, just like the NNVM graph format. It should not be hard to convert back and forth.
@tqchen A prototype is available at #3930.
store them as adjacency list
I'd like to hear your reasoning about using adjacency list instead of nested dictionaries.
@tqchen Can you elaborate a little bit on using graph instead of tree?
The main reason for using an adjacency list instead of a nested tree is the ease of parsing and editing, because once the tree gets deep (say depth 20) it is really hard to visually follow the tree and edit it. Many of the visualization libraries also depend on the adjacency-list structure.
The general idea of the adjacency list is that we mark nodes with ids and record the child ids of each node. So we will have something like the following (the index is not really necessary):
```json
[
  {"index": 0, "children": [1, 2]},
  {"index": 1, "children": [4, 5]}
]
```
This kind of structure is easy to edit, and there is no nested parsing, which can create trouble with deep nesting. Another advantage is that it avoids introducing 3rd-party modules, which we usually need to be very careful about, and we can directly make use of dmlc/json.h to do the parsing.
As you may have noticed, the current in-memory data structure is already an adjacency-list structure (with its `cleft`, `cright` fields as integers), so it is actually easier to adopt this structure in JSON as well.
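As a rough sketch (hypothetical `Node` layout, not the actual `RegTree` code), here is how a flat node array with integer `cleft`/`cright` fields could be dumped into the adjacency-list JSON shown above:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical flat node: mirrors the idea of cleft/cright stored as
// integer indices, with -1 marking a leaf.
struct Node {
  int cleft = -1;
  int cright = -1;
};

// Emit one JSON object per node with its index and the indices of its
// children, in the adjacency-list form sketched above.
std::string DumpAdjList(const std::vector<Node>& nodes) {
  std::ostringstream os;
  os << "[\n";
  for (size_t i = 0; i < nodes.size(); ++i) {
    os << "  {\"index\": " << i << ", \"children\": [";
    if (nodes[i].cleft != -1) {
      os << nodes[i].cleft << ", " << nodes[i].cright;
    }
    os << "]}" << (i + 1 < nodes.size() ? "," : "") << "\n";
  }
  os << "]";
  return os.str();
}
```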
How about using arrays instead of objects to store nodes?
I mean
`[1, 2, 1, 0.5]`
instead of
`{"children": [1, 2], "defaultLeft": true, "margin": 0.5}`
We trade clarity for a more compact blob.
And one more note: it may make sense to store large integers in hex.
Since the binary format is already compact, the JSON format is mainly aimed at readability, so maybe we do not have to do this specialization. Notably, having clear field names makes parsing from another language easy and readable (e.g. JS). If we want a compact blob with JSON, just zip it or use the binary format.
@tqchen In the current prototype, the nodes of decision trees are stored in an array, which means it's flat.
The main reason for introducing the JSON format is to make saving parameters easier. The current binary form disallows changes in how parameters are stored. And according to @hcho3, endianness is also an issue with the binary format.
The endian issue should already be solved if we store vectors of floats, see https://github.com/dmlc/dmlc-core/pull/406
I totally support using the JSON format, though I guess we might not need third-party JSON libraries; dmlc/json.h might work just fine. See https://github.com/dmlc/tvm/blob/master/src/lang/reflection.cc#L398
The main reason for introducing JSON format
IMHO it is compatibility: JSON libraries are present for almost every language, so we get an almost free (in terms of programmer's effort) parser and serializer, while Kaitai is currently limited to some of the most popular languages and doesn't support serialization at all yet.
As discussed in #3878 and #3886, we might want a more extensible format for saving the XGBoost model.
For now my plan is to use the `JSONReader` and `JSONWriter` implemented in `dmlc-core` to add experimental support for saving/loading the model to a JSON file. Since the related area of the code is quite messy and dangerous to change, I want to share my plan, and possibly an early PR, as soon as possible so that someone can point out my mistakes earlier (there will be mistakes) and we don't duplicate work. :) @hcho3