ExpandingMan commented 1 year ago

This PR is a complete rewrite of this package. The existing code dates back to the early days of Julia and it was due for a comprehensive overhaul. I realize that, having taken the liberty of rewriting everything, this may not be accepted, in which case I'd be happy to set up a fork from which you guys can take anything you'd like, however I believe most of the changes here should be uncontroversial so hopefully after a review process we can get this merged.

As of writing, this PR is mostly feature-complete, what remains is mostly documentation and unit tests, neither of which should take very long. I wanted to get the PR up now so that maintainers can get a look at it sooner rather than later. I will update the PR unambiguously when I consider it completely "done".

Summary of Changes

Removed Features

Cross validation. By now there are alternatives in the Julia ML ecosystem which are specifically designed for performing cross validation and I do not believe it is still appropriate to have a (probably poorly-maintained) alternative in this package. (See for example MLJXGBoostInterface.jl which can be used for extremely comprehensive cross-validation, well beyond what was ever available within this package.)

Features that still need to be improved

I haven't done much yet with evaluation logs, so this may not be as complete as it was before this PR.
The existing thinly-wrapped rabit calls have been removed as I didn't really understand what they were for. These methods are still accessible through the wrapped C lib. It looks like they can be added with little effort but I wanted more context before I did so. If anyone feels it's important to keep this I can look into it further.

Added Features

The xgboost C library (via XGBoost_jll) now has an automated wrapper via Clang.jl. The result is a lower-level wrapper than what was originally written, however I have taken care to use the functions from the generated wrapper only where needed, so overall the library wrapping situation should be much better (it also now includes all methods defined in the C API header file).
Comprehensive re-work of constructors. There were some pretty strange things going on with the constructors in current master probably owing to how early in the history of Julia this package was originally written. These now make a lot more sense, in particular DMatrix and Booster each have exactly 1 private constructor which takes a handle pointer and all other constructor methods wrap this. Additionally there are better DMatrix constructors for standard use cases.
DMatrix can now be constructed with arrays containing missing which libxgboost will interpret as missing data points (what exactly xgboost does with that information is still mysterious to me).
DMatrix can now be constructed from any (real-valued) Tables.jl compatible table and the column names will be stored in DMatrix as feature names.
I have added a large number of "lower-level" or "intermediate level" methods which allow interaction with DMatrix and Booster objects in ways which might be more complicated than simple calls to training and prediction, including docstrings. I have chosen names consistent with the conventions of Julia Base.
Introspection of DMatrix: sadly not much was possible here since libxgboost provides very few methods for getting data back out of DMatrix, but it is now possible to at least call Base.size and easily see set feature names.
Construction of DMatrix from iterators. This seems to involve merely caching the output of an iterator in some format that xgboost likes, though you can probably imagine that I was quite disappointed when I realized that this doesn't seem to allow any parallelization (e.g. you can't run a data-loading iterator while training is happening). Nevertheless this can still be quite useful in some situations and it is fully implemented on arbitrary Julia iterators, see also here.
Booster objects now store feature names and non-default parameters. Unfortunately these can't be retrieved from the underlying C object, but they are incredibly useful for user-facing introspection, so I decided to store them. Only non-default parameters can be stored since I store them on the Julia side, and I'm still not 100% sure what happens in cases of deserialized models, but I am convinced that what I have done here is extremely useful so that users can see the model they are using. I have not provided methods for fetching these parameters so nobody can use them for shenanigans (i.e. I am paranoid about them getting out of sync, though I think I have ensured that can't happen).
I have sanitized the update (now called update! and updateone!) training functions. These should very much be considered public and now are documented and make a much clearer distinction between inputs that are merely for logging.
Evaluation outputs now go through Julia's Logging stdlib so that they can be controlled via the API (there are also more direct ways of suppressing log output).
xgboost should now have unambiguous semantics consistent with the Booster and DMatrix constructors. It should essentially be considered an alias for Booster followed by update!.
Feature importances are now generated by the libxgboost method XGBoosterFeatureScore rather than parsing a string-dumped model (this feature probably didn't exist when the original Julia function was written). Thanks to storing feature_names in the Booster this now outputs a dictionary with correct feature names (e.g. feature names propagated all the way from a Tables.jl-compatible input table).
Models can now be dumped not only to unformatted text but to JSON strings. Parsed JSON can be used to construct the new Node objects which are AbstractTrees.jl compatible tree node objects representing the trees and are capable of containing all data that can be provided by libxgboost. Type-stable iteration over the resulting trees will be possible as soon as type-stable iteration is fixed in AbstractTrees.jl. (This can be retrieved quite simply with trees(b::Booster).)
I have added a huge amount of "reflection and introspection" stuff much of which consists of pretty terminal outputs from Term.jl. The main aspect of this is that the MIME"text/plain" show methods (default REPL output) now display highly-legible Term.jl panels which display relevant information such as features, the number of boosted rounds and non-default model hyper-parameters. A visual preview which I intend to include in an upcoming revised README can be seen here.
showimportance method for getting an easily readable summary of feature importances (again thanks to Term.jl).

Possible Future Features (after this PR)

The libxgboost functions accept CUDA array handles, so I don't think GPU integration should be particularly difficult, though we will have to change xgboost_jll to include CUDA binaries and I'm not sure how easy that is to do.
Custom objective functions currently require users to return both first and second order derivatives, so it might be nice to add an auto-diff dependency which automatically handles this.

Breaking Changes?

Surprisingly, there is likely a large class of existing use cases which this PR does not break. In particular the functions xgboost and predict can be used very similarly to how they were used before. On the other hand I have done so much here that I did not want to feel constrained by the existing API and we of course still have to grapple with the Julia-wide problem of it being ambiguous what exactly is public and what is private. I think I've struck a good compromise here. In particular MLJXGBoostWrapper.jl can be updated with very little effort.

New Dependencies

Tables.jl: This is necessary for DMatrix to take generic table-like arguments.
AbstractTrees.jl: This provides a tree-like interface to Node objects which serve as the Julia-side representation of the tree models.
JSON3.jl: There are a few places where xgboost internals take JSON arguments (in some cases I find this rather dubious, to be honest), so adding a JSON package is mandatory to fully cover the library. I'm also using it to parse the dumped models which can be output as JSON rather than text (removes all regex from this package).
Term.jl: This addition is perhaps a bit more frivolous than the above, but I think you'll agree it provides some really excellent display for introspection of the wrapped objects. In particular the trees are hard to display so nicely without Term.jl as a dependency.

TODO

[x] Unit tests! I'll definitely have to add a bunch of these for new features as well as adapting the existing ones.
[x] Documenter.jl documentation. This is more necessary now that I have docstrings so we'll need API docs.
[x] Replace whatever functionality might have gotten lost with eval callbacks.
[x] Random forest mode? Other default modes? This isn't really necessary but I think it would be nice to have. xgboost can cover a huge class of ML models and I wish this were a little more apparent from the API.

ablaom commented 1 year ago

@aviks This looks like a very valuable contribution, and the poster is a very capable and Julia-active developer. Given the scale of the changes (this is essentially a rewrite) it may be difficult to find someone willing to make a detailed review, and it would be a shame if that held this back. Since this will be tagged as breaking anyway, perhaps a shorter, testing-focused review would suffice?

aviks commented 1 year ago

Yeah, I'm not too worried about the size of the rewrite, I'll take a look in the daytime tomorrow.

My one concern is that on one hand we'll probably want such a large rewrite to be a new, breaking, major version number, but on the other hand, I've tried to be somewhat be aligned to upstream version numbers. Not sure what to do about that...something's gotta give.

ExpandingMan commented 1 year ago

I think the version number of this package should be considered a completely separate wrapper version number. It doesn't seem realistic to couple them in some way: even if you decided not to merge this you'd be stuck not making breaking changes for what would probably be a very long time as xgboost itself is quite stable, and clearly what's currently on master is very old in Julia terms.

For what it's worth, the Python package contains functionality that is not present in the base C library. Therefore they would have to either bump the version of the entire xgboost repo, have a stuck wrapper, or be rather disingenuous about semantic versioning. So, considering the content of the root library and wrappers coupling the versioning seems like a dubious proposition all around.

trivialfis commented 1 year ago

cc @dmlc/xgboost-committer If a rewrite is due, we might want to participate in this repository and see if there's anything we can help with.

ExpandingMan commented 1 year ago

Alright, this should do it.

Note that I have updated all of the CI/CD stuff with latest templates (testing, docs, tagbot, compathelper all via github actions). I'm always rather confused about how github handles that, I'm not sure if updated test and doc templates will run if you run the workflows from here.

I have set the version to 2.0. Again, I don't see any reasonable way around this. I'm willing to maintain a fork if you guys prefer, but it already sounds like there is an appetite for merging this.

Rather than updating the demo files I have added extensive documentation with Documenter.jl complete with API docs. It should more than cover what used to be covered in the demo files, so I didn't really see the point in keeping equivalents around. Of course if you feel the docs are lacking I can add more. (Again I have removed cross-validation, so there are no docs for that.)

Unit testing is not super extensive, obviously we are leaning heavily on libxgboost being well tested, but they should be somewhat more complete than they used to be, I'd like to think the coverage isn't too bad (though coverage always turns out to be less than I hope).

Note also that I set the documentation link to https://dmlc.github.io/XGBoost.jl which should be the default location for them.

Ok, so as far as I know this is done, so there won't be further actions from me until requested. Thanks!

codecov-commenter commented 1 year ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@f9793f3). Click here to learn what that means. Patch has no changes to coverable lines.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #111 +/- ## ========================================= Coverage ? 58.18% ========================================= Files ? 6 Lines ? 574 Branches ? 0 ========================================= Hits ? 334 Misses ? 240 Partials ? 0 ``` Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dmlc). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dmlc)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

ExpandingMan commented 1 year ago

I believe I fixed the docs (there was a missing Project.toml).

No idea what's with coverage, any suggestions?

Also no idea why windows failed, I half expect it to be intermittent.

ExpandingMan commented 1 year ago

Same failure twice on windows. It's occurring in libxgboost could it be a binary builder issue?

ExpandingMan commented 1 year ago

Can we run this again just to check the CI/CD changes I just made? I'm also secretly hoping that the windows tests will magically succeed this time since I have absolutely no clue what can be done about their failure.

ExpandingMan commented 1 year ago

Yay! Thanks @aviks, didn't even occur to me that windows was not running on x86_64.

aviks commented 1 year ago

The previous CI run has failed due to a syntax error in the yml file, which is now fixed

The tests seem to pass in the linux CI run, however there are exceptions in the logs like so:

┌ Error: got error during C callback for resetting iterator
│   exception =
│    MethodError: no method matching reset!(::Base.Iterators.Stateful{Vector{NamedTuple{(:X, :y), Tuple{Matrix{Float64}, Vector{Int64}}}}, Union{Nothing, Tuple{NamedTuple{(:X, :y), Tuple{Matrix{Float64}, Vector{Int64}}}, Int64}}})

Is that expected?

The Windows CI was originally defined for 32 bits, which xgboost does not support. on fixing that, it passes

aviks commented 1 year ago

That callback error -- seems to be 1.6 only.

ExpandingMan commented 1 year ago

Apparently there is a missing method (or rather missing default argument) for Iterators.reset! in 1.6 that has since been fixed. I have changed the method so that it shouldn't hit that error. You can try tests again when ready.

Btw, the reason that the 1.6 error was not actually resulting in a failure is because it is incredibly hard to test DMatrix objects since they are basically 1-directional (i.e. you can't get data back out). I suppose tests could be expanded to make it less likely that stuff like that would get through, but it would pretty much require training and testing models for every unit tests that involves DMatrix, which would be rather onerous.

trivialfis commented 1 year ago

Maybe this can help: https://github.com/dmlc/xgboost/pull/8269

ExpandingMan commented 1 year ago

Maybe this can help: dmlc/xgboost#8269

Yeah, would be really great if that gets merged, at which point we can add tests. DMatrix being so opaque is very harmful for testing.

trivialfis commented 1 year ago

It merged now.

ExpandingMan commented 1 year ago

I know this is a major overhaul so the request to review it is not a small one, but it's been a while since there's been any movement here so I'm bumping it.

trivialfis commented 1 year ago

what exactly xgboost does with that information is still mysterious to me)

XGBoost removes it.

Introspection of DMatrix

We will have 1.7 rc in near future. So maybe this PR can be based on newer version of XGBoost and utilize https://github.com/dmlc/xgboost/pull/8269

e.g. you can't run a data-loading iterator while training is happening

That happens inside XGBoost. But feel free to make suggestions.

objects now store feature names and non-default parameters. Unfortunately these can't be retrieved from the underlying C objec

They can be retrieved from the underlying c object. For feature names and types: https://xgboost.readthedocs.io/en/latest/c.html#_CPPv426XGBoosterGetStrFeatureInfo13BoosterHandlePKcP9bst_ulongPPPKc

Hyper-parameters are discarded during save_model, which is by design, see https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html . If you want to retrieve these parameter before it's being discarded, check out https://xgboost.readthedocs.io/en/latest/c.html#_CPPv423XGBoosterSaveJsonConfig13BoosterHandleP9bst_ulongPPKc

trivialfis commented 1 year ago

I can come back to this after 1.7 in xgboost if any help is needed. Will try to build some small models with this PR myself to get some intuition on how it works. In the mean while, feel free to ping me for any questions/suggestions.

rikhuijzer commented 1 year ago

So maybe this PR can be based on newer version of XGBoost and utilize dmlc/xgboost#8269

Sounds like that's a better idea for a next PR to avoid scope creep.

ExpandingMan commented 1 year ago

Are there any specific requests for changes in this PR?

If you are willing I'd prefer to address new features such as introspection of DMatrix in a future PR rather than dragging this one out.

tylerjthomas9 commented 1 year ago

I am pretty interested in this PR being merged. I am bumping it to see if anything else needs to be addressed in this PR.

aviks commented 1 year ago

OK, this PR is now merged. However, I'd still like a discussion on on version numbers. It seems to me that the python package keeps the same versioning as the C++ library, and they are released together.

If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying for not confusing end users? Opinions?

aviks commented 1 year ago

Also, @ExpandingMan would you please take a quick look at the three PRs that are open in this repo. Would it be possible to migrate any of these to the new codebase? If not, we should close them, but I'd like your eyes on them before doing that.

rikhuijzer commented 1 year ago

If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying for not confusing end users? Opinions?

Confusion is less bad than breaking in my opinion.

ExpandingMan commented 1 year ago

I have commented on these, see above. In summary: no none of them are directly applicable and they are somewhat easier to achieve now that a lower-level API is exported, however it might be nice in the future to more easily facilitate the functionality proposed in those PR's.

I don't know what's going on with documentation here... it's reporting that the job is completely successfully but all of the links are broken... anybody know what to do here? It's possible that an admin just has to declare it a used feature for this package.

ablaom commented 1 year ago

If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying for not confusing end users? Opinions?

I agree that we should tag as breaking (new major version). Perhaps a compromise would be to add 100 to the major version number and preserve the minor/patch numbers to match the source library. So, source version = 1.5.2 means julia version = 101.5.2

ExpandingMan commented 1 year ago

In my opinion it's crazy to try to make the version numbers track each other in some way, how would that even look for future versions? It's a very common practice for non-trivial wrappers to have their own version numbers. Furthermore, the libxgboost version number is already visible and easily manageable in Julia via the jll. I don't see any downsides to decoupling the version numbers, it looks like an obvious choice to me.

aviks commented 1 year ago

I fundamentally disagree, and think this is quite user-hostile, but it doesn't seem like it's worthwhile to argue this point any more.

dmlc / XGBoost.jl

massive overhaul #111

Summary of Changes

Removed Features

Features that still need to be improved

Added Features

Possible Future Features (after this PR)

Breaking Changes?

New Dependencies

TODO

Codecov Report