dmlc / XGBoost.jl

XGBoost Julia Package
Other
288 stars 110 forks source link

massive overhaul #111

Closed ExpandingMan closed 1 year ago

ExpandingMan commented 1 year ago

This PR is a complete rewrite of this package. The existing code dates back to the early days of Julia and it was due for a comprehensive overhaul. I realize that, having taken the liberty of rewriting everything, this may not be accepted, in which case I'd be happy to set up a fork from which you guys can take anything you'd like, however I believe most of the changes here should be uncontroversial so hopefully after a review process we can get this merged.

As of writing, this PR is mostly feature-complete, what remains is mostly documentation and unit tests, neither of which should take very long. I wanted to get the PR up now so that maintainers can get a look at it sooner rather than later. I will update the PR unambiguously when I consider it completely "done".

Summary of Changes

Removed Features

Features that still need to be improved

Added Features

Possible Future Features (after this PR)

Breaking Changes?

Surprisingly, there is likely a large class of existing use cases which this PR does not break. In particular the functions xgboost and predict can be used very similarly to how they were used before. On the other hand I have done so much here that I did not want to feel constrained by the existing API and we of course still have to grapple with the Julia-wide problem of it being ambiguous what exactly is public and what is private. I think I've struck a good compromise here. In particular MLJXGBoostWrapper.jl can be updated with very little effort.

New Dependencies

TODO

ablaom commented 1 year ago

@aviks This looks like a very valuable contribution, and the poster is a very capable and Julia-active developer. Given the scale of the changes (this is essentially a rewrite) it may be difficult to find someone willing to make a detailed review, and it would be a shame if that held this back. Since this will be tagged as breaking anyway, perhaps a shorter, testing-focused review would suffice?

aviks commented 1 year ago

Yeah, I'm not too worried about the size of the rewrite, I'll take a look in the daytime tomorrow.

My one concern is that on one hand we'll probably want such a large rewrite to be a new, breaking, major version number, but on the other hand, I've tried to be somewhat be aligned to upstream version numbers. Not sure what to do about that...something's gotta give.

ExpandingMan commented 1 year ago

I think the version number of this package should be considered a completely separate wrapper version number. It doesn't seem realistic to couple them in some way: even if you decided not to merge this you'd be stuck not making breaking changes for what would probably be a very long time as xgboost itself is quite stable, and clearly what's currently on master is very old in Julia terms.

For what it's worth, the Python package contains functionality that is not present in the base C library. Therefore they would have to either bump the version of the entire xgboost repo, have a stuck wrapper, or be rather disingenuous about semantic versioning. So, considering the content of the root library and wrappers coupling the versioning seems like a dubious proposition all around.

trivialfis commented 1 year ago

cc @dmlc/xgboost-committer If a rewrite is due, we might want to participate in this repository and see if there's anything we can help with.

ExpandingMan commented 1 year ago

Alright, this should do it.

Note that I have updated all of the CI/CD stuff with latest templates (testing, docs, tagbot, compathelper all via github actions). I'm always rather confused about how github handles that, I'm not sure if updated test and doc templates will run if you run the workflows from here.

I have set the version to 2.0. Again, I don't see any reasonable way around this. I'm willing to maintain a fork if you guys prefer, but it already sounds like there is an appetite for merging this.

Rather than updating the demo files I have added extensive documentation with Documenter.jl complete with API docs. It should more than cover what used to be covered in the demo files, so I didn't really see the point in keeping equivalents around. Of course if you feel the docs are lacking I can add more. (Again I have removed cross-validation, so there are no docs for that.)

Unit testing is not super extensive, obviously we are leaning heavily on libxgboost being well tested, but they should be somewhat more complete than they used to be, I'd like to think the coverage isn't too bad (though coverage always turns out to be less than I hope).

Note also that I set the documentation link to https://dmlc.github.io/XGBoost.jl which should be the default location for them.

Ok, so as far as I know this is done, so there won't be further actions from me until requested. Thanks!

codecov-commenter commented 1 year ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@f9793f3). Click here to learn what that means. Patch has no changes to coverable lines.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #111 +/- ## ========================================= Coverage ? 58.18% ========================================= Files ? 6 Lines ? 574 Branches ? 0 ========================================= Hits ? 334 Misses ? 240 Partials ? 0 ``` Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dmlc). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dmlc)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

ExpandingMan commented 1 year ago

I believe I fixed the docs (there was a missing Project.toml).

No idea what's with coverage, any suggestions?

Also no idea why windows failed, I half expect it to be intermittent.

ExpandingMan commented 1 year ago

Same failure twice on windows. It's occurring in libxgboost could it be a binary builder issue?

ExpandingMan commented 1 year ago

Can we run this again just to check the CI/CD changes I just made? I'm also secretly hoping that the windows tests will magically succeed this time since I have absolutely no clue what can be done about their failure.

ExpandingMan commented 1 year ago

Yay! Thanks @aviks, didn't even occur to me that windows was not running on x86_64.

aviks commented 1 year ago

The previous CI run has failed due to a syntax error in the yml file, which is now fixed

The tests seem to pass in the linux CI run, however there are exceptions in the logs like so:

┌ Error: got error during C callback for resetting iterator
│   exception =
│    MethodError: no method matching reset!(::Base.Iterators.Stateful{Vector{NamedTuple{(:X, :y), Tuple{Matrix{Float64}, Vector{Int64}}}}, Union{Nothing, Tuple{NamedTuple{(:X, :y), Tuple{Matrix{Float64}, Vector{Int64}}}, Int64}}})

Is that expected?

The Windows CI was originally defined for 32 bits, which xgboost does not support. on fixing that, it passes

aviks commented 1 year ago

That callback error -- seems to be 1.6 only.

ExpandingMan commented 1 year ago

Apparently there is a missing method (or rather missing default argument) for Iterators.reset! in 1.6 that has since been fixed. I have changed the method so that it shouldn't hit that error. You can try tests again when ready.

Btw, the reason that the 1.6 error was not actually resulting in a failure is because it is incredibly hard to test DMatrix objects since they are basically 1-directional (i.e. you can't get data back out). I suppose tests could be expanded to make it less likely that stuff like that would get through, but it would pretty much require training and testing models for every unit tests that involves DMatrix, which would be rather onerous.

trivialfis commented 1 year ago

Maybe this can help: https://github.com/dmlc/xgboost/pull/8269

ExpandingMan commented 1 year ago

Maybe this can help: dmlc/xgboost#8269

Yeah, would be really great if that gets merged, at which point we can add tests. DMatrix being so opaque is very harmful for testing.

trivialfis commented 1 year ago

It merged now.

ExpandingMan commented 1 year ago

I know this is a major overhaul so the request to review it is not a small one, but it's been a while since there's been any movement here so I'm bumping it.

trivialfis commented 1 year ago

what exactly xgboost does with that information is still mysterious to me)

XGBoost removes it.

Introspection of DMatrix

We will have 1.7 rc in near future. So maybe this PR can be based on newer version of XGBoost and utilize https://github.com/dmlc/xgboost/pull/8269

e.g. you can't run a data-loading iterator while training is happening

That happens inside XGBoost. But feel free to make suggestions.

objects now store feature names and non-default parameters. Unfortunately these can't be retrieved from the underlying C objec

They can be retrieved from the underlying c object. For feature names and types: https://xgboost.readthedocs.io/en/latest/c.html#_CPPv426XGBoosterGetStrFeatureInfo13BoosterHandlePKcP9bst_ulongPPPKc

Hyper-parameters are discarded during save_model, which is by design, see https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html . If you want to retrieve these parameter before it's being discarded, check out https://xgboost.readthedocs.io/en/latest/c.html#_CPPv423XGBoosterSaveJsonConfig13BoosterHandleP9bst_ulongPPKc

trivialfis commented 1 year ago

I can come back to this after 1.7 in xgboost if any help is needed. Will try to build some small models with this PR myself to get some intuition on how it works. In the mean while, feel free to ping me for any questions/suggestions.

rikhuijzer commented 1 year ago

So maybe this PR can be based on newer version of XGBoost and utilize dmlc/xgboost#8269

Sounds like that's a better idea for a next PR to avoid scope creep.

ExpandingMan commented 1 year ago

Are there any specific requests for changes in this PR?

If you are willing I'd prefer to address new features such as introspection of DMatrix in a future PR rather than dragging this one out.

tylerjthomas9 commented 1 year ago

I am pretty interested in this PR being merged. I am bumping it to see if anything else needs to be addressed in this PR.

aviks commented 1 year ago

OK, this PR is now merged. However, I'd still like a discussion on on version numbers. It seems to me that the python package keeps the same versioning as the C++ library, and they are released together.

If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying for not confusing end users? Opinions?

aviks commented 1 year ago

Also, @ExpandingMan would you please take a quick look at the three PRs that are open in this repo. Would it be possible to migrate any of these to the new codebase? If not, we should close them, but I'd like your eyes on them before doing that.

rikhuijzer commented 1 year ago

If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying for not confusing end users? Opinions?

Confusion is less bad than breaking in my opinion.

ExpandingMan commented 1 year ago

I have commented on these, see above. In summary: no none of them are directly applicable and they are somewhat easier to achieve now that a lower-level API is exported, however it might be nice in the future to more easily facilitate the functionality proposed in those PR's.

I don't know what's going on with documentation here... it's reporting that the job is completely successfully but all of the links are broken... anybody know what to do here? It's possible that an admin just has to declare it a used feature for this package.

ablaom commented 1 year ago

If we do that here, we are taking liberties with SemVer, but maybe that's a cost worth paying for not confusing end users? Opinions?

I agree that we should tag as breaking (new major version). Perhaps a compromise would be to add 100 to the major version number and preserve the minor/patch numbers to match the source library. So, source version = 1.5.2 means julia version = 101.5.2

ExpandingMan commented 1 year ago

In my opinion it's crazy to try to make the version numbers track each other in some way, how would that even look for future versions? It's a very common practice for non-trivial wrappers to have their own version numbers. Furthermore, the libxgboost version number is already visible and easily manageable in Julia via the jll. I don't see any downsides to decoupling the version numbers, it looks like an obvious choice to me.

aviks commented 1 year ago

I fundamentally disagree, and think this is quite user-hostile, but it doesn't seem like it's worthwhile to argue this point any more.