dmlc / XGBoost.jl

XGBoost Julia Package

Revise predict to allow all 'type' #172

Open bobaronoff opened 1 year ago

bobaronoff commented 1 year ago

These changes would allow a user access to a feature of libxgboost that reports feature contributions and/or interactions at the record level. This can be useful for Shapley-type analyses. The additional data is obtained via the 'type' parameter in Lib.XGBoosterPredictFromDMatrix().

A 'type' parameter is added to predict(); input values are 0 through 6. The meaning associated with each value is described in the docstring. This is an optional parameter with a default of 0, which produces the normal output.
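As a rough usage sketch (hypothetical data and keyword placement; the exact mapping of values to prediction modes is spelled out in the docstring, and the final signature may differ):

using XGBoost

X = randn(100, 4)          # hypothetical feature matrix
y = rand(0:1, 100)         # hypothetical binary labels

bst = xgboost((X, y), num_round=10, objective="binary:logistic")

preds    = predict(bst, X)           # type defaults to 0: ordinary predictions
contribs = predict(bst, X, type=2)   # e.g. per-record feature contributions (Shapley-style)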

I was not certain what to do with the old parameter 'margin'. I removed it since it is redundant to the 'type' specification, although this might cause issues for others who have it in existing code.

There is significant variation in the dimensions of the returned data depending on the 'type' and the booster objective (multi-class objectives return an extra dimension). Types 2 and 3 return a 2-dimensional array; types 4 and 5 return a 3-dimensional array. transpose() fails on a 3-dimensional array, so it is replaced with permutedims(). This creates a trade-off in that permutedims() reallocates memory for the array, although the Matrix type is more robust than the Transpose type. For normal prediction (i.e. 'type'=0, where the return is a vector), there is no additional allocation, so this should not impact operations that call predict many times (e.g. for the creation of learning curves or cross validation).
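For context on the trade-off: transpose wraps the array lazily, while permutedims copies it, e.g.

A = rand(Float32, 3, 4)

t = transpose(A)      # lazy wrapper (a LinearAlgebra.Transpose); no copy of the data
p = permutedims(A)    # allocates a new 4×3 Matrix{Float32} with the data rearranged

# permutedims also generalizes to higher ranks, e.g. reversing all dimensions of a 3-d array:
B = rand(Float32, 2, 3, 4)
q = permutedims(B, (3, 2, 1))   # allocates a new 4×3×2 Array{Float32, 3}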

Bob

bobaronoff commented 1 year ago

I apologize. This is my third submission. I am unable to figure out how to squash my commits in order for them to be reviewed. This is my first PR ever.

ExpandingMan commented 1 year ago

Don't worry, we can work with you to get it through. You seem to understand what needs to be done here so I'm sure it won't take us long.

Some comments here:

bobaronoff commented 1 year ago

I will work on a new version - it should not take long. I do not understand how to submit a revised PR as opposed to creating a new PR. I am doing this via the GitHub web interface, as I am flummoxed by the git command line interface.

It is easy to place margin back in and take out the parameter-checking lines.

I understand the implications of permutedims vs transpose. I can best illustrate the need with the following snippet.

julia> mat3d=ones(5,5,5);

julia> transpose(mat3d)
ERROR: transpose not defined for Array{Float64, 3}. Consider using `permutedims` for higher-dimensional arrays.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] transpose(a::Array{Float64, 3})
   @ LinearAlgebra /Applications/Julia-1.8.4.app/Contents/Resources/julia/share/julia/stdlib/v1.8/LinearAlgebra/src/transpose.jl:4
 [3] top-level scope
   @ REPL[6]:1

The return shape for predict types 4 and 5 has dims=3 (there is a 2-dimensional array for each feature). If transpose is left in place, the function will throw an error in those circumstances. Admittedly, these features will probably not be used to a great extent, but since types 4 and 5 are supported by libxgboost I thought they should be an option for users.

If you remember, transpose was added when I detected that predictions for multi:softprob were 'scrambled'. With that correction, the return data is a Transpose (instead of a Matrix). Many of the operations I wanted to perform that work on Matrix did not work on Transpose, and I had to convert to Matrix anyway, which reallocates.

The bottom line is that sometimes the return data will have dims=3 and transpose will fail. As the function is written, when the return has one dimension (i.e. types 0 and 1 with all objectives except multi:) there is no call to transpose (or permutedims). This is probably the most common case and has no added allocation expense. I suppose there could be differing adjustments for 1- (i.e. none), 2- (i.e. transpose), and 3-dimensional (i.e. permutedims) arrays, but this makes it complicated for the user to keep the return types straight - I think they're likely to need to convert Transpose to Matrix anyway. I will defer to whatever you think best.

bobaronoff commented 1 year ago

I was able to make some modifications. They are showing above, so I guess I don't need to do anything special to revise the PR.

margin is re-added. The default for type is nothing. When type is set to a value, it will override margin. The docstring is revised.
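Roughly, the intended behavior (illustrative only; see the docstring for the exact semantics):

predict(bst, X)                         # type == nothing: current behavior is unchanged
predict(bst, X, margin=true)            # the legacy margin keyword is still honored
predict(bst, X, margin=true, type=2)    # an explicit type overrides margin (here: feature contributions)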

I held off on transpose vs. permutedims until you have a chance to review my comments above and advise how you'd like me to proceed.

Bob

ExpandingMan commented 1 year ago

Gah, I did not realize that some of the outputs were higher rank. That's kind of a big problem; now we are going to need to be really careful.

I definitely don't think we can just go and do permutedims by default; it's just too expensive for like 99% of use cases. I suppose we could check for dims == 2... my only concern with that is that it would make this function "even more type unstable", though this might be a foolish worry since it is already essentially Union{Array{Float32}, Transpose{<:Array{Float32}}}...

Anyway, this otherwise looks good to me. Would you like to change this so that it does transpose for dims == 2, and permutedims for dims > 2? I'm not 100% decided that this is what we should be doing yet, but I don't think we are going to have much choice.
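Roughly what I have in mind, as a sketch (hypothetical helper, not final code):

# o is the raw prediction array coming back from libxgboost
function _reorient(o::AbstractArray)
    ndims(o) == 1 && return o             # vectors: return as-is, no copy
    ndims(o) == 2 && return transpose(o)  # matrices: lazy, non-allocating
    permutedims(o, reverse(1:ndims(o)))   # rank > 2: allocating copy
end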

bobaronoff commented 1 year ago

I think it best if we leave the current 'predict' alone and rename my version; I chose 'predictbytype'. It avoids all the issues you correctly point out. Altering 'type' is only for the Shapley values, and up until now I am the only one who has asked. This would be a function you could point the occasional user to if they request Shapley; if enough ask, then we can consider rolling it into one function. If you think a separate function is a bad idea, that is okay. Studying your code, I've gotten better at writing personal functions that leverage Lib.jl directly :-)

As an aside, many of the current XGBoost papers (in the medical literature) are showing Shapley data; it has come to be expected. Leveraging the libxgboost routines makes these analyses a very simple endeavor. To my eye, Shapley-based dependency and partial dependency plots are qualitatively very similar, but Shapley is the train on the tracks - how can you argue with 'game theory'?

bobaronoff commented 1 year ago

Following up on your reply (although I don't see it above). My thinking is that having a function return three different types makes a complex scenario even worse.

In 90+% of cases the return rank will be 1, and there will be no transpose or reallocation. Under 'normal' use, the return rank will be 2 only for multi-class models; transpose 'fixes' the multi-class return without reallocation.

What came up for me, though, is that the return type is LinearAlgebra.Transpose and not Matrix (not a subtype!). Operations/functions that work with Matrix do not necessarily work with LinearAlgebra.Transpose; I needed to convert to Matrix, which I assume reallocates memory, so in my situation it was 'pay me now or pay me later'.

This could be due to my inexperience. I am not familiar with MLJ in general, or with the specifics of how it implements multi-class XGBoost models, particularly with cross validation (where predict is called repeatedly). Assuming it functions fine with LinearAlgebra.Transpose (without converting to Matrix), then transpose is the only way to go.

Technically, transpose vs. permutedims is based on rank; however, in the context of this function it depends on 'type' (types 0, 1 => transpose; types 2, 3, 4, 5 => permutedims). When things get complicated, my tendency is: if it isn't broken, why fix it? The current 'predict' is working for users. This is why I think it best to start with two functions, one that does not reallocate and one that does. Unless you're after Shapley values, there is no need for the user to change a thing.

I have taken up too much of your time.

ExpandingMan commented 1 year ago

> What came up for me, though, is that the return type is LinearAlgebra.Transpose and not Matrix (not a subtype!). Operations/functions that work with Matrix do not necessarily work with LinearAlgebra.Transpose; I needed to convert to Matrix, which I assume reallocates memory, so in my situation it was 'pay me now or pay me later'.

Transpose is an AbstractMatrix; programs that use the output of predict should not expect a Matrix in particular, and I don't think the package should guarantee that the return type is an Array. I believe using transpose in this case is entirely appropriate and an intended use case.
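For concreteness, a quick REPL check of that point (illustrative, not from the PR):

julia> t = transpose(ones(Float32, 3, 2));

julia> t isa Matrix{Float32}
false

julia> t isa AbstractMatrix{Float32}
true

julia> Matrix(t)   # materializing to a concrete Matrix copies the data
2×3 Matrix{Float32}:
 1.0  1.0  1.0
 1.0  1.0  1.0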

That said, it's not great that the output of predict is not type stable (it already is not, since it is an Array when rank-1). However, we don't have many options here, since we can't control the memory layout of the prediction output; and as I said earlier, since the return is already type-unstable, it might be silly to still be worrying about it.

I do think perhaps we should provide functions which are type-stable even if predict is not. That said, there should only be one method that calls XGBoosterPredictFromDMatrix, any other methods can depend on that.
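As a very rough sketch of that idea (hypothetical names, not proposing these exact signatures; everything still funnels through the single predict method that calls Lib.XGBoosterPredictFromDMatrix):

# thin, type-stable convenience wrappers around the one core predict call
function predictvector(bst, X)::Vector{Float32}
    vec(predict(bst, X))
end

function predictmatrix(bst, X; type=2)::Matrix{Float32}
    Matrix(predict(bst, X; type=type))
end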

Suffice it to say, I see two options; I have not decided which is preferable:

We should probably do the latter but again, it's not like predict will ever be truly type-stable so maybe this is just me being silly and pedantic.