Important caveat converting DataFrame to DMatrix

bobaronoff commented 1 year ago

Perhaps this is self-evident to others, but it took me a few days to sort out. I have made good progress training with cross-validation and testing for generalization and performance. However, I was getting 'scrambled' results when scoring new 'unknowns'.

Long story short, when converting a DataFrame to a DMatrix for use with XGBoost.predict, the order of the DataFrame columns must match the order in the booster.feature_names property. In the 'predict' function, libxgboost appears to rely on the booster object to determine how to 'assign' each column of the prediction DMatrix, so columns are matched by position rather than by name.

This was not immediately evident because the train and test sets were split from the same initial DataFrame. New data is not guaranteed to have the same column order and needs explicit management; fortunately this is just a single select! statement.
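To make the reordering step concrete, here is a minimal sketch in plain Julia (no DataFrames or XGBoost needed). The helper `align_columns` and the feature names are hypothetical and are not part of XGBoost.jl. It only illustrates matching columns by name before handing data to a model that expects a fixed order.

```julia
# Hypothetical helper (not part of XGBoost.jl): reorder the columns of a plain
# matrix so they follow the feature order the model was trained with.
function align_columns(X::AbstractMatrix, colnames::Vector{String}, trained::Vector{String})
    # For each trained feature, find its position in colnames (or nothing).
    idx = indexin(trained, colnames)
    missing_cols = trained[isnothing.(idx)]
    isempty(missing_cols) ||
        error("prediction data missing columns: " * join(missing_cols, ","))
    return X[:, Int.(idx)]
end

trained  = ["age", "height", "weight"]   # stand-in for booster.feature_names
colnames = ["weight", "age", "height"]   # column order of the new data
X = [70.0 30.0 1.8;
     82.0 45.0 1.7]

Xaligned = align_columns(X, colnames, trained)
```

With a DataFrame, the same effect should be achievable with a `select` on the trained feature names, as described above.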

ExpandingMan commented 1 year ago

I agree this is problematic as it seems like column names should be respected. It's not quite obvious what would be the best way to fix this since the libxgboost objects don't store that kind of metadata, though Booster now stores feature names.

I agree that this should "just work" or, at worst, throw an error.

bobaronoff commented 1 year ago

just a thought .....

Looking at booster.jl and dmatrix.jl, it appears that the feature_names property of the Booster object is set by calling getfeaturenames(dm), which calls getfeatureinfo(dm), which calls libxgboost. It looks to me like the DMatrix does store the data column names. Perhaps XGBoost.predict could compare the booster's feature_names property to the result of getfeaturenames on the DMatrix object. If the names are out of order or some are missing, an error can be thrown. I sure wish libxgboost had an API manual :-)

I apologize for my sloppy code but this is the gist of what I am thinking.

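fnames1 = booster.feature_names   # names recorded when the booster was trained
fnames2 = getfeaturenames(dm)     # names stored in the prediction DMatrix
# Fail if the counts differ or any position holds a different name.
if length(fnames1) != length(fnames2) || any(fnames1 .!= fnames2)
    error("prediction data column names and/or order does not match booster")
end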

bobaronoff commented 1 year ago

The above suggestion is for scenarios where XGBoost.predict is called with a DMatrix object. In that case it can only throw an exception, not fix the problem. 'predict' can only fix the problem if it is called with (booster, df::DataFrame).

Possible code might include something like ...

xnames=setdiff(booster.feature_names,names(df))
if length(xnames)>0
   msg="prediction data missing columns: " * join(xnames,",")
   error(msg)
end
if length(booster.feature_names)==length(names(df)) && sum(booster.feature_names .!= names(df))==0
   dm=DMatrix(df)
   XGBoost.predict(booster,dm)
else
   df2=copy(df)
   select!(df2,Symbol.(xnames))
   dm=DMatrix(df2)
   XGBoost.predict(booster,dm)
end

The downside is the need to make a copy of the DataFrame so that XGBoost.predict does not alter the original. However, more than likely the user would need to make a copy anyway, unless they are fine with altering the original DataFrame. One way or another, the columns need to coincide in order to score.
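On the copying concern, one possible alternative is the non-mutating `select` from DataFrames.jl with `copycols=false`, which returns a new DataFrame in the requested column order while reusing the existing column vectors. This is only a sketch, assuming DataFrames.jl is installed. The data and `feature_names` (a stand-in for booster.feature_names) are made up for illustration.

```julia
using DataFrames

df = DataFrame(weight = [70.0, 82.0], age = [30.0, 45.0], height = [1.8, 1.7])
feature_names = ["age", "height", "weight"]   # stand-in for booster.feature_names

# Non-mutating select returns a new DataFrame; copycols=false reuses the
# existing column vectors instead of copying the underlying data.
df2 = select(df, feature_names; copycols = false)

names(df2)   # columns now follow feature_names
names(df)    # the original column order is left untouched
```

Because the columns are shared rather than copied, any later in-place change to the values in `df2` would also show up in `df`, so this fits best when the reordered frame is only read from.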

bobaronoff commented 1 year ago

I need to correct a typo in the above code (see the fourth line from the bottom):

# Features the booster expects but the DataFrame lacks.
xnames = setdiff(booster.feature_names, names(df))
if length(xnames) > 0
    msg = "prediction data missing columns: " * join(xnames, ",")
    error(msg)
end
if booster.feature_names == names(df)
    # Same names in the same order: convert directly.
    dm = DMatrix(df)
    XGBoost.predict(booster, dm)
else
    # Reorder (and drop extra columns) on a copy so the caller's df is untouched.
    df2 = copy(df)
    select!(df2, Symbol.(booster.feature_names))
    dm = DMatrix(df2)
    XGBoost.predict(booster, dm)
end

dmlc / XGBoost.jl

Important caveat converting DataFrame to DMatrix #155