fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

negative_binomial not supported error, Python version 1.2.7.1 (from PyPI) #131

Closed by m-haines 7 months ago

m-haines commented 7 months ago

Further to an earlier issue I posted (which was user error), I thought I would try out a few of the other likelihoods on a simple toy model, just so that I know what I am doing in the future. The code is roughly as before, except that the response variable, y, is converted to the int type before this code block:

        coords_train = np.array([x_train["Northings"].to_numpy(), x_train["Eastings"].to_numpy()])

        gp_model = gpb.GPModel(gp_coords=coords_train.transpose(), cov_function="exponential",
                likelihood="negative_binomial, gp_approx="vecchia")

        x_train = x_train.drop(["Northings", "Eastings"], axis=1)
        data_train = gpb.Dataset(x_train, y_train)  # y_train int for neg binom

        params = {'lambda_l2': 1, 'learning_rate': 0.01,
                  'max_depth': 3, 'min_data_in_leaf': 20,
                  'num_leaves': 2**10, 'verbose': 0}

        mod = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=247)

However, the negative_binomial gives the following error:

[GPBoost] [Fatal] Likelihood of type 'negative_binomial' is not supported.

I think this might be because in

/GPBoost/src/LightGBM/objective/regression_objective.hpp,

around line 212, the likelihood type "negative_binomial" is not listed, which leads to the error message given.

void ConvertOutput(const double* input, double* output) const override {
            if (has_gp_model_) {
                // Note: this is needed for calculation/evaluation of metrics
                // This is done directly here and not via the re_model_ and its likelihood to avoid overhead
                if (likelihood_type_ == std::string("gaussian")) {
                    output[0] = input[0];
                }
                else if (likelihood_type_ == std::string("bernoulli_probit")) {
                    output[0] = GPBoost::normalCDF(input[0]);
                }
                else if (likelihood_type_ == std::string("bernoulli_logit")) {
                    output[0] = 1. / (1. + std::exp(-input[0]));
                }
                else if (likelihood_type_ == std::string("poisson") ||
                    likelihood_type_ == std::string("gamma")) {
                    output[0] = std::exp(input[0]);
                }
                else {
                    Log::Fatal("ConvertOutput: Likelihood of type '%s' is not supported.", likelihood_type_.c_str());
                }
            }
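For reference, the branch logic quoted above amounts to applying the inverse link function for each likelihood. A minimal Python sketch follows; note that the negative_binomial branch here is my own assumption (like poisson and gamma, it presumably uses a log link), not something the snippet above confirms:

```python
import math

def convert_output(latent: float, likelihood: str) -> float:
    """Apply the inverse link function to a latent prediction,
    mirroring the C++ branches quoted above.

    The 'negative_binomial' branch is an assumption: like 'poisson'
    and 'gamma' it presumably uses a log link, so the inverse link
    would be exp().
    """
    if likelihood == "gaussian":
        return latent  # identity link
    if likelihood == "bernoulli_probit":
        # standard normal CDF, expressed via the error function
        return 0.5 * (1.0 + math.erf(latent / math.sqrt(2.0)))
    if likelihood == "bernoulli_logit":
        return 1.0 / (1.0 + math.exp(-latent))  # logistic function
    if likelihood in ("poisson", "gamma", "negative_binomial"):
        return math.exp(latent)  # log link
    raise ValueError(
        f"ConvertOutput: Likelihood of type '{likelihood}' is not supported.")
```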

Is this intentional? Although once again, it could be user error or a lack of understanding on my part.

Thank you once again for your help.

fabsig commented 7 months ago

Thanks for reporting this. I can reproduce this when setting the option use_gp_model_for_validation=False in the gpb.cv() or grid_search_tune_parameters() function. A fix for this is on GitHub and will be on PyPI soon.

Note that it is not recommended to use use_gp_model_for_validation=False. Rather use use_gp_model_for_validation=True, since you will likely also use the gp_model part for making predictions.

m-haines commented 7 months ago

Thank you for the information and the advice on use_gp_model_for_validation=True; I have been using that, as I do use the gp_model part for predictions.

I should have mentioned this before, but are you able to reproduce it when initiating a standalone GPModel class with the negative_binomial option, as given below? That is when the error occurs for me.

gp_model = gpb.GPModel(gp_coords=coords_train.transpose(), cov_function="exponential",
                                            likelihood="negative_binomial, gp_approx="vecchia")

Although from what you have said, I think the fix should resolve the issue. I look forward to trying it soon.

fabsig commented 7 months ago

I can only reproduce this error when I set use_gp_model_for_validation=False and e.g. metric="mse". Also, the corresponding code in regression_objective.hpp should only be called when use_gp_model_for_validation=False. Note that you are missing a quotation mark in your code. It should be:

gp_model = gpb.GPModel(gp_coords=coords_train.transpose(), cov_function="exponential",
                                            likelihood="negative_binomial", gp_approx="vecchia")
fabsig commented 7 months ago

I just realized that your error message [GPBoost] [Fatal] Likelihood of type 'negative_binomial' is not supported. is not from the file regression_objective.hpp. There it would say: [GPBoost] [Fatal] ConvertOutput: Likelihood of type 'negative_binomial' is not supported.

That leaves me a little puzzled, as the reason must be something else. Can you provide a reproducible example, including data?

m-haines commented 7 months ago

I can, yes. The example below reproduces the error on my machine. It uses the house_sales data from geodatasets. I have put it together quickly, so the negative binomial might be a poor choice of likelihood for the dataset.

import geopandas  # version 0.14.2
import geodatasets  # version 2023.12.0
import gpboost as gpb  # version 1.2.7.1
import numpy as np  # version 1.26.3

home_sales = geopandas.read_file(geodatasets.get_path("geoda.home_sales"))
home_sales_coords = home_sales.get_coordinates()
home_sales["x"] = home_sales_coords["x"]
home_sales["y"] = home_sales_coords["y"]

# Remove duplicate coord values from home_sales so non-Gaussian likelihood can be fitted by GPBoost,
# keeping the more expensive house at that location
home_sales = home_sales.sort_values(by=['price'])
home_sales = home_sales.drop_duplicates(subset=["x", "y"], keep='last')
home_sales = home_sales.sort_index()
home_sales = home_sales.reset_index(drop=True)

coords_train = np.array([home_sales["x"].to_numpy(), home_sales["y"].to_numpy()])

gp_model = gpb.GPModel(gp_coords=coords_train.transpose(), cov_function="exponential",
                       likelihood="negative_binomial", gp_approx="vecchia")

x_train = home_sales[["bedrooms", "bathrooms", "sqft_liv", "sqft_lot", "floors", "view"]]
y_train = home_sales["price"]
data_train = gpb.Dataset(x_train, y_train)

params = {'lambda_l2': 1, 'learning_rate': 0.01,
          'max_depth': 3, 'min_data_in_leaf': 20,
          'num_leaves': 2**10, 'verbose': 0}

mod = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=247)

If it helps, using GitHub search, the only other place I can find that mentions a "Likelihood of type" error is ./include/GPBoost/likelihoods.h, lines 86 and 187; an example for context is below:

        Likelihood(string_t type,
            data_size_t num_data,
            data_size_t num_re,
            bool has_a_vec,
            bool use_Z_for_duplicates,
            const data_size_t* random_effects_indices_of_data) {
            string_t likelihood = ParseLikelihoodAlias(type);
            likelihood = ParseLikelihoodAliasGradientDescent(likelihood);
            if (SUPPORTED_LIKELIHOODS_.find(likelihood) == SUPPORTED_LIKELIHOODS_.end()) {
                Log::REFatal("Likelihood of type '%s' is not supported.", likelihood.c_str());
            }

However, it looks as though "negative_binomial" is in the list SUPPORTED_LIKELIHOODS_, so I didn't mention it yesterday, as I didn't think that was the issue.
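For context, the constructor check quoted above reduces to a set-membership test after alias parsing. A minimal Python paraphrase (the set contents and aliases here are illustrative stand-ins, not the exact C++ ones):

```python
# Illustrative subset of likelihood names; the real set is compiled
# into the installed binary, so an older wheel may lack entries that
# the current source on GitHub already lists.
SUPPORTED_LIKELIHOODS = {
    "gaussian", "bernoulli_probit", "bernoulli_logit",
    "poisson", "gamma", "negative_binomial",
}

# Assumed example aliases, standing in for ParseLikelihoodAlias.
ALIASES = {"binary": "bernoulli_probit", "regression": "gaussian"}

def check_likelihood(name: str) -> str:
    likelihood = ALIASES.get(name, name)
    if likelihood not in SUPPORTED_LIKELIHOODS:
        raise ValueError(
            f"Likelihood of type '{likelihood}' is not supported.")
    return likelihood
```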

fabsig commented 7 months ago

Version 1.2.7.1 does not yet support the negative binomial likelihood. This is a relatively new feature and, unfortunately, I have not released any updates on PyPI for some time. As of today, version 1.3.0 is on PyPI which supports the negative binomial likelihood.
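A quick, stdlib-only way to confirm which release is actually installed (the '1.3.0' expectation comes from the comment above):

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# After `pip install --upgrade gpboost`, installed_version("gpboost")
# should report '1.3.0' or later.
```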

m-haines commented 7 months ago

Apologies for the slow reply; I was away for a few days. All is working now, thank you for deploying 1.3.0 to PyPI.