bigdong89 / xgboostExtension

xgboost Extension for Easy Ranking & TreeFeature
Apache License 2.0

Scorer super init, should be dict #8

Open · tias opened this issue 6 years ago

tias commented 6 years ago

https://github.com/bigdong89/xgboostExtension/blob/18d4b8bb5fc0dfb13f2b2af590fd5adc7f5232c8/xgboostextension/scorer/__init__.py#L25

On the above line, the argument should be {} instead of None; otherwise pickling (for parallel CV) will still not work.
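
For context, a minimal sketch of why the argument must be a dict (a simplified stand-in for sklearn's _BaseScorer, not its actual source): the stored kwargs are later expanded with ** and iterated as a mapping, both of which fail on None.

    class _BaseScorer:
        def __init__(self, score_func, sign, kwargs):
            self._score_func = score_func
            self._sign = sign
            # stored verbatim; sklearn later treats this as a mapping
            self._kwargs = kwargs

        def __call__(self, estimator, X, y):
            # `**{}` is a harmless no-op, while `**None` raises TypeError
            return self._sign * self._score_func(
                y, estimator.predict(X), **self._kwargs
            )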

Also note that this functionality is not present in the pip packages; I think the classes are not registered for packaging or something similar.
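
If that is the cause, a plausible fix (a sketch; the actual setup.py may differ) is to let setuptools discover the subpackages instead of listing only the top-level package:

    from setuptools import setup, find_packages

    setup(
        name='xgboostextension',
        packages=find_packages(),  # also picks up xgboostextension.scorer
    )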

Finally, I find the multiple classes excessive; the same can be achieved with a single class in the style of:

import numpy as np
# _BaseScorer lives in sklearn.metrics.scorer in older scikit-learn
# releases and in sklearn.metrics._scorer in newer ones
from sklearn.metrics.scorer import _BaseScorer


class RankingScorer(_BaseScorer):
    def __init__(self, score_func, sign=1, k=None, stats=False):
        """
        Base class for applying scoring functions to ranking problems.
        This class transforms a ranking metric into a scoring function
        that can be applied to estimators whose feature matrix carries
        a group indicator in its first column.

        Parameters
        ----------
        sign : int, 1 to maximize the metric, -1 to minimize it
        k : int, optional cutoff for @k metrics
        stats : bool, True for scipy.stats metrics that return a
            (value, significance) tuple, of which only element [0] is used
        """
        super(RankingScorer, self).__init__(
            score_func,
            sign,
            {}
        )
        self.k = k
        self.stats = stats

    # First column of X has to be the group indicator
    def __call__(self, estimator, X, y, sample_weight=None):
        group_labels = X[:, 0]
        # per-group sizes and a sort order making groups contiguous
        (sizes, order_ixs) = group_col2len(group_labels)

        # reorder X and y (assumption of predictor)
        X = X[order_ixs, 1:]
        y = y[order_ixs]

        y_predicted = estimator.predict(X)
        return self.val(sizes, y, y_predicted)

    # also convenient for non-sklearn access
    def val(self, sizes, y_true, y_pred):
        # indices at which to split
        split_indices = np.cumsum(sizes)[:-1]

        # Split into a different set for each query
        y_true_split = np.split(y_true, split_indices)
        y_pred_split = np.split(y_pred, split_indices)

        # Go through each query and calculate the statistic's value
        results = np.zeros(sizes.shape[0])
        for i, (yg_true, yg_pred) in enumerate(zip(y_true_split, y_pred_split)):
            if self.k is None:
                r = self._score_func(yg_true, yg_pred)
            else:
                r = self._score_func(yg_true, yg_pred, self.k)

            if self.stats:
                r = r[0] # scipy.stats returns (val, significance)
            results[i] = r

        # Return the average over all statistics
        return self._sign * results.mean()
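
For completeness: group_col2len is assumed here to be a helper that returns the per-group sizes together with a sort order that makes rows of the same group contiguous. A minimal sketch (the name and contract are taken from the snippet above, not from the repository):

    import numpy as np

    def group_col2len(group_labels):
        # stable sort keeps the relative order of rows within a group
        order_ixs = np.argsort(group_labels, kind='stable')
        # counts come back in sorted-unique-label order, matching order_ixs
        _, sizes = np.unique(group_labels, return_counts=True)
        return sizes, order_ixs

Usage would then look roughly like the following, where ndcg_at_k(y_true, y_score, k) is a hypothetical metric function and X carries the group indicator in its first column:

    from sklearn.model_selection import cross_val_score
    from xgboostextension import XGBRanker

    scorer = RankingScorer(ndcg_at_k, sign=1, k=10)
    scores = cross_val_score(XGBRanker(), X, y, scoring=scorer)
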
FlorisHoogenboom commented 6 years ago

You are right about the first point! I've noticed that #7 introduced some other small bugs. I'll create a new PR to fix those.

Regarding the PyPI package: I think this new functionality has not yet been published to PyPI by @bigdong89.

For the last point, I would be fine with merging _make_grouped_metric into the Scorer class as you propose. I would support a PR for that.

bigdong89 commented 6 years ago

Going to publish it to PyPI.

bigdong89 commented 6 years ago

@FlorisHoogenboom
Going to add you as a collaborator on PyPI; I need your PyPI account name.

FlorisHoogenboom commented 6 years ago

Sure, FlorisHoogenboom

bigdong89 commented 6 years ago

The new package has been uploaded to PyPI. However, I have no Python 3 environment (see issue #10).