ML.Ranking LightGBM - Getting error "Value cannot be null. Parameter name: items"

dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.

https://dot.net/ml

MIT License

9.04k stars 1.89k forks source link

ML.Ranking LightGBM - Getting error "Value cannot be null. Parameter name: items" #5022

Open abindh opened 4 years ago

abindh commented 4 years ago

System information

Windows:
.net Core 3.0:

Issue

I am trying to generate a simple ranking of candidates based on a few features for a recruitment application.
But when running the trainer, I get the following message which is not clear - "Value cannot be null. Parameter name: items"

Source code / logs

Capture

//Training Pipeline
            IEstimator<ITransformer> dataPipeline = mlContext.Transforms.Categorical.OneHotEncoding("HIGHESTEDUCATION", "HIGHESTEDUCATION")
                .Append(mlContext.Transforms.Categorical.OneHotEncoding("SOURCE", "SOURCE"))
                .Append(mlContext.Transforms.Text.FeaturizeText("SKILLSET", "SKILLSET"))
                .Append(mlContext.Transforms.Categorical.OneHotEncoding("TOWNCITY", "TOWNCITY"))
                .Append(mlContext.Transforms.Categorical.OneHotEncoding("YEARSOFEXPERIENCE", "YEARSOFEXPERIENCE"))
                .Append(mlContext.Transforms.Concatenate("Features", "HIGHESTEDUCATION", "SKILLSET", "SOURCE", "TOWNCITY", "YEARSOFEXPERIENCE"))
                .Append(mlContext.Transforms.Conversion.MapValueToKey("Label","Label"))
                .Append(mlContext.Transforms.Conversion.Hash("GroupId", nameof(Candidate.VACANCYID), numberOfBits: 20));

            // Set the LightGBM LambdaRank trainer.
            IEstimator<ITransformer> trainer = mlContext.Ranking.Trainers.LightGbm(labelColumnName: "Label", featureColumnName: "Features", rowGroupColumnName: "GroupId"); 
            IEstimator<ITransformer> trainerPipeline = dataPipeline.Append(trainer);

// Domain Model
public class Candidate
    {
        [LoadColumn(0)]
        public string HIGHESTEDUCATION { get; set; }

        [ColumnName("Label"),LoadColumn(1)]
        public Single RELEVANCESCORE { get; set; }

        [LoadColumn(2)]
        public string SKILLSET { get; set; }

        [LoadColumn(3)]
        public string SOURCE { get; set; }

        [LoadColumn(4)]
        public string TOWNCITY { get; set; }

        [ ColumnName("GroupId"), LoadColumn(5)]
        public string VACANCYID { get; set; }

        [LoadColumn(6)]
        public string YEARSOFEXPERIENCE { get; set; }

    }

Please paste or attach the code or logs or traces that would be helpful to diagnose the issue you are reporting.

antoniovs1029 commented 4 years ago

Hi. Can you please include the complete stack trace?

Also, if you could share a .zip containing your project and some sample data would make it easier to reproduce your problem, and also have a look on how are you loading the data. Thanks!

antoniovs1029 commented 4 years ago

So I got to reproduce your issue, by taking the data on your screenshot. The stack trace was as follows, and I will now look into this.

   at System.Collections.Immutable.Requires.FailArgumentNullException(String parameterName)
   at System.Collections.Immutable.ImmutableArray.Create[T](T[] items, Int32 start, Int32 length)
   at Microsoft.ML.Trainers.FastTree.RegressionTreeBase..ctor(InternalRegressionTree tree)
   at Microsoft.ML.Trainers.FastTree.RegressionTree..ctor(InternalRegressionTree tree)
   at Microsoft.ML.Trainers.FastTree.TreeEnsembleModelParametersBasedOnRegressionTree.<>c.<CreateTreeEnsembleFromInternalDataStructure>b__5_0(InternalRegressionTree tree)
   at System.Linq.Enumerable.SelectListIterator`2.ToList()
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Microsoft.ML.Trainers.FastTree.TreeEnsemble`1..ctor(IEnumerable`1 trees, IEnumerable`1 treeWeights, Double bias)
   at Microsoft.ML.Trainers.FastTree.RegressionTreeEnsemble..ctor(IEnumerable`1 trees, IEnumerable`1 treeWeights, Double bias)
   at Microsoft.ML.Trainers.FastTree.TreeEnsembleModelParametersBasedOnRegressionTree.CreateTreeEnsembleFromInternalDataStructure()
   at Microsoft.ML.Trainers.FastTree.TreeEnsembleModelParametersBasedOnRegressionTree..ctor(IHostEnvironment env, String name, InternalTreeEnsemble trainedEnsemble, Int32 numFeatures, String innerArgs)
   at Microsoft.ML.Trainers.LightGbm.LightGbmRankingModelParameters..ctor(IHostEnvironment env, InternalTreeEnsemble trainedEnsemble, Int32 featureCount, String innerArgs)
   at Microsoft.ML.Trainers.LightGbm.LightGbmRankingTrainer.CreatePredictor()
   at Microsoft.ML.Trainers.LightGbm.LightGbmTrainerBase`4.TrainModelCore(TrainContext context)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Issue5022.Program.Main(String[] args) in C:\Users\anvelazq\source\repos\Bugs\Issue5022\Program.cs:line 55

antoniovs1029 commented 4 years ago

So this seems to be indeed a bug on ML.NET. What happens is that the Tree that is returned by LightGBM has only 1 node, which is a leave, and whose value is 0. Because of this, the next code will leave tree = new InternalRegressionTree(2); (i.e. leafOutput[0] is 0).

https://github.com/dotnet/machinelearning/blob/41c5fc34f30f46541235369064fb5c9ccd3c6587/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs#L264-L275

So this InternalRegressionTree constructor is used; notice that RawThresholds is never initialized: https://github.com/dotnet/machinelearning/blob/41c5fc34f30f46541235369064fb5c9ccd3c6587/src/Microsoft.ML.FastTree/TreeEnsemble/InternalRegressionTree.cs#L82-L96

Then when creating the RegressionTree here, it tries to use the RawThresholds array, which was never initialized, and it throws an exception saying that it's null: https://github.com/dotnet/machinelearning/blob/41c5fc34f30f46541235369064fb5c9ccd3c6587/src/Microsoft.ML.FastTree/RegressionTree.cs#L158-L171

I think the solution to this bug should be straight forward, so I will open a PR solving this.

antoniovs1029 commented 4 years ago

Still, it's noticeable that I get Trees with only one leave with "0" when training your data. It means that even after fixing this exception, everything you use as input of the predictor, will predict that your input is of rank "0". I'm not sure what the interpretation of this would be since in your training data you only have ranks from 1 to 4... perhaps that you need more data or play around with other parameters or preprocessing? @najeeb-kazmi any idea on this regard?

abindh commented 4 years ago

Firstly thank you taking time to look at this. Sorry for the delayed reply - I did not have access to internet for last couple of days.

Looking at your comments, I have now scaled the data to 100 rows and this works without any errors/exceptions. It would be good to get a better error message at least and also some details on minimum number of rows in training data to rank - I just din't know what to do when I got that error.

dasokolo commented 4 years ago

Thanks everyone for providing the details about this issue. I am also experiencing the same problem. @antoniovs1029 do you have the ETA for the fix?

antoniovs1029 commented 4 years ago

TL;DR: This isn't really a bug, and if anything what needs to be fixed is the exception that is thrown, for it to be more understandable to the user. If someone encounters this scenario, then there are 3 main things to try out: use a bigger dataset, set the MinimumExampleCountPerLeaf parameter to a smaller value, and/or presort your dataset by GroupId before passing it to ML.NET. If the problem persists, then it means that the LightGBM library/algorithm is unable to produce a relevant model for your given dataset, and it's better to explore other ranking algorithms in ML.NET.

Hi @dasokolo .

ML.NET provides a wrapper to the LightGBM library found here: https://github.com/Microsoft/LightGBM

As explained above, the problem on this issue is that LightGBM returns an empty tree as output of running a ranking task over the dataset. ML.NET doesn't handle very well this case of receiving an empty tree from LightGBM and throws a confusing exception. So the only thing I was planning to do was to throw a more readable exception, or return the empty tree. As mentioned above returning the empty tree won't be of much use either, since the output of such a tree is always 0 (i.e. not matter what input you use to your model, the prediction will be 0). So there's no ETA for a fix, as there's nothing to be fixed except for the exception message... the situation is simply that the LightGBM library / algorithm, returns an empty tree.

After talking with the maintainers of the LightGBM repo, I was told that, generally, LightGBM would return an empty tree for ranking, if the dataset isn't big enough. Nonetheless, there's no exact number of what "enough means", as it would be dependent of your specific dataset. Also, through experimentation, I've found that this exception might also occur if there are not enough samples per each GroupId... again, what "enough means" will change depending on the dataset.

Another thing that you can try if you're facing this problem, is playing around with the LightGBM parameters, particularly set MinimumExampleCountPerLeaf to a small number if the dataset is small.

Finally, after reading the code for how ML.NET preprocess the data before sending it to LightGBM for Ranking (link) I found out that ML.NET doesn't fully respect the GroupId provided by the user, and the only way for it to be fully respected, is if the user actually sort their dataset by GroupId before passing it to ML.NET. (Notice that ML.NET doesn't provide methods to sort dataviews, so the sorting will have to happen before passing it to ML.NET). This behavior of ML.NET of not fully respecting GroupIds isn't a bug, but actually it is intended and it's done for performance reasons while integrating with LightGBM's library. In general what happens is that if you have a dataset with 10 rows, with the following groupIds:

[1,1,1,2,2,1,1, 2, 2, 2]

ML.NET will actually only respect contiguous groupIds; i.e., the first 3 rows will belong to the first groupId, the next 2 rows will use to their own groupId, the next 2 rows will have yet another groupId, and the last 3 rows will belong to their own groupId. E.g, basically the GroupId's will look like this:

[1,1,1,2,2,3,3,4,4,4]

Since having few samples per each GroupId might also be related to LightGBM returning an empty tree for ranking, then presorting the dataset will help because then the input GroupIds will look like this:

[1,1,1,1,1,2,2,2,2,2]

even after ML.NET preprocess them.

justinormont commented 4 years ago

Yep, as stated, supporting only contiguous runs of GroupIDs is intentional. Beyond allowing for speed and streaming, this allows for the reuse of GroupIDs later in the dataset.

This is used when another user (or the same one later) reruns the same query. The hashed version of the query text is often used as the GroupID. Runs of GroupIDs are assumed to be from the same set of query results, and therefore are used to train the model and to calculate the NDCG metric.

In AutoML, to implement ranking, we can protect the user by checking the average run length of GroupIDs. We can do that in the initial sample of the dataset created to determine the ColumnPurpose of each column. Actual check would be: ensure the average number of rows in a run of GroupIDs is well over 1.0, and well below the size of the dataset. We can also check that the number of GroupIDs doesn't vary heavily across CV splits.

We may want to document the need of contiguous GroupIDs. A ranking dataset should remain in order; if shuffled, the GroupID must be respected and remain contiguous. Keeping the GroupID the same as the SamplingKeyColumn let the metrics remain useful by ensuring all parts of the same candidate query results set remain in a single split of the TrainTest or CV splits, and more strongly so if the query text is the origin of the GroupID column.

dasokolo commented 4 years ago

@antoniovs1029 & @justinormont, thanks for your prompt and detailed replies. Very much appreciated!

I was not very precise, when I wrote that I experience the same issue. I see the same error NullReferenceException in the same stack trace, but it happens when I do binary classification, not the ranking. Sorry about the confusion.

I myself have to work with user provided data in my service. I have no control over that data. So, I have to make a decision when to use LightGBM and when rely on other means.

It would be very helpful if you could provide a clear requirements of LightGBM for the training dataset (i. e. when it is guaranteed to work without exceptions).

Also as a longer term solution I would definitely recommend throwing some appropriate exception with the detailed error message instead of NullReferenceException and also documenting this requirement.

justinormont commented 4 years ago

@dasokolo: You may want to try the AutoML for binary classification (or multi-class). It's robust to sets of hyperparameters which fail for your dataset. It will continue and try other hyperparameters sets and trainers, with the goal of optimizing the metric you specify.

I myself have to work with user provided data in my service. I have no control over that data.

AutoML is also rather more robust to the changing data, as it will automatically fill in missing value and if the data changes significantly, can alter the feature engineering pipeline to match the data. Currently the column purpose detection is done once at the start of the process, though should the user's data change types or the stats on the column change a lot, the column can change purposes.

antoniovs1029 commented 4 years ago

Thanks for reporting this, @dasokolo . I hadn't encountered this exception when working with binary classification, so it's good to know it can happen also there. But I wonder, are you able to share with us the dataset and/or a subset of the dataset where this exception occur so that can I reproduce this? If not, can you please tell us the size of the dataset (MBs, number of rows and columns)?

In the case of Ranking from the original poster of this issue, and in my experimentations, it was the case that LightGBM returned only an empty tree (typically you'd expect as many trees as NumberOfIterations used in the LightGBM Options, but since ranking failed to produce a meaningful model, it returned only 1 tree and it was empty). In your case, I'm eager to know if you're actually getting many trees, one of which is empty... or if you are only getting 1 tree and it's empty. To confirm this I'd need to step into the code of ML.NET with your dataset in here, to know the number and shape of the trees: https://github.com/dotnet/machinelearning/blob/41c5fc34f30f46541235369064fb5c9ccd3c6587/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs#L189-L198

If it's actually returning multiple non-empty trees, then this actually a bug, and it should be straight forward to fix by removing the empty trees that cause the exception. If it is only returning empty trees, then the best I can do is to recommend trying to add more data into the dataset or play around with the LightGBM parameter I've mentioned.

It would be very helpful if you could provide a clear requirements of LightGBM for the training dataset (i. e. when it is guaranteed to work without exceptions).

Unfortunately, I don't think we're able to do that, as it would is dependent of the dataset.

Also as a longer term solution I would definitely recommend throwing some appropriate exception with the detailed error message instead of NullReferenceException and also documenting this requirement.

Will do in an upcoming PR.

dasokolo commented 4 years ago

@antoniovs1029 The classification error happens on very small datasets (~8 examples). Each example has a 256-dimensional float vector as a feature. I never saw this error on larger datasets.

antoniovs1029 commented 4 years ago

Oh, I see. Then I guess it's certainly this the problem, as your datasets are very small to produce a meaningful (i.e. non empty) model.

justinormont commented 4 years ago

@dasokolo: I agree w/ @antoniovs1029, you'll want a larger dataset. AutoML may make a model on your tiny dataset, though it's unlikely to be useful. You can see its row limits here: https://github.com/dotnet/machinelearning-modelbuilder/issues/638#issuecomment-614425998

As the users provide these tiny datasets, if your privacy rules allow, and each user's dataset is similar, you could merge multiple user's datasets to learn across multiple users' data and add a categorical feature column indicating which user the data came from. The categorical column would allow the merged model to customize its answer for each user, while learning across users letting the model learn shared information between users.

antoniovs1029 commented 4 years ago

I'm changing this P1 to P3 as it is not a bug for the reasons explained here some months ago (TL;DR, this is expected behavior from the LightGBM algorithm). Also, as explained on that comment and also in other comments of this thread, the only work remaining would be to throw a more understandable exception message (which I consider to be P3, documentation issue), to somehow give the advice also given on this thread.