dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

OnlineGradientDescent throws exception #2407

Closed PeterPann23 closed 5 years ago

PeterPann23 commented 5 years ago


Issue

What did I do

I configured the estimator chain like so:

var dataProcessPipeline = mlContext.Transforms.CopyColumns("predictField", "Label")
.Append(mlContext.Transforms.Normalize(inputName: "SH1", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
.Append(mlContext.Transforms.Normalize(inputName: "SL1", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
… 665 more
.Append(mlContext.Transforms.Normalize(inputName: "SH9", mode: NormalizingEstimator.NormalizerMode.MeanVariance))
.Append(mlContext.Transforms.Concatenate("Features","SH1",..."SH9"));
dataProcessPipeline.AppendCacheCheckpoint(mlContext);
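// Note: estimator chains are immutable, so AppendCacheCheckpoint returns a new
// chain rather than modifying dataProcessPipeline in place; as written above,
// its result is discarded. Capturing it would look like:
//   dataProcessPipeline = dataProcessPipeline.AppendCacheCheckpoint(mlContext);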

Previously I had 119 data points in the model and had no error.

I test the models based on a parameter telling it which network to learn; the item causing the error is this:

else if (Definition.MachineLearningMethod == AI.ML.Factory.MachineLearningMethods.OnlineGradientDescent)
{
    var trainer = mlContext.Regression.Trainers.OnlineGradientDescent(
        labelColumn: "Label",
        featureColumn: "Features",
        advancedSettings: a =>
        {
            a.DecreaseLearningRate = true;
            a.DoLazyUpdates = true;
            a.NormalizeFeatures = NormalizeOption.Yes;
            a.DecreaseLearningRate = true;
            a.Caching = Microsoft.ML.EntryPoints.CachingOptions.Memory;
        });
    var trainingPipeline = dataProcessPipeline.Append(trainer);
    return trainingPipeline.Fit(trainingDataView);
}

After that, I think, the .NET Framework throws an error in my running test (no debugger attached):

Managed Debugging Assistant 'ContextSwitchDeadlock' The CLR has been unable to transition from COM context 0x248b5058 to COM context 0x248b5180 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.

Source code / logs


[Source=NormalizingEstimator; RowToRowMapperTransform; Cursor, Kind=Trace] Channel finished. Elapsed 00:04:53.5139276.
[Source=NormalizingEstimator; RowToRowMapperTransform; Cursor, Kind=Trace] Channel disposed
[Source=ColumnConcatenatingEstimator ; RowToRowMapperTransform; Cursor, Kind=Trace] Channel finished. Elapsed 00:04:53.4765765.
[Source=ColumnConcatenatingEstimator ; RowToRowMapperTransform; Cursor, Kind=Trace] Channel disposed
[Source=ColumnConcatenatingEstimator ; RowToRowMapperTransform; Cursor, Kind=Trace] Channel finished. Elapsed 00:04:53.4197884.
[Source=ColumnConcatenatingEstimator ; RowToRowMapperTransform; Cursor, Kind=Trace] Channel disposed
[Source=Stochastic Gradient Descent (Regression); Training, Kind=Trace] 2/4/2019 2:59:47 PM Finished training iteration 1; iterated over 3412517 examples.
[Source=Stochastic Gradient Descent (Regression); Training, Kind=Trace] Channel finished. Elapsed 00:04:56.6368673.
[Source=Stochastic Gradient Descent (Regression); Training, Kind=Trace] Channel disposed

Exception OnlineGradientDescent: The weights/bias contain invalid values (NaN or Infinite). Potential causes: high learning rates, no normalization, high initial weights, etc.
Exception: The weights/bias contain invalid values (NaN or Infinite). Potential causes: high learning rates, no normalization, high initial weights, etc.
testhost.exe Error: 0 : The weights/bias contain invalid values (NaN or Infinite). Potential causes: high learning rates, no normalization, high initial weights, etc.

The full log is attached: Learning exception.zip

Ivanidzo4ka commented 5 years ago

Unrelated to your question, but worth mentioning: you can combine your features with a single concat and apply normalization to the result, rather than normalizing each column and then concatenating. In the first case we need to pass through the data once; in the second, once per normalized column, so nine times in your case.

You do use advanced options, but some of them are actually unrelated to the trainer and became hidden in 0.10, such as Caching and NormalizeFeatures. DecreaseLearningRate is turned on for this trainer by default, so changing it makes no difference.
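
In other words, the trainer call reduces to something like this (a sketch against the same advancedSettings callback; of the options you set, only DoLazyUpdates still has an effect):

var trainer = mlContext.Regression.Trainers.OnlineGradientDescent(
    labelColumn: "Label",
    featureColumn: "Features",
    advancedSettings: a =>
    {
        // DecreaseLearningRate is already the default, and Caching /
        // NormalizeFeatures are no longer surfaced through the trainer.
        a.DoLazyUpdates = true;
    });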

I've run this locally on some internal datasets, and as soon as I use MeanVar normalization I get awful models, mostly because this algorithm expects features to be in the [0;1) range, while MeanVar can produce values outside it.

So my guess would be that we simply run into an overflow because of the unexpected normalization, and one of the model's weights or its bias becomes an Infinite value. Can you try min-max normalization instead?
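
For example, a minimal sketch using the same Normalize overload from your pipeline, just concatenated first and switched to the MinMax mode:

// Min-max keeps every feature in [0,1], which is what this trainer
// expects; MeanVariance does not bound the output range.
var pipeline = mlContext.Transforms.Concatenate("Features", "SH1", /* ... */ "SH9")
    .Append(mlContext.Transforms.Normalize(
        inputName: "Features",
        mode: NormalizingEstimator.NormalizerMode.MinMax));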

PeterPann23 commented 5 years ago

Hi Ivan

Yes, I tried the MinMax option; it did not solve my issue, and none of the values in my training data are anything but valid floats.

Ivanidzo4ka commented 5 years ago

Sorry for the late response. Any way you can share your data and pipeline? I would suggest trying different learning rates and other parameters. The learner simply refuses to converge on your data space, and it's hard to say why that's happening without looking at the data.
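
If it helps, this is the kind of thing I would try first (a sketch; LearningRate and NumIterations are my assumptions about the option names the trainer's settings expose, so adjust to your version):

var trainer = mlContext.Regression.Trainers.OnlineGradientDescent(
    labelColumn: "Label",
    featureColumn: "Features",
    advancedSettings: a =>
    {
        // Assumed option names: a smaller step makes NaN/Infinity overflow
        // in the weights less likely, at the cost of slower convergence.
        a.LearningRate = 0.01f;
        a.NumIterations = 10;
    });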

PeterPann23 commented 5 years ago

Sure,

I have reduced the utility code that manages the BL in my app to something that "makes sense"; the pipeline is attached. There is a limitation on attachment size, so the data will be an issue, as it is 13,803,613 KB. I can't generate a binary version of the training file, as doing so blue-screens my workstation.

pipelibe.txt

I will update and provide a link to the training file once it has uploaded to the cloud.

PeterPann23 commented 5 years ago

Hi Ivan

Actually, I have 665 columns; how would I go about combining them and then normalizing them?

Peter

Ivanidzo4ka commented 5 years ago
var data = mlContext.Data.ReadFromEnumerable(enumerableOfData);
// Every visible, non-label column name from the schema.
var allNonLabelColumns = data.Schema.Where(x => x.Name != "Label" && !x.IsHidden).Select(x => x.Name);
// Concatenate into one "Features" vector, then normalize that single
// column (min-max by default) in one pass over the data.
var pipeline = mlContext.Transforms.Concatenate("Features", allNonLabelColumns.ToArray())
    .Append(mlContext.Transforms.Normalize("Features"));

Feel free to play with the Schema object; you can do various filtering over it: by type, by name, or by index position.
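
For example, filtering by type as well as by name (treat NumberType.R4, the single-precision float type in this API version, as an assumption to check against your version):

// Keep only the float columns, excluding the label and anything
// hidden by earlier transforms.
var floatColumns = data.Schema
    .Where(x => x.Name != "Label" && !x.IsHidden && x.Type == NumberType.R4)
    .Select(x => x.Name)
    .ToArray();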

Ivanidzo4ka commented 5 years ago

Notice that I combine all non-label columns into one column, and only after that do I apply normalization to it. This is the proper template for processing data. If you do it the opposite way, we need to fetch one column from your source, process it, then go to another column and so on; with 665 columns, we would iterate over your dataset 665 times. If we combine them into one column first, we make only one pass over the data.

PeterPann23 commented 5 years ago

Thanks, will test the output and see if I get similar results.