dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

AutoML: 'System.OperationCanceledException' after upgrade to AutoML 0.20.1 #6706

Closed NiPersson closed 9 months ago

NiPersson commented 1 year ago

After upgrading to 0.20.1 I get an "Operation was canceled" exception some time into training. I tried tweaking the code but could not get rid of it, and it persists even in a simple example like the one below:

    ' Create MLContext
    Dim dataPath As String = "C:\taxi-fare-train.csv"
    Dim ctx As MLContext = New MLContext()

    ' Infer column information
    Dim columnInference As ColumnInferenceResults = ctx.Auto().InferColumns(dataPath, labelColumnName:="fare_amount", groupColumns:=False)

    ' Create text loader
    Dim loader As TextLoader = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions)

    ' Load data into IDataView
    Dim data As IDataView = loader.Load(dataPath)

    ' Split into train (80%), validation (20%) sets
    Dim trainValidationData As TrainTestData = ctx.Data.TrainTestSplit(data, testFraction:=0.2)

    'Define pipeline
    Dim Pipeline As SweepablePipeline = ctx.Auto().Featurizer(data, columnInformation:=columnInference.ColumnInformation).Append(ctx.Auto().Regression(labelColumnName:=columnInference.ColumnInformation.LabelColumnName))

    ' Create AutoML experiment
    Dim experiment As AutoMLExperiment = ctx.Auto().CreateExperiment()

    ' Configure experiment
    experiment.SetPipeline(Pipeline).SetRegressionMetric(RegressionMetric.RSquared, labelColumn:=columnInference.ColumnInformation.LabelColumnName).SetTrainingTimeInSeconds(60).SetDataset(trainValidationData)

    ' Run experiment
    Dim experimentResults As TrialResult = experiment.Run
    Dim model = experimentResults.Model

The exception is thrown at "Dim experimentResults As TrialResult = experiment.Run".

LittleLittleCloud commented 1 year ago

Have you tried adding more time or using SetMaxModelToExplore?

NiPersson commented 1 year ago

Yes, the only difference is that with a longer maximum training time or more models to explore, it takes longer for the exception to occur. It usually occurs close to the maximum allowed time, except when using only LightGBM; then the exception occurs within a few seconds.

Nicklas

LittleLittleCloud commented 1 year ago

Your code runs well on my end

image

I'm targeting the latest AutoML package, but 0.20.1 should also work. Can you share your stack trace, if available?

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <RootNamespace>ConsoleApp1</RootNamespace>
    <TargetFramework>net6.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.ML.AutoML" Version="0.21.0-preview.23266.6" />
  </ItemGroup>

</Project>

NiPersson commented 1 year ago

[image: screenshot of the stack trace]

LittleLittleCloud commented 1 year ago

The stacktrace looks fine. It's the expected stacktrace when time's up.

So the issue is that no trial runs to completion, no matter how much time you set with SetTrainingTimeInSeconds?

NiPersson commented 1 year ago

Hi, sorry for the late answer. The exception persists no matter how much time is set for training. What we have found is that it may be thread related. The sample I supplied above works in a clean, new project but not when it is used in our project. In the C# code examples I have seen, one usually uses:

await experiment.RunAsync();

which I guess translates to:

Dim t = experiment.RunAsync()
t.Wait()

in Visual Basic, but this still gives the same exception.
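A side note on that translation: `t.Wait()` is not an exact equivalent of `Await`. `Wait()` wraps a failure or cancellation in an AggregateException, while `Await` (or the blocking `GetAwaiter().GetResult()`) rethrows the original exception. The following is a minimal sketch of that difference. It uses `Task.Delay` with a cancellation token as a stand-in for the training call; the module and function names are hypothetical and no ML.NET API is involved, so it says nothing about what `RunAsync` itself does.

```vb
Imports System
Imports System.Threading
Imports System.Threading.Tasks

Module WaitVersusAwait
    ' Stand-in for a long-running call: a 30 s delay whose token is cancelled after 100 ms.
    Function CancelledWork() As Task
        Dim cts As New CancellationTokenSource(TimeSpan.FromMilliseconds(100))
        Return Task.Delay(TimeSpan.FromSeconds(30), cts.Token)
    End Function

    ' Blocks with Wait() and reports which exception type the caller observes.
    Function ObserveWithWait() As String
        Try
            CancelledWork().Wait()
            Return "none"
        Catch ex As AggregateException
            Return "AggregateException wrapping " & ex.InnerException.GetType().Name
        End Try
    End Function

    ' Blocks with GetAwaiter().GetResult(), which rethrows the original exception like Await does.
    Function ObserveWithGetResult() As String
        Try
            CancelledWork().GetAwaiter().GetResult()
            Return "none"
        Catch ex As OperationCanceledException
            Return ex.GetType().Name
        End Try
    End Function

    Sub Main()
        Console.WriteLine(ObserveWithWait())
        Console.WriteLine(ObserveWithGetResult())
    End Sub
End Module
```

Either way the cancellation still reaches the caller; only the exception type to catch differs, so a handler written for one blocking style may miss the exception under the other.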

LittleLittleCloud commented 1 year ago

@NiPersson How many training threads does your project kick off at startup?

NiPersson commented 1 year ago

We use only one thread for training, but it is a large and complex project.

LittleLittleCloud commented 1 year ago

@NiPersson How about the dataset? Is the dataset still \taxi-fare-train.tsv?

LittleLittleCloud commented 1 year ago

@NiPersson And did you dispose or call MLContext.Cancel anywhere in your project?

NiPersson commented 1 year ago

@NiPersson How about the dataset? Is the dataset still \taxi-fare-train.tsv?

I have tested with this dataset alongside what we normally use.

NiPersson commented 1 year ago

MLContext.Cancel

Does not RunAsync call it?

//
// Summary:
//     Run experiment and return the best trial result asynchronizely. The experiment
//     returns the current best trial result if there's any trial completed when ct
//     get cancelled, and throws System.TimeoutException with message "Training time
//     finished without completing a trial run" when no trial has completed. Another
//     thing needs to notice is that this function won't immediately return after ct
//     get cancelled. Instead, it will call Microsoft.ML.MLContext.CancelExecution to
//     cancel all training process and wait all running trials get cancelled or completed.
public async Task<TrialResult> RunAsync(CancellationToken ct = default(CancellationToken))

Other than that, it does not look like I do.

LittleLittleCloud commented 1 year ago

@NiPersson Can you try disabling LightGBM and training again? It looks like the exception was thrown from the LightGBM trainer. In AutoML 2.0 we added a CheckAlive checkpoint to the LightGBM trainer, and that might be the cause.

NiPersson commented 1 year ago

Already tested that... All trainers produce this exception

LittleLittleCloud commented 1 year ago

Then my best guess, without access to your project or a minimal reproducible example, is that the exception is caused by a time-out. Is your project compute-intensive? Have you tried setting a very long time budget?

NiPersson commented 1 year ago

One thing I just recalled: even with a "clean project", if you ran it from a thread, you also got the exception:

Task.Run(Sub()
             ' Call example code
         End Sub)

NiPersson commented 1 year ago

Then my best guess, without access to your project or a minimal reproducible example, is that the exception is caused by a time-out. Is your project compute-intensive? Have you tried setting a very long time budget?

I would say it is not compute-intensive. Before the update I used a maximum of 60 seconds for training. If I recall correctly, the longest I tested was a maximum of 30 minutes.

LittleLittleCloud commented 1 year ago

One thing I just recalled: even with a "clean project", if you ran it from a thread, you also got the exception:

Task.Run(Sub()
             ' Call example code
         End Sub)

Can you provide a minimal example of it? I'm not very familiar with VB.NET.

NiPersson commented 1 year ago

I can do that. Hopefully I have time in a few hours.

NiPersson commented 1 year ago

Hi, I had problems recreating it as I described above, but what I did find was that this code:

Imports System
Imports System.Data
Imports System.IO
Imports Microsoft.ML
Imports Microsoft.ML.AutoML
Imports Microsoft.ML.Data
Imports Microsoft.ML.DataOperationsCatalog

Module Module1
    Sub Main(args As String())

        Dim MLObject As MLObjects = New MLObjects

        MLObject.TrainModel()

    End Sub

    Public Class MLObjects

        Public Sub TrainModel()
            Dim dataPath As String = "C:\Aiolos\@TEST2\Data\MLNet\taxi-fare-train.csv"

            Dim ctx As MLContext = New MLContext()

            ' Infer column information
            Dim columnInference As ColumnInferenceResults = ctx.Auto().InferColumns(dataPath, labelColumnName:="fare_amount", groupColumns:=False)

            ' Create text loader
            Dim loader As TextLoader = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions)

            ' Load data into IDataView
            Dim data As IDataView = loader.Load(dataPath)

            ' Split into train (80%), validation (20%) sets
            Dim trainValidationData As TrainTestData = ctx.Data.TrainTestSplit(data, testFraction:=0.2)

            'Define pipeline
            Dim Pipeline As SweepablePipeline = ctx.Auto().Featurizer(data, columnInformation:=columnInference.ColumnInformation).Append(ctx.Auto().Regression(labelColumnName:=columnInference.ColumnInformation.LabelColumnName))

            ' Create AutoML experiment
            Dim experiment As AutoMLExperiment = ctx.Auto().CreateExperiment()

            ' Configure experiment
            experiment.SetPipeline(Pipeline).SetRegressionMetric(RegressionMetric.RSquared, labelColumn:=columnInference.ColumnInformation.LabelColumnName).SetTrainingTimeInSeconds(60).SetDataset(trainValidationData)

            ' Run experiment
            Dim experimentResults As TrialResult = experiment.Run
            Dim model = experimentResults.Model

            '' Run experiment
            'Dim t = experiment.RunAsync
            't.Wait()
            'Dim experimentResult As TrialResult = t.Result
            'Dim model = experimentResult.Model

        End Sub
    End Class
End Module

worked in a project targeting .NET Framework 4.8, but I could not make it work in a new .NET 7 project (I got the above-mentioned exception again).

LittleLittleCloud commented 1 year ago

That's an interesting find! @ericstj @JakeRadMSFT @michaelgsharp Do you know who I can reach out to about this issue?

ericstj commented 1 year ago

@LittleLittleCloud can you repro as well with the latest example? If folks are suspecting some threading difference perhaps examining the process at the time the exception is thrown might reveal a stuck thread, or a blocked task? Those sort of differences can happen machine/machine (or framework/framework) if there is a race condition involved. I gave the example a try and it ran to completion for me 🤷‍♂️

NiPersson commented 1 year ago

That's an interesting find! @ericstj @JakeRadMSFT @michaelgsharp Do you know who I can reach out to about this issue?

@LittleLittleCloud Did you manage to reproduce the issue?

ericstj commented 1 year ago

@NiPersson -- I just noticed from the above exchange that the stack trace you captured was in the debugger. Just double checking -- are you able to reproduce this outside the debugger? Running async/multi-threaded code that depends on timeouts can often have different behavior under the debugger since it can freeze threads. Just double checking that we're all on the same page here WRT to repro steps.

LittleLittleCloud commented 1 year ago

@NiPersson Unfortunately, I still can't reproduce the exception with your latest code.

NiPersson commented 1 year ago

@NiPersson Unfortunately, I still can't reproduce the exception with your latest code.

@LittleLittleCloud A colleague and I tested this yesterday. For him the exception occurred only rarely (2/23) with the .NET 7 project; for me it occurred every time. The .NET Framework 4.8 project also misbehaved: I once got a missing LightGBM DLL exception, and my colleague got that exception every time. If you use Visual Studio, here is a Visual Studio project containing the .NET 7 version:

https://www.dropbox.com/scl/fi/ivt1ehfrdgmtl8kxuiz5p/MLTest.zip?rlkey=mt9k37qdmaomhhd9aqlgjs24c&dl=0

NiPersson commented 1 year ago

@NiPersson -- I just noticed from the above exchange that the stack trace you captured was in the debugger. Just double checking -- are you able to reproduce this outside the debugger? Running async/multi-threaded code that depends on timeouts can often have different behavior under the debugger since it can freeze threads. Just double checking that we're all on the same page here WRT to repro steps.

@ericstj I now tried it in release and the exception still occurs

NiPersson commented 1 year ago

I tested "my production code" (not the example code) with a real build on a production server, and it behaves better there than what I see locally. Only one exception occurred while creating the first 13 models.

LittleLittleCloud commented 1 year ago

@NiPersson I'm using .net 6, let me try on .net 7

NiPersson commented 1 year ago

@LittleLittleCloud Did you manage to reproduce it?

LittleLittleCloud commented 1 year ago

@NiPersson I tried; it still works. No luck reproducing it on .NET 7.

NiPersson commented 1 year ago

@LittleLittleCloud: If I have the Visual Studio "Enable Just My Code" setting selected, then the exception is masked in most cases for me. Could that be the explanation? Also, it does not seem to happen every time for every user; my colleague, as stated previously, only got the exception about 1 time in 10 on average.
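One possible reading of this, not confirmed anywhere in the thread, is that the debugger is breaking on a first-chance exception that is later handled inside library code, which would also fit the earlier remark that the stack trace is the expected one when time is up. A way to check this without a debugger is to log first-chance exceptions through `AppDomain.CurrentDomain.FirstChanceException`. The sketch below assumes nothing about ML.NET internals: `SimulatedTraining` is a hypothetical stand-in that throws and swallows an OperationCanceledException, and the module and function names are made up for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Runtime.ExceptionServices

Module FirstChanceDemo
    ' Type names of every exception raised while the handler is attached, handled or not.
    Private ReadOnly Seen As New List(Of String)

    Private Sub OnFirstChance(sender As Object, e As FirstChanceExceptionEventArgs)
        Seen.Add(e.Exception.GetType().Name)
    End Sub

    ' Hypothetical stand-in for library code that cancels its own work when a
    ' time budget expires and handles the resulting exception internally.
    Private Sub SimulatedTraining()
        Try
            Throw New OperationCanceledException("time budget reached")
        Catch ex As OperationCanceledException
            ' Swallowed here, so the caller never sees it.
        End Try
    End Sub

    ' Runs the stand-in with the handler attached and returns what was logged.
    Function RunDemo() As List(Of String)
        Seen.Clear()
        AddHandler AppDomain.CurrentDomain.FirstChanceException, AddressOf OnFirstChance
        Try
            SimulatedTraining()
        Finally
            RemoveHandler AppDomain.CurrentDomain.FirstChanceException, AddressOf OnFirstChance
        End Try
        Return New List(Of String)(Seen)
    End Function

    Sub Main()
        Console.WriteLine(String.Join(", ", RunDemo()))
    End Sub
End Module
```

If the real training call logs an OperationCanceledException this way but still returns a result, the exception was handled internally and only the debugger settings made it visible. If the exception escapes to the caller instead, it is a genuine failure.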

LittleLittleCloud commented 1 year ago

I ran the example quite a few times and didn't hit an exception. It works every time.

Could you join the ML.NET Discord channel and ping me there? It would be easier than debugging in a GitHub thread. You should be able to find the Discord link on the README page.

NiPersson commented 1 year ago

@LittleLittleCloud Hi again. Was "DotNetEvolution" the name of the channel? If so, I did not find you there. But that could just be my lack of experience, since I don't normally use Discord :)

LittleLittleCloud commented 1 year ago

@NiPersson Sorry, it should be this one: https://discord.com/invite/Atpktwt8

Once you are in, you can find me by my ID, #BigMiao.

@luisquintanilla Maybe we need to update the Discord link?

michaelgsharp commented 9 months ago

Closing as stale. If it's still a problem, please re-open or create a new issue.