dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

AutoML - Can't get training progress during image training #5553

Closed LittleLittleCloud closed 3 years ago

LittleLittleCloud commented 3 years ago

System information

Issue

What might happen

After some investigation, I believe the error is caused by one of our latest changes to how a trial is launched. PR #5445 creates a new context when a trial starts instead of reusing the current context. So when I subscribe to the log channel through the API, I am actually listening to the current context's channel, where no trial is running. And because the new context that runs the trial is not exposed externally, there is currently no way to observe training progress.

justinormont commented 3 years ago

Earlier discussion -- https://github.com/dotnet/machinelearning/pull/5445#pullrequestreview-552014908

My initial thoughts from https://github.com/dotnet/machinelearning/pull/5445#discussion_r543115446:

We can always duplicate the logger. Or attach a logger to the new context, and when called, have it pass the message to the original context.
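The second option (attach a logger to the new context and relay each message to the original one) can be sketched as follows. This is a language-neutral illustration in Python, not ML.NET code: `ToyContext`, `subscribe`, `log`, and `start_trial` are hypothetical names standing in for a context with a log channel and for the code path that launches a trial.

```python
from typing import Callable, List


class ToyContext:
    """Hypothetical stand-in for a context that exposes a log channel."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        # Register a callback that receives every message logged on this context.
        self._subscribers.append(handler)

    def log(self, message: str) -> None:
        # Deliver the message to every subscriber of this context.
        for handler in self._subscribers:
            handler(message)


def start_trial(parent: ToyContext) -> ToyContext:
    """Create a fresh context for a trial and relay its log to the parent."""
    trial = ToyContext("trial")
    # The forwarding step: anything logged on the trial context is
    # re-logged on the parent, so existing subscribers still see it.
    trial.subscribe(parent.log)
    return trial


# The caller only knows about the parent context.
received: List[str] = []
parent = ToyContext("parent")
parent.subscribe(received.append)

# The trial logs on its own context; the message still reaches the caller.
trial = start_trial(parent)
trial.log("epoch 1/10")
print(received)
```

Without the `trial.subscribe(parent.log)` line, `received` would stay empty, which matches the behaviour described in the issue.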


@LittleLittleCloud: What type of message are you reading from the log? For now, log scraping is likely the only usable method.

Future

In the longer term, we may want to have each component pass along a structured status message: { rows processed, percent complete, processing duration, current step name, memory, other stats }. ML.NET conveys very little information on the status of a training job.
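As a rough illustration of what such a structured status message might look like, here is a minimal Python sketch. The type name `TrainingStatus`, the field names, the units, and the JSON serialization are all assumptions made for this example; nothing like it exists in ML.NET today, and the sample values are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class TrainingStatus:
    """Hypothetical status record mirroring the fields listed above."""

    rows_processed: int
    percent_complete: float
    processing_duration_s: float  # seconds spent in the current step (assumed unit)
    current_step_name: str
    memory_bytes: int  # assumed unit
    other_stats: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize to JSON so any consumer (CLI, UI, log file) can parse it.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
status = TrainingStatus(
    rows_processed=10_000,
    percent_complete=3.0,
    processing_duration_s=304.8,
    current_step_name="Training with LightGBM",
    memory_bytes=512 * 1024 * 1024,
    other_stats={"iterations_done": 6, "iterations_total": 200},
)
print(status.to_json())
```

A consumer could then read progress fields directly instead of scraping free-form log text.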

The output from MAML was sometimes sufficient (examples: 1, 2, 3, 4). These give some notion of the progress of the training job.

Related issues on having an output verbosity level besides zero & firehose:

To quote an earlier issue comment:

As mentioned in https://github.com/dotnet/machinelearning/issues/3235, MLContext.Log() doesn't have a verbosity selection, so it's more of a firehose.

If a verbosity argument is added to MLContext.Log(), the log output from there should be human readable to see general progress.

I believe it's still hidden within the firehose of output and once the verbosity is scaled down, you should see messages like:

LightGBM objective=multiclassova
[7] 'Loading data for LightGBM' finished in 00:00:15.6600468.
[8] 'Training with LightGBM' started.
..................................................(00:30.58)  0/200 iterations
..................................................(01:00.9)   1/200 iterations
..................................................(01:31.2)   2/200 iterations
..................................................(02:01.4)   2/200 iterations
..................................................(02:31.9)   3/200 iterations
..................................................(03:02.5)   4/200 iterations
..................................................(03:32.9)   4/200 iterations
..................................................(04:03.6)   5/200 iterations
..................................................(04:34.4)   5/200 iterations
..................................................(05:04.8)   6/200 iterations

And naively extrapolating (6 iterations in about 5 minutes is roughly 51 seconds per iteration, with 194 iterations to go), there's around 2.7 hours left in the LightGBM training.
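The naive extrapolation can be reproduced by scraping the progress lines. The sketch below is Python and assumes the line format shown in the log excerpt above, with the elapsed time in parentheses as `mm:ss` (an optional leading hours field is also accepted as a guess). The helper name `estimate_remaining_seconds` is made up for this example.

```python
import re
from typing import Optional

# Matches e.g. "(05:04.8)   6/200 iterations"; the hours field is optional.
PROGRESS = re.compile(
    r"\((?:(\d+):)?(\d+):(\d+(?:\.\d+)?)\)\s+(\d+)/(\d+) iterations"
)


def estimate_remaining_seconds(log_text: str) -> Optional[float]:
    """Linear extrapolation from the last progress line in the log text."""
    last = None
    for match in PROGRESS.finditer(log_text):
        last = match
    if last is None:
        return None  # no progress line found
    hours, minutes, seconds, done, total = last.groups()
    elapsed = int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)
    done, total = int(done), int(total)
    if done == 0:
        return None  # cannot extrapolate before the first iteration finishes
    # Average time per finished iteration times the iterations still to run.
    return elapsed / done * (total - done)


sample_log = """\
..................................................(04:34.4)   5/200 iterations
..................................................(05:04.8)   6/200 iterations
"""

remaining = estimate_remaining_seconds(sample_log)
print(f"about {remaining / 3600:.1f} hours left")  # about 2.7 hours left
```

This assumes every iteration takes about as long as the ones already completed, which is why the estimate is only a rough guide.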