Earlier discussion -- https://github.com/dotnet/machinelearning/pull/5445#pullrequestreview-552014908
My initial thoughts from https://github.com/dotnet/machinelearning/pull/5445#discussion_r543115446:
We could always duplicate the logger, or attach a logger to the new context that, when invoked, forwards the message to the original context.
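For illustration, a minimal relay sketch (the `relay` delegate and the variable names are made up; `MLContext.Log` cannot be raised from outside the context, so the forwarding has to go through a shared callback of some kind):

```csharp
using System;
using Microsoft.ML;

// Caller-side handler, standing in for whatever the user attached to the
// original context's log channel.
Action<string> relay = msg => Console.WriteLine(msg);

// Per-trial context created inside the experiment.
var trialContext = new MLContext();
trialContext.Log += (sender, e) =>
{
    // Forward each message produced by the trial's context back to the caller.
    relay(e.Message);
};
```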
@LittleLittleCloud: What type of message are you reading from the log? Log scraping is likely the only usable method currently.
In the longer term, we may want to have each component pass along a structured status message: { rows processed, percent complete, processing duration, current step name, memory, other stats }. ML.NET conveys very little information on the status of a training job.
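For illustration only, such a status message might look like the hypothetical shape below; no such type exists in ML.NET today:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical structured training-status message a component could emit
// alongside (or instead of) raw log text.
public sealed class TrainingStatus
{
    public string CurrentStepName { get; init; }
    public long RowsProcessed { get; init; }
    public double PercentComplete { get; init; }
    public TimeSpan ProcessingDuration { get; init; }
    public long MemoryBytes { get; init; }
    public IReadOnlyDictionary<string, double> OtherStats { get; init; }
}
```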
The output from MAML was sometimes sufficient (examples: 1, 2, 3, 4). These give some notion of the progress of the training job.
Related issues on having an output verbosity level besides zero & firehose:
To quote an earlier issue comment:
As mentioned in https://github.com/dotnet/machinelearning/issues/3235, MLContext.Log() doesn't have a verbosity selection, so it's more of a firehose. If a verbosity argument is added to MLContext.Log(), the log output from there should be human readable to see general progress. I believe it's still hidden within the firehose of output, and once the verbosity is scaled down, you should see messages like:
LightGBM objective=multiclassova
[7] 'Loading data for LightGBM' finished in 00:00:15.6600468.
[8] 'Training with LightGBM' started.
..................................................(00:30.58) 0/200 iterations
..................................................(01:00.9) 1/200 iterations
..................................................(01:31.2) 2/200 iterations
..................................................(02:01.4) 2/200 iterations
..................................................(02:31.9) 3/200 iterations
..................................................(03:02.5) 4/200 iterations
..................................................(03:32.9) 4/200 iterations
..................................................(04:03.6) 5/200 iterations
..................................................(04:34.4) 5/200 iterations
..................................................(05:04.8) 6/200 iterations
And naively extrapolating, there's around 2.7 hours left in the LightGBM training.
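For now, the closest thing to a verbosity knob is filtering on the subscriber side. A sketch, assuming the `Kind` and `Source` properties that recent ML.NET releases expose on `LoggingEventArgs`:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Runtime;

var mlContext = new MLContext();

// Client-side "verbosity": drop Trace-level messages and print the rest,
// tagged with their source component.
mlContext.Log += (sender, e) =>
{
    if (e.Kind != ChannelMessageKind.Trace)
        Console.WriteLine($"[{e.Source}] {e.Message}");
};
```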
System information
Issue
What might happen
After some investigation, I believe the error is caused by one of the latest changes to how a trial is launched. PR #5445 creates a new context for each trial instead of reusing the current context. So when I subscribe to the log channel through the API, I am actually listening to the current context's channel, where no trial is running. Since the new context in which the trial runs is not exposed externally, there is currently no way to peek at training progress.
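Roughly, the failing pattern looks like this (the AutoML experiment call below is only a stand-in to show where the subscription goes silent; the exact experiment type doesn't matter):

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

// The caller subscribes here, expecting to see per-trial trainer output...
mlContext.Log += (sender, e) => Console.WriteLine(e.Message);

// ...but each trial now runs on a freshly created internal MLContext,
// so its log messages never reach this handler.
var experiment = mlContext.Auto()
    .CreateMulticlassClassificationExperiment(maxExperimentTimeInSeconds: 60);
// experiment.Execute(trainData);  // training data omitted for brevity
```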