dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

how to parse the ML.Net log file #4378

Closed PeterPann23 closed 5 years ago

PeterPann23 commented 5 years ago

What is the logic behind debug_log.txt? If one wanted to parse it, how would one go about it? It isn't documented as far as I can tell. I can find things in it, but I would like to load it into a structured format using C#.

I don't seem to have found a clever way to do this. Can someone point me in the right direction?

justinormont commented 5 years ago

It's not generally meant to be machine parsable. What are you looking to parse from it?

The most important lines begin with a pipe "|", as they form an ASCII-art table with a border. This gives you one line per iteration of the sweep, reporting the trainer and its metrics.

(base) MacOs:~ justinormont$ grep "|" /private/tmp/blah/Demo/logs/debug_log.txt
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
|1    AveragedPerceptronOva                       0.9672         0.9695      56.5          0                     |
|2    SdcaMaximumEntropyMulti                     0.9590         0.9632      56.6          0                     |
|3    LightGbmMulti                               0.9754         0.9729     126.7          0                     |
|4    SymbolicSgdLogisticRegressionOva            0.9344         0.9341      57.2          0                     |
|5    FastTreeOva                                 0.9918         0.9900      49.3          0                     |
|6    LinearSvmOva                                0.9508         0.9504      53.2          0                     |
|7    LbfgsLogisticRegressionOva                  0.9754         0.9700      54.4          0                     |
|8    SgdCalibratedOva                            0.9672         0.9629      53.9          0                     |
|9    FastForestOva                               0.9836         0.9838      41.9          0                     |
|10   LbfgsMaximumEntropyMulti                    0.9754         0.9700      54.8          0                     |
|11   FastTreeOva                                 1.0000         1.0000      77.7          0                     |
|                                                     Summary                                                    |
|ML Task: multiclass-classification                                                                              |
|Dataset: Demo.TRAIN.tsv                                                                                         |
|Label : Label                                                                                                   |
|Total experiment time : 904.33 Secs                                                                             |
|Total number of models explored: 11                                                                             |
|                                              Top 5 models explored                                             |
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
|1    FastTreeOva                                 1.0000         1.0000      77.7         11                     |
|2    FastTreeOva                                 0.9918         0.9900      49.3          5                     |
|3    FastForestOva                               0.9836         0.9838      41.9          9                     |
|4    LightGbmMulti                               0.9754         0.9729     126.7          3                     |
|5    LbfgsLogisticRegressionOva                  0.9754         0.9700      54.4          7                     |
(base) MacOs:~ justinormont$ 
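
If you do want to scrape those rows into something structured, a quick sketch along these lines works against the layout shown above (the file name, column order, and widths are assumptions taken from that output, not a stable contract, and the "Top 5 models" rows get picked up again):

// Quick sketch: scrape the "|" data rows of debug_log.txt into a structured form.
// Column layout (rank, trainer, micro/macro accuracy, duration, iteration) is assumed
// from the table above and may change between releases.
using System;
using System.Globalization;
using System.IO;
using System.Linq;

var runs = File.ReadLines("debug_log.txt")
    .Where(line => line.StartsWith("|"))
    .Select(line => line.Trim('|').Trim()
                        .Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
    .Where(parts => parts.Length == 6 && int.TryParse(parts[0], out _))   // keep data rows only
    .Select(parts => new
    {
        Rank          = int.Parse(parts[0]),
        Trainer       = parts[1],
        MicroAccuracy = double.Parse(parts[2], CultureInfo.InvariantCulture),
        MacroAccuracy = double.Parse(parts[3], CultureInfo.InvariantCulture),
        DurationSecs  = double.Parse(parts[4], CultureInfo.InvariantCulture),
        Iteration     = int.Parse(parts[5])
    })
    .ToList();

foreach (var run in runs)
    Console.WriteLine($"{run.Rank,3} {run.Trainer,-35} micro={run.MicroAccuracy:F4}");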

There are also MAML-ish lines printed, which give you a shorthand form of the pipeline created. These lines include the hyperparameters for the models.

If you're using the AutoML API (rather than the CLI or Model Builder), I would attach a progressHandler to log these and avoid log scraping.
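
Roughly, attaching a handler looks like this (a sketch only: the file name, label column, and time budget are illustrative, taken from the run above):

// Minimal sketch of the AutoML API with a progressHandler instead of log scraping.
// Multiclass task assumed; file name, label column, and time budget are illustrative.
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Infer columns from the same training file the CLI used, then load it.
var columnInference = mlContext.Auto().InferColumns("Demo.TRAIN.tsv", labelColumnName: "Label");
var trainData = mlContext.Data.CreateTextLoader(columnInference.TextLoaderOptions)
                              .Load("Demo.TRAIN.tsv");

// Log each iteration (trainer + metrics) as it completes.
var progressHandler = new Progress<RunDetail<MulticlassClassificationMetrics>>(run =>
    Console.WriteLine($"{run.TrainerName}: micro={run.ValidationMetrics?.MicroAccuracy:F4} " +
                      $"macro={run.ValidationMetrics?.MacroAccuracy:F4} ({run.RuntimeInSeconds:F1}s)"));

var experiment = mlContext.Auto()
    .CreateMulticlassClassificationExperiment(maxExperimentTimeInSeconds: 900);
var result = experiment.Execute(trainData, columnInference.ColumnInformation,
                                progressHandler: progressHandler);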

PeterPann23 commented 5 years ago

OK, so I could get the models it tried, along with the parameters it used, that way? I noticed that by taking a "winner" and altering it, I can try parameters it did not explore (such as the unbalanced data set option) and improve it further. Is there a sample one can look at for capturing the logs?

justinormont commented 5 years ago

Yes, the progressHandler returns the model and its metrics after each iteration.

AutoML should perhaps sweep over LightGBM's unbalanced-sets hyperparameter when the dataset statistics step finds the class skew to be high. Feel free to put in a PR if you're up for such a task.

For implementing a progressHandler, there's an example for each task in the samples repo:
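
In the meantime, a rough sketch of such a handler for the multiclass task (the class name is illustrative; adapt the metrics type for other tasks):

// Rough sketch of a reusable progress handler for the multiclass task; the class name is
// illustrative. Swap the metrics type (BinaryClassificationMetrics, RegressionMetrics, ...)
// for other tasks.
using System;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class MulticlassProgressHandler : IProgress<RunDetail<MulticlassClassificationMetrics>>
{
    public void Report(RunDetail<MulticlassClassificationMetrics> run)
    {
        // Failed iterations surface through the Exception property rather than metrics.
        if (run.Exception != null)
        {
            Console.WriteLine($"{run.TrainerName} failed: {run.Exception.Message}");
            return;
        }

        Console.WriteLine($"{run.TrainerName,-40} " +
                          $"micro={run.ValidationMetrics.MicroAccuracy:F4} " +
                          $"macro={run.ValidationMetrics.MacroAccuracy:F4} " +
                          $"{run.RuntimeInSeconds:F1}s");
    }
}

Pass an instance of it as the progressHandler argument of Execute.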

PeterPann23 commented 5 years ago

How does one get to the hyperparameters used? Is there a way to get the options?

justinormont commented 5 years ago

The hyperparameters used in the models are not exposed publicly. They are in the callback, but not public.

You're looking for the Pipeline within the RunDetail that is sent to your progressHandler callback. It isn't public, though you can see it in the debugger or get at it with reflection. That, of course, can break in the future.

https://github.com/dotnet/machinelearning/blob/e50c4d20012e0d62852f404ae443afca7dad043e/src/Microsoft.ML.AutoML/API/RunDetails/RunDetail.cs#L65-L109
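
If you do go the reflection route, something like the sketch below works against the property shown in that source (unsupported, and liable to break whenever the internals move):

// Rough, unsupported sketch: pull the internal Pipeline off a RunDetail via reflection.
// "Pipeline" is the internal property on RunDetail in the source linked above; being
// internal, it can change without notice in any release.
using System.Reflection;
using Microsoft.ML.AutoML;

static object GetInternalPipeline(RunDetail runDetail)
{
    var prop = typeof(RunDetail)
        .GetProperty("Pipeline", BindingFlags.Instance | BindingFlags.NonPublic);
    return prop?.GetValue(runDetail);   // internal Microsoft.ML.AutoML.Pipeline, or null
}

From there, inspect the node list and its hyperparameter values in the debugger, or keep reflecting over it; either way, treat it as an implementation detail.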

PeterPann23 commented 5 years ago

It would be really helpful to get the hyperparameters in order to manually tune the model. How does the AutoML CLI print the hyperparameters to the log? Is there a way to get them so one can improve on them? I know it generates the code for the winner, but I might not always agree with which model the winner is.

PeterPann23 commented 5 years ago

OK, I needed to implement my own ETL. Not sure if you would like to share this after "cleaning it up", but I reverse engineered the log format and it does what I need it to do.

Like any ETL, it may not survive the next update of the code generating the file, but it works for now, and we can process the dataset and have the ML.NET CLI "suggest" some pipelines.

It's not the fastest or most elegant, but it parses my directories in under a second on my PC and gives me a JSON dataset that I can use to inject defaults, like this:

var options = new FastForestBinaryTrainer.Options
{
    LabelColumnName            = "Trend",
    DiskTranspose              = true,
    NumberOfLeaves             = GetOrDefault<FastForestBinaryTrainer.Options>(nameof(FastForestBinaryTrainer.Options.NumberOfLeaves), 90),
    MinimumExampleCountPerLeaf = GetOrDefault<FastForestBinaryTrainer.Options>(nameof(FastForestBinaryTrainer.Options.MinimumExampleCountPerLeaf), 50),
    NumberOfTrees              = GetOrDefault<FastForestBinaryTrainer.Options>(nameof(FastForestBinaryTrainer.Options.NumberOfTrees), 100),
    GainConfidenceLevel        = GetOrDefault<FastForestBinaryTrainer.Options>(nameof(FastForestBinaryTrainer.Options.GainConfidenceLevel), 0.7)
};

It could be that I miss a few entries, but here is the output: data.zip. Here is the code: logparser.zip
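
In essence, GetOrDefault just looks up the value the sweep used for that option name and falls back to the supplied default; a simplified, hypothetical sketch of it (the real one is in logparser.zip):

// Hypothetical sketch only (the real helper is in the attached logparser.zip): look up the
// value the sweep used for a given option on a given trainer-options type, falling back to
// the supplied default when the parsed log has nothing for it.
using System;
using System.Collections.Generic;
using System.Globalization;

static class SweptDefaults
{
    // Filled by the log parser: options-type name -> option name -> value as printed in the log.
    static readonly Dictionary<string, Dictionary<string, string>> Parsed =
        new Dictionary<string, Dictionary<string, string>>();

    public static dynamic GetOrDefault<TOptions>(string optionName, dynamic defaultValue)
    {
        if (Parsed.TryGetValue(typeof(TOptions).FullName, out var options) &&
            options.TryGetValue(optionName, out var raw))
        {
            // Log values are text; convert to the runtime type of the supplied default.
            return Convert.ChangeType(raw, defaultValue.GetType(), CultureInfo.InvariantCulture);
        }
        return defaultValue;
    }
}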

Feel free to close the issue

justinormont commented 5 years ago

Nice parser. As you say, a stronger API would make this process easier. It would take some thought on how to make a clean API to expose the hyperparameter values.

Perhaps we could expose an options object for each trainer, though that runs into a typing issue: each trainer's options object is a different type.

We're also looking to alter many more pipeline parameters per sweep iteration, which would make exposing the full pipeline with its options in a clean manner far more complex.

Closing as requested.