dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

[ML.Net c#, CLI, VS Builder] 1KB csv input file. Not sure what to do + NoColumn how-to #6309

Closed: wil70 closed this issue 2 years ago

wil70 commented 2 years ago

I have a simple data set with 2 fields: c10 and c11. c10 is a float, c11 is a string. The first row is the header.

        c10,c11
        -1,a-1
        1,a+1
        -1,a-1
        1,a+1
        0,a+0
        1,a+1
        1,a+1
        -1,a-1
        1,a+1
        1,a+1
        -1,a-1
        -1,a-1
        -1,a-1
        1,a+1
        -1,a-1
        -1,a-1
        -1,a-1
        0,a+0
        1,a+1
        1,a+1
        1,a+1
        1,a+1
        -1,a-1
        -1,a-1

As you can see, this is very easy to solve visually:

  if -1 is present, the answer is a-1
  if +1 is present, the answer is a+1
  if 0 is present, the answer is a+0

If I run AutoML with the VS Model Builder UI, it crashes at the end with this:

   at System.Version.VersionResult.SetFailure(ParseFailureKind failure, String argument)
   at System.Version.TryParseVersion(String version, VersionResult& result)
   at System.Version.Parse(String input)
   at System.Version..ctor(String version)
   at Microsoft.ML.ModelBuilder.Utils.Utilities.InstalledVersionNeedsUpdate(String installedString, String requestedString)
   at Microsoft.ML.ModelBuilder.Utils.Utilities.<InstallNugetPackageAsync>d__17.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.ML.ModelBuilder.ViewModels.TrainViewModel.<UpdateNugetDependenciesAsync>d__105.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at Microsoft.ML.ModelBuilder.ViewModels.TrainViewModel.<GenerateCodeBehindFilesAsync>d__104.MoveNext()

Here is the log:

        Set log file path to C:\Users\Wilhelm\AppData\Local\Temp\MLVSTools\logs\MLModel1-JSSGYE.txt
        start nni training
        Experiment output folder: C:\Users\Wilhelm\AppData\Local\Temp\AutoML-NNI\Experiment-4BLNIV
        |     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
        |0    SdcaMaximumEntropyMulti                     0.8250         0.8167       1.7          0                     |
        |1    SdcaLogisticRegressionOva                   0.8250         0.8167       3.5          1                     |
        |2    FastTreeOva                                 0.3967         0.3333       0.9          2                     |
        |3    LightGbmMulti                               0.0783         0.1333       0.2          3                     |
        |4    FastForestOva                               0.1783         0.2333       0.9          4                     |
        |5    SdcaLogisticRegressionOva                   0.8250         0.8167       3.4          5                     |
        |6    SdcaMaximumEntropyMulti                     0.8250         0.8167       0.9          6                     |
        |7    LbfgsMaximumEntropyMulti                    0.8250         0.8167       0.2          7                     |
        |8    LbfgsLogisticRegressionOva                  0.8250         0.8167       0.2          8                     |
        |9    FastTreeOva                                 0.8550         0.8167       0.9          9                     |
        |10   LightGbmMulti                               0.0783         0.1333       0.1         10                     |
        |11   SdcaLogisticRegressionOva                   0.8250         0.8167       3.5         11                     |
        |12   FastForestOva                               0.1783         0.2333       1.1         12                     |
        |13   SdcaMaximumEntropyMulti                     0.8250         0.8167       0.9         13                     |
        |14   LbfgsMaximumEntropyMulti                    0.8250         0.8167       0.1         14                     |
        |15   LightGbmMulti                               0.0783         0.1333       0.1         15                     |
        |16   FastTreeOva                                 0.3967         0.3333       1.2         16                     |
        |17   LbfgsLogisticRegressionOva                  0.6450         0.6667       0.2         17                     |
        |18   SdcaMaximumEntropyMulti                     0.3967         0.3333       0.9         18                     |
        |19   SdcaLogisticRegressionOva                   0.8250         0.8167       3.6         19                     |
        |20   FastForestOva                               0.1783         0.2333       1.2         20                     |
        |21   LbfgsMaximumEntropyMulti                    0.0783         0.1333       0.1         21                     |
        |22   FastTreeOva                                 0.9000         0.9000       1.3         22                     |
        |23   LightGbmMulti                               0.0783         0.1333       0.1         23                     |
        |24   SdcaMaximumEntropyMulti                     0.8250         0.8167       0.9         24                     |
        |25   LbfgsLogisticRegressionOva                  0.8250         0.8167       0.2         25                     |
        |26   FastForestOva                               0.1783         0.2333       1.5         26                     |
        |27   SdcaLogisticRegressionOva                   0.8250         0.8167       3.6         27                     |
        |28   FastTreeOva                                 0.9000         0.9000       1.8         28                     |
        |29   LbfgsMaximumEntropyMulti                    0.8250         0.8167       0.1         29                     |
        |30   LbfgsLogisticRegressionOva                  0.8250         0.8167       0.2         30                     |
        |31   SdcaMaximumEntropyMulti                     0.8250         0.8167       0.9         31                     |
        |32   LightGbmMulti                               0.0783         0.1333       0.1         32                     |
        |34   LightGbmMulti                               0.0783         0.1333       0.1         34                     |
        |35   SdcaLogisticRegressionOva                   0.8250         0.8167       3.7         35                     |
        |36   FastTreeOva                                 0.3967         0.3333       1.7         36                     |
        |37   SdcaMaximumEntropyMulti                     0.8250         0.8167       0.9         37                     |
        |38   LightGbmMulti                               0.0783         0.1333       0.1         38                     |
        |39   LbfgsMaximumEntropyMulti                    0.8250         0.8167       0.1         39                     |
        |40   LbfgsLogisticRegressionOva                  0.2283         0.2833       0.2         40                     |
        |41   FastForestOva                               0.1783         0.2333       1.9         41                     |
        |42   FastForestOva                               0.1783         0.2333       1.8         42                     |
        |43   SdcaLogisticRegressionOva                   0.3967         0.3333       3.7         43                     |
        |44   FastTreeOva                                 0.3967         0.3333       2.0         44                     |
        |45   LbfgsLogisticRegressionOva                  0.8250         0.8167       0.2         45                     |
        |46   LbfgsMaximumEntropyMulti                    0.8500         0.8500       0.1         46                     |
        |47   SdcaMaximumEntropyMulti                     0.8250         0.8167       1.0         47                     |
        |48   LightGbmMulti                               0.0783         0.1333       0.1         48                     |
        |49   FastForestOva                               0.1783         0.2333       2.1         49                     |

        ===============================================Experiment Results=================================================
        ------------------------------------------------------------------------------------------------------------------
        |                                                     Summary                                                    |
        ------------------------------------------------------------------------------------------------------------------
        |ML Task: Classification                                                                                         |
        |Dataset: S:\CATS\files\data_analysis\output\AggregatedFile\small_2.csv                                          |
        |Label : c11                                                                                                     |
        |Total experiment time : 56.28 Secs                                                                              |
        |Total number of models explored: 49                                                                             |
        ------------------------------------------------------------------------------------------------------------------

        |                                              Top 5 models explored                                             |
        ------------------------------------------------------------------------------------------------------------------
        |     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
        |28   FastTreeOva                                 0.9000         0.9000       1.8         28                     |
        |22   FastTreeOva                                 0.9000         0.9000       1.3         22                     |
        |9    FastTreeOva                                 0.8550         0.8167       0.9          9                     |
        |45   LbfgsMaximumEntropyMulti                    0.8500         0.8500       0.1         45                     |
        |46   SdcaMaximumEntropyMulti                     0.8250         0.8167       1.0         46                     |
        ------------------------------------------------------------------------------------------------------------------

        Generate code behind files

        Copying generated code to project...
        Copying MLModel1.consumption.cs to folder: G:\Users\Wilhelm\dev\MachineLearning\ML1
        Copying MLModel1.training.cs to folder: G:\Users\Wilhelm\dev\MachineLearning\ML1
        COMPLETED

        Updating nuget dependencies...
        Starting update NuGet dependencies async.
        Installing nuget package, package ID: Microsoft.ML, package Version: 1.7.1

I extended the training time.

Crash:

   at System.Version.VersionResult.SetFailure(ParseFailureKind failure, String argument)
   at System.Version.TryParseVersion(String version, VersionResult& result)
   at System.Version.Parse(String input)
   at System.Version..ctor(String version)
   at Microsoft.ML.ModelBuilder.Utils.Utilities.InstalledVersionNeedsUpdate(String installedString, String requestedString)
   at Microsoft.ML.ModelBuilder.Utils.Utilities.<InstallNugetPackageAsync>d__17.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.ML.ModelBuilder.ViewModels.TrainViewModel.<UpdateNugetDependenciesAsync>d__105.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at Microsoft.ML.ModelBuilder.ViewModels.TrainViewModel.<GenerateCodeBehindFilesAsync>d__104.MoveNext()

Log:

    Generate code behind files

    Copying generated code to project...
    Copying MLModel1.consumption.cs to folder: G:\Users\Wilhelm\dev\MachineLearning\ML1
    Copying MLModel1.training.cs to folder: G:\Users\Wilhelm\dev\MachineLearning\ML1
    COMPLETED

    Updating nuget dependencies...
    Starting update NuGet dependencies async.
    Installing nuget package, package ID: Microsoft.ML, package Version: 1.7.1
    start nni training
    Experiment output folder: C:\Users\Wilhelm\AppData\Local\Temp\AutoML-NNI\Experiment-O11PPN
    |     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
    |0    SdcaMaximumEntropyMulti                     0.8000         0.8000       1.6          0                     |
    |1    LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.2          1                     |
    |2    FastForestOva                               0.4067         0.4500       0.9          2                     |
    |3    SdcaMaximumEntropyMulti                     0.8000         0.8000       0.9          3                     |
    |4    SdcaLogisticRegressionOva                   0.8000         0.8000       3.4          4                     |
    |5    FastTreeOva                                 0.3400         0.3500       0.8          5                     |
    |6    LbfgsLogisticRegressionOva                  0.8000         0.8000       0.1          6                     |
    |7    LightGbmMulti                               0.3067         0.3500       0.2          7                     |
    |8    FastForestOva                               0.4067         0.4500       0.9          8                     |
    |9    SdcaMaximumEntropyMulti                     0.3400         0.3500       0.9          9                     |
    |10   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         10                     |
    |11   LightGbmMulti                               0.3067         0.3500       0.1         11                     |
    |12   FastTreeOva                                 0.3400         0.3500       1.0         12                     |
    |13   SdcaLogisticRegressionOva                   0.8000         0.8000       3.4         13                     |
    |14   FastForestOva                               0.4067         0.4500       1.0         14                     |
    |15   LbfgsLogisticRegressionOva                  0.8000         0.8000       0.1         15                     |
    |16   FastTreeOva                                 0.3400         0.3500       1.1         16                     |
    |17   LightGbmMulti                               0.3067         0.3500       0.1         17                     |
    |18   SdcaMaximumEntropyMulti                     0.8000         0.8000       0.9         18                     |
    |19   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         19                     |
    |20   FastForestOva                               0.4067         0.4500       1.2         20                     |
    |21   SdcaLogisticRegressionOva                   0.3400         0.3500       3.5         21                     |
    |22   FastTreeOva                                 0.3400         0.3500       1.2         22                     |
    |23   LbfgsLogisticRegressionOva                  0.8000         0.8000       0.2         23                     |
    |24   SdcaMaximumEntropyMulti                     0.8000         0.8000       0.9         24                     |
    |25   LightGbmMulti                               0.3067         0.3500       0.1         25                     |
    |26   SdcaMaximumEntropyMulti                     0.8000         0.8000       0.9         26                     |
    |27   FastForestOva                               0.4067         0.4500       1.4         27                     |
    |28   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         28                     |
    |29   SdcaLogisticRegressionOva                   0.8000         0.8000       3.5         29                     |
    |30   SdcaMaximumEntropyMulti                     0.3400         0.3500       1.0         30                     |
    |31   LbfgsLogisticRegressionOva                  0.3067         0.3500       0.2         31                     |
    |32   FastTreeOva                                 0.3400         0.3500       1.5         32                     |
    |33   LightGbmMulti                               0.3067         0.3500       0.1         33                     |
    |34   SdcaLogisticRegressionOva                   0.3400         0.3500       3.5         34                     |
    |35   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         35                     |
    |36   FastForestOva                               0.4067         0.4500       1.7         36                     |
    |37   LbfgsLogisticRegressionOva                  0.8000         0.8000       0.2         37                     |
    |38   SdcaMaximumEntropyMulti                     0.8000         0.8000       0.9         38                     |
    |39   LightGbmMulti                               0.3067         0.3500       0.1         39                     |
    |40   FastTreeOva                                 0.3400         0.3500       1.7         40                     |
    |41   FastForestOva                               0.4067         0.4500       1.9         41                     |
    |42   LbfgsMaximumEntropyMulti                    0.3067         0.3500       0.1         42                     |
    |43   SdcaLogisticRegressionOva                   0.3400         0.3500       3.6         43                     |
    |44   LbfgsLogisticRegressionOva                  0.3067         0.3500       0.2         44                     |
    |45   SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0         45                     |
    |46   FastTreeOva                                 0.3400         0.3500       1.9         46                     |
    |47   LightGbmMulti                               0.3067         0.3500       0.1         47                     |
    |48   FastForestOva                               0.4067         0.4500       2.0         48                     |
    |49   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         49                     |
    |50   SdcaMaximumEntropyMulti                     0.3400         0.3500       1.0         50                     |
    |51   LbfgsLogisticRegressionOva                  0.8000         0.8000       0.2         51                     |
    |52   SdcaLogisticRegressionOva                   0.8000         0.8000       3.7         52                     |
    |53   FastTreeOva                                 0.3400         0.3500       2.1         53                     |
    |54   SdcaMaximumEntropyMulti                     0.8000         0.8000       0.9         54                     |
    |55   LightGbmMulti                               0.3067         0.3500       0.1         55                     |
    |57   SdcaLogisticRegressionOva                   0.8000         0.8000       3.7         57                     |
    |58   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         58                     |
    |59   SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0         59                     |
    |60   LbfgsLogisticRegressionOva                  0.3067         0.3500       0.2         60                     |
    |61   FastTreeOva                                 0.3400         0.3500       2.4         61                     |
    |62   SdcaLogisticRegressionOva                   0.8000         0.8000       3.6         62                     |
    |63   LightGbmMulti                               0.3067         0.3500       0.1         63                     |
    |64   SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0         64                     |
    |66   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         66                     |
    |67   LbfgsLogisticRegressionOva                  0.8000         0.8000       0.2         67                     |
    |68   FastTreeOva                                 0.3400         0.3500       2.5         68                     |
    |69   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         69                     |
    |70   SdcaLogisticRegressionOva                   0.8000         0.8000       3.7         70                     |
    |71   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         71                     |
    |72   FastForestOva                               0.4067         0.4500       2.6         72                     |
    |73   SdcaMaximumEntropyMulti                     0.3400         0.3500       1.0         73                     |
    |74   LightGbmMulti                               0.3067         0.3500       0.1         74                     |
    |75   FastTreeOva                                 0.3400         0.3500       2.6         75                     |
    |76   LbfgsLogisticRegressionOva                  0.3067         0.3500       0.2         76                     |
    |77   FastForestOva                               0.4067         0.4500       2.7         77                     |
    |78   SdcaLogisticRegressionOva                   0.3400         0.3500       3.6         78                     |
    |79   LightGbmMulti                               0.3067         0.3500       0.1         79                     |
    |80   SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0         80                     |
    |81   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         81                     |
    |82   FastForestOva                               0.4067         0.4500       2.8         82                     |
    |83   LbfgsLogisticRegressionOva                  0.8000         0.8000       0.2         83                     |
    |84   SdcaLogisticRegressionOva                   0.8000         0.8000       3.6         84                     |
    |85   FastTreeOva                                 0.3400         0.3500       2.9         85                     |
    |86   LightGbmMulti                               0.3067         0.3500       0.1         86                     |
    |87   FastForestOva                               0.4067         0.4500       3.0         87                     |
    |88   LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3         88                     |
    |89   SdcaMaximumEntropyMulti                     0.8000         0.8000       0.9         89                     |
    |90   LbfgsMaximumEntropyMulti                    0.3067         0.3500       0.1         90                     |
    |91   FastForestOva                               0.4067         0.4500       3.1         91                     |
    |92   FastTreeOva                                 0.3400         0.3500       3.1         92                     |
    |93   SdcaLogisticRegressionOva                   0.3400         0.3500       3.7         93                     |
    |94   LbfgsLogisticRegressionOva                  0.3067         0.3500       0.2         94                     |
    |95   LightGbmMulti                               0.3067         0.3500       0.1         95                     |
    |96   LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1         96                     |
    |97   SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0         97                     |
    |98   FastForestOva                               0.4067         0.4500       3.5         98                     |
    |99   SdcaLogisticRegressionOva                   0.8000         0.8000       3.7         99                     |
    |100  FastTreeOva                                 0.3400         0.3500       3.3        100                     |
    |101  LightGbmMulti                               0.3067         0.3500       0.1        101                     |
    |102  SdcaLogisticRegressionOva                   0.8000         0.8000       3.7        102                     |
    |103  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.2        103                     |
    |104  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        104                     |
    |105  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        105                     |
    |106  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        106                     |
    |107  FastForestOva                               0.4067         0.4500       3.5        107                     |
    |108  FastTreeOva                                 0.3400         0.3500       3.5        108                     |
    |109  SdcaLogisticRegressionOva                   0.8000         0.8000       3.8        109                     |
    |110  LightGbmMulti                               0.3067         0.3500       0.1        110                     |
    |111  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        111                     |
    |112  FastForestOva                               0.4067         0.4500       3.6        112                     |
    |113  FastTreeOva                                 0.3400         0.3500       3.6        113                     |
    |114  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.2        114                     |
    |115  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        115                     |
    |116  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        116                     |
    |117  LightGbmMulti                               0.3067         0.3500       0.1        117                     |
    |118  SdcaLogisticRegressionOva                   0.3400         0.3500       3.8        118                     |
    |119  FastForestOva                               0.4067         0.4500       3.7        119                     |
    |120  FastTreeOva                                 0.3400         0.3500       4.2        120                     |
    |121  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        121                     |
    |122  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        122                     |
    |123  LightGbmMulti                               0.3067         0.3500       0.1        123                     |
    |124  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        124                     |
    |125  SdcaLogisticRegressionOva                   0.8000         0.8000       3.8        125                     |
    |126  FastForestOva                               0.4067         0.4500       3.8        126                     |
    |127  LightGbmMulti                               0.3067         0.3500       0.1        127                     |
    |128  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        128                     |
    |129  FastTreeOva                                 0.3400         0.3500       4.0        129                     |
    |130  SdcaLogisticRegressionOva                   0.8000         0.8000       3.7        130                     |
    |131  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        131                     |
    |132  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        132                     |
    |133  LightGbmMulti                               0.3067         0.3500       0.2        133                     |
    |134  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        134                     |
    |135  FastForestOva                               0.4067         0.4500       4.6        135                     |
    |136  SdcaLogisticRegressionOva                   0.8000         0.8000       3.8        136                     |
    |137  FastTreeOva                                 0.3400         0.3500       4.5        137                     |
    |138  LightGbmMulti                               0.3067         0.3500       0.1        138                     |
    |139  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        139                     |
    |140  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        140                     |
    |141  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        141                     |
    |142  FastForestOva                               0.4067         0.4500       4.5        142                     |
    |143  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        143                     |
    |144  SdcaLogisticRegressionOva                   0.8000         0.8000       3.9        144                     |
    |145  FastTreeOva                                 0.3400         0.3500       4.2        145                     |
    |146  LightGbmMulti                               0.3067         0.3500       0.1        146                     |
    |147  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        147                     |
    |148  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        148                     |
    |149  FastForestOva                               0.4067         0.4500       4.3        149                     |
    |150  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        150                     |
    |151  LightGbmMulti                               0.3067         0.3500       0.1        151                     |
    |152  SdcaLogisticRegressionOva                   0.8000         0.8000       3.8        152                     |
    |153  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.1        153                     |
    |154  FastTreeOva                                 0.3400         0.3500       4.6        154                     |
    |155  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        155                     |
    |156  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        156                     |
    |157  SdcaLogisticRegressionOva                   0.8000         0.8000       3.8        157                     |
    |158  FastForestOva                               0.4067         0.4500       5.1        158                     |
    |159  LightGbmMulti                               0.3067         0.3500       0.1        159                     |
    |160  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        160                     |
    |161  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        161                     |
    |162  FastTreeOva                                 0.3400         0.3500       5.0        162                     |
    |163  SdcaLogisticRegressionOva                   0.8000         0.8000       3.9        163                     |
    |164  LightGbmMulti                               0.3067         0.3500       0.2        164                     |
    |166  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        166                     |
    |167  LightGbmMulti                               0.3067         0.3500       0.1        167                     |
    |168  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.2        168                     |
    |169  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        169                     |
    |170  SdcaLogisticRegressionOva                   0.8000         0.8000       3.9        170                     |
    |171  FastForestOva                               0.4067         0.4500       4.7        171                     |
    |172  FastTreeOva                                 0.3400         0.3500       5.1        172                     |
    |173  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        173                     |
    |174  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        174                     |
    |175  LightGbmMulti                               0.3067         0.3500       0.2        175                     |
    |177  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        177                     |
    |178  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        178                     |
    |179  SdcaLogisticRegressionOva                   0.8000         0.8000       3.9        179                     |
    |180  FastTreeOva                                 0.3400         0.3500       4.8        180                     |
    |181  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        181                     |
    |182  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.1        182                     |
    |183  FastForestOva                               0.4067         0.4500       5.0        183                     |
    |184  LightGbmMulti                               0.3067         0.3500       0.2        184                     |
    |185  SdcaLogisticRegressionOva                   0.8000         0.8000       3.9        185                     |
    |186  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.1        186                     |
    |187  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        187                     |
    |188  FastTreeOva                                 0.3400         0.3500       5.0        188                     |
    |189  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        189                     |
    |190  SdcaLogisticRegressionOva                   0.8000         0.8000       4.0        190                     |
    |191  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.2        191                     |

    ===============================================Experiment Results=================================================
    ------------------------------------------------------------------------------------------------------------------
    |                                                     Summary                                                    |
    ------------------------------------------------------------------------------------------------------------------
    |ML Task: Classification                                                                                         |
    |Dataset: S:\CATS\files\data_analysis\output\AggregatedFile\small_2.csv                                          |
    |Label : c11                                                                                                     |
    |Total experiment time : 295.89 Secs                                                                             |
    |Total number of models explored: 188                                                                            |
    ------------------------------------------------------------------------------------------------------------------

    |                                              Top 5 models explored                                             |
    ------------------------------------------------------------------------------------------------------------------
    |     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
    |187  LbfgsMaximumEntropyMulti                    0.8000         0.8000       0.2        187                     |
    |186  SdcaLogisticRegressionOva                   0.8000         0.8000       4.0        186                     |
    |185  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.0        185                     |
    |183  LbfgsLogisticRegressionOva                  0.8000         0.8000       0.3        183                     |
    |182  SdcaMaximumEntropyMulti                     0.8000         0.8000       1.1        182                     |
    ------------------------------------------------------------------------------------------------------------------

    Generate code behind files

    Copying generated code to project...
    Copying MLModel1.consumption.cs to folder: G:\Users\Wilhelm\dev\MachineLearning\ML1
    Copying MLModel1.training.cs to folder: G:\Users\Wilhelm\dev\MachineLearning\ML1
    COMPLETED

    Updating nuget dependencies...
    Starting update NuGet dependencies async.
    Installing nuget package, package ID: Microsoft.ML, package Version: 1.7.1

I would have expected it to find a good result.

So I tried with one algorithm via C# and got this:

    Processing LightGbm
    BestRun TrainerName:        LightGbmMulti
        - MicroAccuracy:        1
        - MacroAccuracy:        1
        - LogLoss:              0.08765842991120704
        - CLogLossReduction:    Infinity

    Class log loss:
        - class 1: 0
        - class 2: 0.09051613072327441
        - class 3: 0

    ConfusionMatrix.PerClassPrecision:
    class 1: 0
    class 2: 1
    class 3: 0

    ConfusionMatrix.PerClassPrecision:
        - class 1: 0, 0, 0,
        - class 2: 0, 3, 0,
        - class 3: 0, 0, 0,
    Saving model!!!
    Done saving...

Shouldn't the Confusion Matrix look like this?

  • class 1: 11, 0, 0,
  • class 2: 0, 11, 0,
  • class 3: 0, 0, 2,

I'm running the other trainers and will post once done... right now it seems stuck on AveragedPerceptronOva; it's been 20 minutes already.

Update after some time (I kill the program when a trainer takes too long so it can move on to the next one):

      Processing LightGbm
      BestRun TrainerName:        LightGbmMulti
          - MicroAccuracy:        1
          - MacroAccuracy:        1
          - LogLoss:              0.08765842991120704
          - CLogLossReduction:    Infinity

      Class log loss:
          - class 1: 0
          - class 2: 0.09051613072327441
          - class 3: 0

      ConfusionMatrix.PerClassPrecision:
      class 1: 0
      class 2: 1
      class 3: 0

      ConfusionMatrix.PerClassPrecision:
          - class 1: 0, 0, 0,
          - class 2: 0, 3, 0,
          - class 3: 0, 0, 0,
      Saving model!!!
      Done saving...

      Duration: 00:00:00.7684543
      ---------------------

      Processing AveragedPerceptronOva
      ^CTerminate batch job (Y/N)? n

      --------------------------------
      Processing FastForestOva
      BestRun TrainerName:        FastForestOva
          - MicroAccuracy:        1
          - MacroAccuracy:        1
          - LogLoss:              0.13785668008671875
          - CLogLossReduction:    Infinity

      Class log loss:
          - class 1: 0
          - class 2: 0.1283592552410416
          - class 3: 0

      ConfusionMatrix.PerClassPrecision:
      class 1: 0
      class 2: 1
      class 3: 0

      ConfusionMatrix.PerClassPrecision:
          - class 1: 0, 0, 0,
          - class 2: 0, 3, 0,
          - class 3: 0, 0, 0,
      Saving model!!!
      Done saving...

      Duration: 00:00:01.3639996
      ---------------------

      Processing FastTreeOva
      BestRun TrainerName:        FastTreeOva
          - MicroAccuracy:        1
          - MacroAccuracy:        1
          - LogLoss:              0.38632734814558706
          - CLogLossReduction:    Infinity

      Class log loss:
          - class 1: 0
          - class 2: 0.3665745755147897
          - class 3: 0

      ConfusionMatrix.PerClassPrecision:
      class 1: 0
      class 2: 1
      class 3: 0

      ConfusionMatrix.PerClassPrecision:
          - class 1: 0, 0, 0,
          - class 2: 0, 3, 0,
          - class 3: 0, 0, 0,
      Saving model!!!
      Done saving...

      Duration: 00:00:02.0520908
      ---------------------

      Processing LinearSupportVectorMachinesOva
      ^CTerminate batch job (Y/N)? n

      -----------------------------------------
      Processing LbfgsMaximumEntropy
      ^CTerminate batch job (Y/N)? n

      -----------------------------------------------------------
      Processing LbfgsLogisticRegressionOva
      ^CTerminate batch job (Y/N)? n

     -------------------------------------------------------
      Processing SdcaMaximumEntropy
      BestRun TrainerName:        SdcaMaximumEntropyMulti
          - MicroAccuracy:        1
          - MacroAccuracy:        1
          - LogLoss:              0.0034730878249029256
          - CLogLossReduction:    Infinity

      Class log loss:
          - class 1: 0
          - class 2: 0.0021820014848649345
          - class 3: 0

      ConfusionMatrix.PerClassPrecision:
      class 1: 0
      class 2: 1
      class 3: 0

      ConfusionMatrix.PerClassPrecision:
          - class 1: 0, 0, 0,
          - class 2: 0, 3, 0,
          - class 3: 0, 0, 0,
      Saving model!!!
      Done saving...

      Duration: 00:00:02.0249306
      ---------------------

      Processing SgdCalibratedOva
      ^CTerminate batch job (Y/N)? n

      -------------------------------
      Processing SymbolicSgdLogisticRegressionOva
      ^CTerminate batch job (Y/N)? n


LittleLittleCloud commented 2 years ago

@wil70

The log in VS Model Builder indicates the failure comes from restoring the NuGet package, not from AutoML, and it should be ignorable.

Also, the dataset looks to be imbalanced: the label a+0 appears much less often than the other two labels. In fact, a+0 only appears 2 times, which can cause the label to be missing from the training set (after the train/test split, a+0 is missing from the training dataset). After I manually balanced the dataset by adding more a+0 rows, the training results look better:

[image: training results after balancing the dataset]

Shouldn't the Confusion Matrix look like this?

  • class 1: 11, 0, 0,
  • class 2: 0, 11, 0,
  • class 3: 0, 0, 2,

It depends on what your test dataset looks like. I assume your test dataset only contains 3 pieces of data?

wil70 commented 2 years ago

Thanks @LittleLittleCloud ,

a) Yes, I should have used the same number of inputs for each label, or close to it. Good explanation of training vs. testing datasets, TY!

b) For the confusion matrix, let's pretend the dataset is the following (the first row is the header; there are eleven a-1, eleven a+1 and two a+0):

        c10,c11
        -1,a-1
        1,a+1
        -1,a-1
        1,a+1
        0,a+0
        1,a+1
        1,a+1
        -1,a-1
        1,a+1
        1,a+1
        -1,a-1
        -1,a-1
        -1,a-1
        1,a+1
        -1,a-1
        -1,a-1
        -1,a-1
        0,a+0
        1,a+1
        1,a+1
        1,a+1
        1,a+1
        -1,a-1
        -1,a-1

Should the confusion matrix look like this if there is a perfect solution?

        class 1 'a+1': 11, 0, 0,
        class 2 'a-1': 0, 11, 0,
        class 3 'a+0': 0, 0, 2,

I guess I'm not understanding the 3 in

        - class 1: 0, 0, 0,
        - class 2: 0, 3, 0,
        - class 3: 0, 0, 0,

thanks

Wil

LittleLittleCloud commented 2 years ago

@wil70

According to the per-class precision matrix you shared, it looks like the dataset used to calculate that matrix has 0 class-1, 3 class-2, and 0 class-3 rows, which is 3 pieces of data in total. So did you split your dataset into train and test, use the train split to train a model, and calculate the per-class precision matrix on the test dataset?

wil70 commented 2 years ago

Thanks @LittleLittleCloud

Yeah, so this is incorrect, isn't it? We know there are eleven a+1, eleven a-1, and two a+0, unless, as you said, there is a default split (training vs. testing).

I'm new to ML.NET. I'm trying to evaluate whether ML.NET will work for me with huge datasets later, so I wrote this tiny piece of code. Please set the "file" path to the dataset mentioned above (with the header c10,c11).

I tried all the MulticlassClassificationTrainer values available for trainerID, but I always get the confusion matrix with the 3. I don't know how to split the data from the C# API yet; maybe it has a default setup?

                var mlContext = new MLContext();

                IDataView trainingData = mlContext.Data.LoadFromTextFile<ModelInput2>(
                    file,
                    separatorChar: ',', hasHeader: true, trimWhitespace: true);

                var cts = new CancellationTokenSource();
                var experimentSettings = new MulticlassExperimentSettings();
                experimentSettings.MaxExperimentTimeInSeconds = 9 * 3600; // 1800;// 37800;// 120;// 3600;
                experimentSettings.CancellationToken = cts.Token;
                experimentSettings.CacheBeforeTrainer = CacheBeforeTrainer.Auto;
                experimentSettings.Trainers.Clear();
                experimentSettings.Trainers.Add(trainerID);
                experimentSettings.CacheDirectoryName = null;

                Console.WriteLine("Processing " + trainerID.ToString());

                MulticlassClassificationExperiment experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(experimentSettings);
                ExperimentResult<MulticlassClassificationMetrics> experimentResult = experiment.Execute(trainingData, "Action");

                if (experimentResult != null && experimentResult.BestRun != null)
                {
                    MulticlassClassificationMetrics metrics = experimentResult.BestRun.ValidationMetrics;
                    Console.WriteLine($"BestRun TrainerName:        {experimentResult.BestRun.TrainerName}");
                    Console.WriteLine($"    - MicroAccuracy:        {metrics.MicroAccuracy}");
                    Console.WriteLine($"    - MacroAccuracy:        {metrics.MacroAccuracy}");
                    Console.WriteLine($"    - LogLoss:              {metrics.LogLoss}");
                    Console.WriteLine($"    - CLogLossReduction:    {metrics.LogLossReduction}");

                    Console.WriteLine($"\nClass log loss:");
                    int i = 1;
                    foreach (double d in metrics.PerClassLogLoss)
                    {
                        Console.WriteLine($"    - class {i}: {metrics.PerClassLogLoss[i - 1]}");
                        i++;
                    }

                    i = 1;
                    Console.WriteLine($"\nConfusionMatrix.PerClassPrecision:");
                    foreach (double d in metrics.ConfusionMatrix.PerClassPrecision)
                    {
                        Console.WriteLine($"class {i}: {d}");
                        i++;
                    }

                    i = 1;
                    Console.WriteLine($"\nConfusionMatrix.PerClassPrecision:");
                    foreach (IReadOnlyList<double> d1 in metrics.ConfusionMatrix.Counts)
                    {
                        Console.Write($"    - class {i}: ");
                        foreach (double d2 in d1)
                        {
                            Console.Write($"{d2}, ");
                        }
                        i++;
                        Console.WriteLine();
                    }
                }

And here is the class that models the input data:

        public class ModelInput2
        {
            //[LoadColumn(0), NoColumn]
            //public float _459 { get; set; }
            //[LoadColumn(2, 9), NoColumn]
            //public float _data { get; set; }

            [LoadColumn(0)] // c10
            public float _460 { get; set; }

            [LoadColumn(1)]//, ColumnName("c11")]
            public string Action { get; set; }
        }

Thanks for your help

Wil

LittleLittleCloud commented 2 years ago

The matrix from MulticlassExperiment is evaluated on the validation dataset. In your case, since you are running cross-validation (that's the default setting for a small dataset), 10% of the entire training dataset is held out as the validation dataset, which is 24 * 0.1 ~ 3 pieces of data.
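(For reference, a minimal sketch of taking control of the split yourself; this is not from the original thread. It assumes the ModelInput2 class, the file path, and the experimentSettings from the snippet above, and uses the Execute overload that accepts a separate validation set, if your AutoML version provides it.)

        // Sketch: split the data explicitly instead of relying on the default
        // ~10% holdout / cross-validation behaviour on small datasets.
        var mlContext = new MLContext(seed: 0);

        IDataView allData = mlContext.Data.LoadFromTextFile<ModelInput2>(
            file, separatorChar: ',', hasHeader: true, trimWhitespace: true);

        // Hold out 20% of the rows for validation.
        var split = mlContext.Data.TrainTestSplit(allData, testFraction: 0.2);

        var experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(experimentSettings);

        // Metrics (including the confusion matrix) are then computed on split.TestSet.
        var result = experiment.Execute(split.TrainSet, split.TestSet, "Action");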

wil70 commented 2 years ago

Super, TY, that explains why. Is there a way to get the confusion matrix for the training dataset vs. the validation dataset?

LittleLittleCloud commented 2 years ago

Well, you can always re-evaluate your model with another dataset:

// Given an IDataView trainData and an ITransformer model:
var eval = model.Transform(trainData);
var metric = mlContext.MulticlassClassification.Evaluate(eval, labelColumnName: "c11"); // "c11" is the label column from your dataset

// metric.ConfusionMatrix

LittleLittleCloud commented 2 years ago

Cool, I'm going to close this issue since it seems that the question has been resolved. Feel free to ping me if you have any other questions.

wil70 commented 2 years ago

yes we can close it - TY!

wil70 commented 2 years ago

Hi @LittleLittleCloud

To check my understanding:

1) I added an extra c09 column in the Excel file that is a duplicate of the c10 column.

My goal is to ignore this new column "c09" in the code during training and consumption. I thought using the NoColumn attribute would do the trick, but this is hard to verify.

        public class ModelInput
        {
            [LoadColumn(0), ColumnName(@"c09"), NoColumn]
            public float C09 { get; set; }

            [LoadColumn(1), ColumnName(@"c10")]
            public float C10 { get; set; }

            [LoadColumn(2), ColumnName(@"c11")]
            public string C11 { get; set; }
        }

I thought that running an AutoML training with the wizard and specifically marking the column as Hidden would show me whether NoColumn is the right way or not. I tried not having "LoadColumn(...)", but that triggers an error as soon as I use IDataView testData = mlContext.Data.LoadFromTextFile(....)

[image: Model Builder wizard screenshot]

But the MLModel1.consumption.cs file generated by the AutoML wizard contains this:

        /// <summary>
        /// model input class for MLModel1.
        /// </summary>
        #region model input class
        public class ModelInput
        {
            [ColumnName(@"c09")] **// Note: the NoColumn is not there in the code automatically generated?**
            public float C09 { get; set; }

            [ColumnName(@"c10")]
            public float C10 { get; set; }

            [ColumnName(@"c11")]
            public string C11 { get; set; }

        }

        #endregion

        /// <summary>
        /// model output class for MLModel1.
        /// </summary>
        #region model output class
        public class ModelOutput // Note: Could ModelOutput inherit from ModelInput instead of repeating the fields? I'm asking because in some cases you might have thousands of fields...
        {
            [ColumnName(@"c09")] // Note: NoColumn is not present in the automatically generated code?
            public float C09 { get; set; }

            [ColumnName(@"c10")]
            public float C10 { get; set; }

            [ColumnName(@"c11")]
            public uint C11 { get; set; }

            [ColumnName(@"Features")]
            public float[] Features { get; set; }

            [ColumnName(@"PredictedLabel")]
            public string PredictedLabel { get; set; }

            [ColumnName(@"Score")]
            public float[] Score { get; set; }

        }

The automatically generated code doesn't seem to ignore c09 either:


       private static string MLNetModelPath = Path.GetFullPath("MLModel1.zip");

        public static readonly Lazy<PredictionEngine<ModelInput, ModelOutput>> PredictEngine = new Lazy<PredictionEngine<ModelInput, ModelOutput>>(() => CreatePredictEngine(), true);

        /// <summary>
        /// Use this method to predict on <see cref="ModelInput"/>.
        /// </summary>
        /// <param name="input">model input.</param>
        /// <returns><seealso cref=" ModelOutput"/></returns>
        public static ModelOutput Predict(ModelInput input)
        {
            var predEngine = PredictEngine.Value;
            return predEngine.Predict(input);
        }

        private static PredictionEngine<ModelInput, ModelOutput> CreatePredictEngine()
        {
            var mlContext = new MLContext();
            ITransformer mlModel = mlContext.Model.Load(MLNetModelPath, out var _);
            return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
        }

I guess I can always write code that adds an mlContext.Transforms step, but a NoColumn or Hidden attribute seems like a very good shortcut for thousands of columns.
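(For reference, a minimal sketch of that Transforms route, not from Model Builder's generated code. It assumes a hypothetical dataPath variable and the 3-column ModelInput above; DropColumns simply removes c09 from the view the rest of the pipeline sees.)

        // Sketch: load all declared columns, then drop c09 as an explicit pipeline step.
        var mlContext = new MLContext();

        IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(
            dataPath, separatorChar: ',', hasHeader: true, trimWhitespace: true);

        // After this transform, downstream transforms and the trainer never see c09.
        ITransformer dropC09 = mlContext.Transforms.DropColumns("c09").Fit(data);
        IDataView withoutC09 = dropC09.Transform(data);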

2) I'm trying to use the code you sent me above, mlContext.MulticlassClassification.Evaluate(...).

I keep getting the exception "System.ArgumentOutOfRangeException: 'Score column 'Score' not found Parameter name: schema'" and I'm not able to figure out how to make it work:


            Stopwatch stopw = new Stopwatch();
            stopw.Start();

            try
            {
                var mlContext = new MLContext();

                IDataView testData = mlContext.Data.LoadFromTextFile<MLModel1.ModelInput>("S:\\CATS\\files\\data_analysis\\output\\AggregatedFile\\small_2.csv", separatorChar: ',', hasHeader: true, trimWhitespace: true);
                DataView trainData = new DataView();
                ITransformer mlModel = mlContext.Model.Load(Path.GetFullPath(@"G:\Users\Wilhelm\dev\MachineLearning\ML1\MLModel1.zip"), out var _);
                //return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
                //var eval = mlModel.Transform(testData);
                MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(testData, "c11");
                Console.WriteLine(metric.ConfusionMatrix.GetFormattedConfusionTable()); //PrintConfusionMatrix("LightGbm", metric);
            }
            catch (Exception e)
            {
                Console.Out.WriteLine(e.Message);
                if (e.InnerException != null) Console.Out.WriteLine(e.InnerException.Message);
                if (e.StackTrace != null) Console.Out.WriteLine(e.StackTrace);
            }
            finally
            {
                stopw.Stop();
                Console.Out.WriteLine("\nDuration: " + stopw.Elapsed);
            }

Exception:

System.ArgumentOutOfRangeException
  HResult=0x80131502
  Message=Score column 'Score' not found
Parameter name: schema
  Source=Microsoft.ML.Core
  StackTrace:
   at Microsoft.ML.Data.RoleMappedSchema.MapFromNames(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.RoleMappedData..ctor(IDataView data, Boolean opt, KeyValuePair`2[] roles)
   at Microsoft.ML.Data.MulticlassClassificationEvaluator.Evaluate(IDataView data, String label, String score, String predictedLabel)
   at ML1.Program.Main(String[] args) in G:\Users\Wilhelm\dev\MachineLearning\ML1\Program.cs:line 35

  This exception was originally thrown at this call stack:
    [External Code]
    ML1.Program.Main(string[]) in Program.cs

Thanks a lot for your help

Wil

LittleLittleCloud commented 2 years ago

How to ignore a column

You just don't mark that column with LoadColumn, and that should be it. If you don't want column c09, just don't put the LoadColumn attribute on it.
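For example, a minimal sketch (assuming the 3-column file from your example: c09 sits at index 0 and simply is not declared, which should also avoid the error you mention):

        public class ModelInput
        {
            // c09 (file column 0) has no property at all, so the loader skips it.
            [LoadColumn(1), ColumnName(@"c10")]
            public float C10 { get; set; }

            [LoadColumn(2), ColumnName(@"c11")]
            public string C11 { get; set; }
        }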

I tried not having "LoadColumn(...)", but that triggers an error as soon as I use IDataView testData = mlContext.Data.LoadFromTextFile(....)

What error do you have?

System.ArgumentOutOfRangeException HResult=0x80131502 Message=Score column 'Score' not found

The error basically says it can't find Score in testData, which is true, right? You need to pass the evaluation result eval to the Evaluate API.

wil70 commented 2 years ago

Thanks @LittleLittleCloud

1) Thanks! 2) The problem is that you need ModelInput (defined here: https://github.com/dotnet/machinelearning/issues/6309#issuecomment-1237400143) for reading the input data from the csv file (note: Features, PredictedLabel and Score are not in this class), but then you need those 3 fields to evaluate, so I have ModelOutput with those (ModelOutput inherits from ModelInput but adds those 3 columns: Features, PredictedLabel and Score) to evaluate... I could create a new input file with the 3 extra fields filled with "empty or default" values, but imagine that for a file of 330 GB or 2 TB...

Basically, how do I feed ModelOutput to "mlContext.MulticlassClassification.Evaluate(testData, "c11");" knowing that testData has been created with ModelInput?

Note: I believe I tried adding those 3 fields to ModelInput without marking them with LoadColumn, but if I recall correctly, it failed.

Thanks for your help

Wil

LittleLittleCloud commented 2 years ago

hi @wil70

I might not have expressed myself clearly on topic 2. According to your response, you are using the following code to evaluate the model:

                var mlContext = new MLContext();

                IDataView testData = mlContext.Data.LoadFromTextFile<MLModel1.ModelInput>("S:\\CATS\\files\\data_analysis\\output\\AggregatedFile\\small_2.csv", separatorChar: ',', hasHeader: true, trimWhitespace: true);
                DataView trainData = new DataView();
                ITransformer mlModel = mlContext.Model.Load(Path.GetFullPath(@"G:\Users\Wilhelm\dev\MachineLearning\ML1\MLModel1.zip"), out var _);
                //return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
                //var eval = mlModel.Transform(testData);
                MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(testData, "c11");
                Console.WriteLine(metric.ConfusionMatrix.GetFormattedConfusionTable()); //PrintConfusionMatrix("LightGbm", metric);

which throws an ArgumentOutOfRangeException.

The cause is that in these two lines

                //var eval = mlModel.Transform(testData);
                MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(testData, "c11");

you are using testData instead of eval to evaluate your result. You need to first get eval by using your model to transform testData, and then compute metrics on eval with the Evaluate API. So the correct code should be:

                var eval = mlModel.Transform(testData);
                MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(eval, "c11");

The Score and PredictedLabel columns will be added by mlModel during the Transform call.
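Putting the pieces together, the corrected flow end-to-end looks like this as a minimal sketch (same paths, class and column names as quoted above):

    using System;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();

    // Load the test data with the same input class used at training time.
    IDataView testData = mlContext.Data.LoadFromTextFile<MLModel1.ModelInput>(
        @"S:\CATS\files\data_analysis\output\AggregatedFile\small_2.csv",
        separatorChar: ',', hasHeader: true, trimWhitespace: true);

    // Load the trained model.
    ITransformer mlModel = mlContext.Model.Load(
        @"G:\Users\Wilhelm\dev\MachineLearning\ML1\MLModel1.zip", out _);

    // Transform first: this is the step that appends Score and PredictedLabel.
    IDataView eval = mlModel.Transform(testData);

    // Evaluate on the transformed view, not on the raw test data.
    MulticlassClassificationMetrics metric =
        mlContext.MulticlassClassification.Evaluate(eval, labelColumnName: "c11");

    Console.WriteLine(metric.ConfusionMatrix.GetFormattedConfusionTable());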

wil70 commented 2 years ago

Super - TY!

I see, so only the IDataView produced by Transform will have the extra columns added dynamically, whereas the one from LoadFromTextFile won't. Super, TY @LittleLittleCloud
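To see this concretely, printing the transformed view's schema shows the appended columns (a small sketch reusing mlModel and testData from the earlier snippet):

    // The loaded file only has c10 and c11; after Transform the schema also
    // contains the columns the model appended (Features, Score, PredictedLabel, ...).
    IDataView eval = mlModel.Transform(testData);
    foreach (DataViewSchema.Column column in eval.Schema)
    {
        Console.WriteLine($"{column.Name} : {column.Type}");
    }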

wil70 commented 2 years ago

3) Retrain existing classifier model

I'm trying to start from a saved model so I can analyze the results and add more training time to that model if needed. I dug (and will dig more over the weekend) into Fit with something like this:

                var eval = mlModel.Transform(traindata); 
                MulticlassPredictionTransformer<OneVersusAllModelParameters> transformer = mlContext.MulticlassClassification.Trainers.LightGbm(LABEL).Fit(eval);

It runs and I'm guessing it is retraining (with the Fit above), but I'm not able to reuse the newly retrained model that was saved (from transformer above). I'm doing this:

mlContext.Model.Save(**transformer**, trainingData.Schema, "c:\\model_LightGbmMulti.zip");

I think it doesn't save the right retrained model. The schema of the saved retrained model should be correct, since it should be the same as the schema of the initial model we started retraining from. In other words, the pre-retraining schema and the post-retraining schema should be the same; only the model itself should differ.

But when I try to load the newly retrained model, I get an exception:

               var mlContext = new MLContext();

                IDataView testData = mlContext.Data.LoadFromTextFile<ModelInput>(file, separatorChar: ',', hasHeader: true, trimWhitespace: true);
                ITransformer mlModel = mlContext.Model.Load(MLNetModelPath, out var _);
                var eval = mlModel.Transform(testData); 
                MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(eval, LABEL);
                Console.WriteLine(metric.ConfusionMatrix.GetFormattedConfusionTable());

It gives me this exception, and I think it is not a schema issue (the schemas should be the same), but rather an issue with which retrained model was saved?

Features column '**Feature**' not found (Parameter '**schema**')
   at Microsoft.ML.Data.RoleMappedSchema.MapFromNames(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.RoleMappedSchema..ctor(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.PredictedLabelScorerBase.BindingsImpl.ApplyToSchema(DataViewSchema input, ISchemaBindableMapper bindable, IHostEnvironment env)
   at Microsoft.ML.Data.PredictedLabelScorerBase..ctor(IHostEnvironment env, PredictedLabelScorerBase transform, IDataView newSource, String registrationName)
   at Microsoft.ML.Data.MulticlassClassificationScorer..ctor(IHostEnvironment env, MulticlassClassificationScorer transform, IDataView newSource)
   at Microsoft.ML.Data.MulticlassClassificationScorer.ApplyToDataCore(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.RowToRowScorerBase.ApplyToData(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.PredictionTransformerBase`1.Transform(IDataView input)
   at ML1.Program.TestModel(String file) in G:\Users\Wilhelm\dev\MachineLearning\ML2\Program.cs:line 188

I can always do the following and it works, but it is way too cumbersome when you have thousands of columns. There must be a simpler way than the pipeline below to save the retrained model?

  var pipeline = mlContext.Transforms.ReplaceMissingValues(@"c10", @"c10")      
                                    .Append(mlContext.Transforms.Concatenate(@"Features", new []{@"c10"}))      
                                    .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName:@"c11",inputColumnName:@"c11"))      
                                    .Append(mlContext.Transforms.NormalizeMinMax(@"Features", @"Features"))      
                                    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(new SdcaMaximumEntropyMulticlassTrainer.Options(){L1Regularization=1F,L2Regularization=0.1F,LabelColumnName=@"c11",FeatureColumnName=@"Features"}))      
                                    .Append(mlContext.Transforms.Conversion.MapKeyToValue(outputColumnName:@"PredictedLabel",inputColumnName:@"PredictedLabel"));
var model = pipeline.Fit(trainData);

mlContext.Model.Save(**model**, trainingData.Schema, "c:\\model_LightGbmMulti.zip"); 

Note: I read different articles, like https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/retrain-model-ml-net, but I'm not able to figure it out.
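In case it helps while digging: one possible way to reuse a retrained trainer without rebuilding the whole pipeline by hand is to keep the original model (which still carries the Features/label preprocessing) and the freshly fitted trainer as two separate transformers, save the trainer on its own, and apply them in sequence at evaluation time. This is only a sketch of that idea, not the documented retraining API; it assumes the label column is c11, it fits a fresh LightGbm rather than warm-starting from the old weights (just like the Fit call above), and the file paths are placeholders.

    using System;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();

    // The original model still contains the preprocessing that builds "Features"
    // and maps the "c11" label to a key, so we reuse it as the data-prep stage.
    ITransformer originalModel = mlContext.Model.Load(@"MLModel1.zip", out _);

    // Retrain: push the training data through the original model, then fit a new trainer.
    // (This also computes the old model's predictions; harmless for training, just extra work.)
    IDataView trainData = mlContext.Data.LoadFromTextFile<ModelInput>(
        @"train.csv", separatorChar: ',', hasHeader: true, trimWhitespace: true);
    IDataView preppedTrain = originalModel.Transform(trainData);
    ITransformer retrainedTrainer = mlContext.MulticlassClassification.Trainers
        .LightGbm(labelColumnName: "c11")
        .Fit(preppedTrain);

    // Save only the retrained trainer; its input schema is that of the prepped data.
    mlContext.Model.Save(retrainedTrainer, preppedTrain.Schema, @"retrained_trainer.zip");

    // Later: apply the original model first (to recreate "Features"), then the
    // retrained trainer, and evaluate the result. The trainer's new Score and
    // PredictedLabel columns shadow the ones produced by the original model.
    IDataView testData = mlContext.Data.LoadFromTextFile<ModelInput>(
        @"test.csv", separatorChar: ',', hasHeader: true, trimWhitespace: true);
    ITransformer loadedTrainer = mlContext.Model.Load(@"retrained_trainer.zip", out _);
    IDataView scored = loadedTrainer.Transform(originalModel.Transform(testData));
    MulticlassClassificationMetrics metrics =
        mlContext.MulticlassClassification.Evaluate(scored, labelColumnName: "c11");
    Console.WriteLine(metrics.ConfusionMatrix.GetFormattedConfusionTable());

The trade-off of this approach is that you end up shipping two zips (the original model plus the retrained trainer) instead of one combined model file.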

4) Callback to save and evaluate the model during training

Is it possible to save the model every x minutes/hours/iterations so I can evaluate it... some kind of callback? I need to dig into ((Microsoft.ML.Data.IInternalCatalog)mlContext.MulticlassClassification.Trainers).Environment.ProgressTracker

Later, ideally, I would like to chart the training results and the test-data-set results through time / across iterations.

Please let me know if you know a book or some good articles to guide me.

Thank you!

Wil cc: @LittleLittleCloud

wil70 commented 2 years ago

5) Single precision number

Hello, most of my code uses double. I'm making sure the data loads as Single, since ML.NET supports Single and not double, from what I understand.

How can I know which ML.NET algorithms can handle positive or negative zero, PositiveInfinity, NegativeInfinity, and not-a-number (NaN)?

Those values have semantic significance, so it might be interesting to keep them for the ML.NET algorithms that can handle them; for the algorithms that cannot handle them, I will transform the data somehow.
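For the "transform them somehow" case, here is a small hedged sketch of one way to neutralize such values before training: ReplaceMissingValues handles NaN, and a CustomMapping can clamp the infinities. The column name c10 is just reused from the thread, and the replacement values are arbitrary choices.

    using Microsoft.ML;
    using Microsoft.ML.Transforms;

    var mlContext = new MLContext();

    // Clamp +/-Infinity to 0 (an arbitrary choice) with a custom mapping.
    // Note: a lambda-based CustomMapping cannot be persisted in a saved model zip
    // without a CustomMappingFactory; it is shown here only as a preprocessing idea.
    var clampInfinities = mlContext.Transforms.CustomMapping<RawRow, CleanRow>(
        (src, dst) => dst.c10 = float.IsInfinity(src.c10) ? 0f : src.c10,
        contractName: null);

    // Replace NaN with the column mean before handing the data to a trainer.
    var pipeline = clampInfinities.Append(
        mlContext.Transforms.ReplaceMissingValues(
            outputColumnName: "c10",
            replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean));

    // Row shapes for the custom mapping (illustrative; only c10 is touched).
    public class RawRow   { public float c10 { get; set; } }
    public class CleanRow { public float c10 { get; set; } }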

Thanks