[AutoML] Allow using a serialized IDataView as input

sergey-tihon commented 5 years ago

ML.NET supports at least two IDataView serialization formats out of the box: text and binary files.

So I can use either one to prepare my data set for AutoML:

using (var stream = File.Create(textFileName))
    mlContext.Data.SaveAsText(data, stream);

using (var stream = File.Create(binFileName))
    mlContext.Data.SaveAsBinary(data, stream);

But when I try to use a serialized file as input for AutoML (both the CLI and GUI versions), it is unable to parse the file.

Binary format

Using the binary format

mlnet auto-train --task binary-classification --dataset "data-bin.idv" --label-column-name IsCS --cache on --max-exploration-time 60 --verbosity diag

I see the following error:

Inferring Columns ...
An Error occured during inferring columns
Unable to split the file provided into multiple, consistent columns.
Microsoft.ML.AutoML.InferenceException: Unable to split the file provided into multiple, consistent columns.
   at Microsoft.ML.AutoML.ColumnInferenceApi.InferSplit(MLContext context, TextFileSample sample, Nullable`1 separatorChar, Nullable`1 allowQuotedStrings, Nullable`1 supportSparse)
   at Microsoft.ML.AutoML.ColumnInferenceApi.InferColumns(MLContext context, String path, ColumnInformation columnInfo, Nullable`1 separatorChar, Nullable`1 allowQuotedStrings, Nullable`1 supportSparse, Boolean trimWhitespace, Boolean groupColumns)
   at Microsoft.ML.CLI.CodeGenerator.AutoMLEngine.InferColumns(MLContext context, ColumnInformation columnInformation)
   at Microsoft.ML.CLI.CodeGenerator.CodeGenerationHelper.GenerateCode()
   at Microsoft.ML.CLI.Program.<>c__DisplayClass1_0.<Main>b__0(NewCommandSettings options)
Please see the log file for more info.
Exiting ...

Text format

With --verbosity diag it gets stuck on the line

Inferring Columns ...
Creating Data loader ...
Loading data ...
Exploring multiple ML algorithms and settings to find you the best model for ML task: binary-classification
For further learning check: https://aka.ms/mlnet-cli
|     Trainer                              Accuracy      AUC    AUPRC  F1-score  Duration #Iteration             |
[Source=AutoML, Kind=Trace] Channel started

with default verbosity

mlnet auto-train --task binary-classification --dataset "data-txt.tsv" --label-column-name IsCS --cache on --max-exploration-time 60

it returns a type mismatch error:

Exploring multiple ML algorithms and settings to find you the best model for ML task: binary-classification
For further learning check: https://aka.ms/mlnet-cli
──────────────────────────
Waiting for the first iteration to complete ...                                                                                                                                       00:00:00
Exception occured while exploring pipelines:
Provided label column 'IsCS' was of type Single, but only type Boolean is allowed.
Please see the log file for more info.

but the data file looks correct (it was serialized by ML.NET). This is the header and the first lines of the dataset:

#@ TextLoader{
#@   header+
#@   sep=tab
#@   col=IsCS:BL:0
#@   col=Features:R4:1-19
#@ }
IsCS    19  0:""
0   2   0.259255171 0   0   0   1.41421354  0   1.41421354  0   1.41421354  0   1.41421354  0   3   6   0   0   1   1192
0   6   0.259255171 0   0   0   1.41421354  0   1.41421354  0   1.41421354  0   1.41421354  0   3   6   0   0   1   1192

srsaggam commented 5 years ago

@justinormont @vinodshanbhag @CESARDELATORRE

justinormont commented 5 years ago

@sergey-tihon: Currently the CLI tool will only take in the text format. We are planning support for the IDV/TDV binary format.

@daholste / @vinodshanbhag: It's interesting that we are recognizing the label column "IsCS" as type Single. Do we make use of the ML.NET file header (the #@ rows)? If not, how will we handle the, now, partial header row IsCS 19 0:""?

vinodshanbhag commented 5 years ago

@sergey-tihon I suspect you have a missing value or a number that is neither 1 nor 0 in your IsCS column. Can you please confirm or deny that? BTW, how did you create the original IDataView data that you eventually saved as text and binary?

sergey-tihon commented 5 years ago

I suspect you have a missing value or a number that is neither 1 nor 0 in your IsCS column. Can you please confirm or deny that?

@vinodshanbhag No, there are no missing values. I can share the file if it helps: data-txt.txt. I then run it with mlnet auto-train --task binary-classification --dataset "data-txt.txt" --label-column-name IsCS --cache on --max-exploration-time 60

BTW, how did you create the original IDataView data that you eventually saved as text and binary?

Here is the code

public class MyInput
{
    [ColumnName("IsCS"), LoadColumn(19)]
    public bool IsCs { get; set; }

    [LoadColumn(0, 18), VectorType(19)]
    public float[] Features { get; set; }
}

static void SaveDataSet()
{
    var mlContext = new MLContext();
    var dataView = mlContext.Data.LoadFromTextFile<MyInput>(path: DataFilePath, hasHeader: true, separatorChar: '\t');

    var file = DataFilePath.Replace("raw.txt", "data-txt.txt");
    using (var stream = System.IO.File.Create(file))
        mlContext.Data.SaveAsText(dataView, stream);
}

vinodshanbhag commented 5 years ago

@sergey-tihon Some of the rows start with the value 20 in the first column. That must be causing it, but I am puzzled how that can happen when you save the file using the ML.NET saver.

Would it be possible to send your original file?

sergey-tihon commented 5 years ago

@vinodshanbhag No, but I can share a repro sample.

Raw input

A1  A2  A3  A4  A5  A6  A7  A8  A9  A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 IsCS
0   0.02615092  0   0   0   1.41421356  0.00000000  1.41421356  0.00000000  1.41421356  0.00000000  1.41421356  0.00000000  4   7   0.00000000  0   0   1535    0
0   0.02615092  0   0   0   1.41421356  0.00000000  1.41421356  0.00000000  1.41421356  0.00000000  1.41421356  0.00000000  4   7   0.00000000  0   0   1535    1

These are two identical rows that differ only in the last column, IsCS.

Result of `SaveAsText`

#@ TextLoader{
#@   header+
#@   sep=tab
#@   col=IsCS:BL:0
#@   col=Features:R4:1-19
#@ }
IsCS    19  0:""
20  2:0.02615092    6:1.41421354    8:1.41421354    10:1.41421354   12:1.41421354   14:4    15:7    19:1535
1   0   0.02615092  0   0   0   1.41421354  0   1.41421354  0   1.41421354  0   1.41421354  0   4   7   0   0   0   1535

To me it looks like ML.NET decided that the first row contains too many zeros and would be better serialized as a sparse vector. So 20 is the total number of slots in the row (the label plus 19 features), followed by the list of non-zero slots: slot 2 with value 0.02615092, slot 6 with value 1.41421354, and so on.
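
The sparse row layout described above can be expanded back into a dense row with a few lines of plain C#. This is only a minimal sketch of that reading of the format, not ML.NET's actual parsing logic. The class name `SparseRowSketch`, the method `ToDense`, and the detection rule (first field is an integer length, every other field is `index:value`) are assumptions made for illustration, and omitted slots are assumed to be 0.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of ML.NET.
public static class SparseRowSketch
{
    // Expands a line such as "20<TAB>2:0.02615092<TAB>19:1535" into 20 fields.
    // Lines that do not match the assumed sparse layout are returned split but unchanged.
    public static string[] ToDense(string line, char separator = '\t')
    {
        string[] fields = line.Split(separator);

        // Assumption: a sparse row starts with the total slot count.
        if (fields.Length < 2 ||
            !int.TryParse(fields[0], NumberStyles.None, CultureInfo.InvariantCulture, out int length) ||
            length <= 0)
        {
            return fields;
        }

        // Start with every slot set to "0" (the assumed default for omitted slots).
        var dense = new string[length];
        for (int i = 0; i < length; i++)
        {
            dense[i] = "0";
        }

        // Assumption: every remaining field has the form "index:value", zero-based.
        for (int i = 1; i < fields.Length; i++)
        {
            int colon = fields[i].IndexOf(':');
            if (colon <= 0 ||
                !int.TryParse(fields[i].Substring(0, colon), NumberStyles.None,
                              CultureInfo.InvariantCulture, out int index) ||
                index >= length)
            {
                return fields; // Not a sparse row after all, so leave it untouched.
            }
            dense[index] = fields[i].Substring(colon + 1);
        }

        return dense;
    }
}
```

Applied to the first data row of the `SaveAsText` output above, this yields 20 fields with "0" in slot 0 (the IsCS label) and "1535" in slot 19, which matches the first raw input row. If I remember the API correctly, `SaveAsText` also accepts a `forceDense` argument that should avoid sparse rows at save time, but I have not verified that against the CLI.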

sergey-tihon commented 5 years ago

One more benefit of implementing this: improved column type inference! I hope that AutoML can use the header with column type information from the files.

Just today I added two more text columns to my data and expected AutoML to recognize 19 float and 2 text columns, but instead AutoML treats all columns as strings and tries to apply OneHotEncoding and OneHotHashEncoding to float numbers 🙈

justinormont commented 5 years ago

@sergey-tihon: Yes, certainly file additional issues as you run across bugs or ways to improve the product.
