Open sergey-tihon opened 5 years ago
@justinormont @vinodshanbhag @CESARDELATORRE
@sergey-tihon: Currently the CLI tool will only take in the text format. We are planning support for the IDV/TDV binary format.
@daholste / @vinodshanbhag: It's interesting that we are recognizing the label column "IsCS" as type Single. Do we make use of the ML.NET file header (the #@ rows)? If not, how will we handle the now-partial header row IsCS 19 0:""?
@sergey-tihon I suspect you have a missing value or a number that is neither 1 nor 0 in your IsCS column. Can you please confirm or deny that? BTW, how did you create the original IDataView data that you eventually saved as text and binary?
I suspect you have a missing value or a number that is neither 1 nor 0 in your IsCS column. Can you please confirm or deny that?
@vinodshanbhag No, there are no missing values. I can share the file if it helps: data-txt.txt
Then I run it with: mlnet auto-train --task binary-classification --dataset "data-txt.txt" --label-column-name IsCS --cache on --max-exploration-time 60
BTW, how did you create the original IDataView data that you eventually saved as text and binary?
Here is the code
public class MyInput
{
    [ColumnName("IsCS"), LoadColumn(19)]
    public bool IsCs { get; set; }

    [LoadColumn(0, 18), VectorType(19)]
    public float[] Features { get; set; }
}

static void SaveDataSet()
{
    var mlContext = new MLContext();
    var dataView = mlContext.Data.LoadFromTextFile<MyInput>(path: DataFilePath, hasHeader: true, separatorChar: '\t');
    var file = DataFilePath.Replace("raw.txt", "data-txt.txt");
    using (var stream = System.IO.File.Create(file))
        mlContext.Data.SaveAsText(dataView, stream);
}
@sergey-tihon Some of the rows start with value 20 for first column. That must be causing it... but I am puzzled how that can happen when you save the file using mlnet saver.
Would it be possible to send your original file?
@vinodshanbhag No, but I can share a repro sample:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 IsCS
0 0.02615092 0 0 0 1.41421356 0.00000000 1.41421356 0.00000000 1.41421356 0.00000000 1.41421356 0.00000000 4 7 0.00000000 0 0 1535 0
0 0.02615092 0 0 0 1.41421356 0.00000000 1.41421356 0.00000000 1.41421356 0.00000000 1.41421356 0.00000000 4 7 0.00000000 0 0 1535 1
These are two identical rows that differ only in the last column, IsCS.
SaveAsText
#@ TextLoader{
#@ header+
#@ sep=tab
#@ col=IsCS:BL:0
#@ col=Features:R4:1-19
#@ }
IsCS 19 0:""
20 2:0.02615092 6:1.41421354 8:1.41421354 10:1.41421354 12:1.41421354 14:4 15:7 19:1535
1 0 0.02615092 0 0 0 1.41421354 0 1.41421354 0 1.41421354 0 1.41421354 0 4 7 0 0 0 1535
To me it looks like ML.NET decided that the first row contains too many zeros and would be better serialized as a sparse vector. So 20 is the length of the vector, followed by the list of non-zero columns: column 2 with value 0.02615092, column 6 with value 1.41421354, and so on.
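To make the row layout described above concrete (a total length followed by index:value pairs), here is a minimal sketch of expanding such a row back into a full array of values. It is not ML.NET's own parser: the SparseRow class and ToDense method are hypothetical names used only for illustration, every value (including the 0/1 label) is read as a float for simplicity, and a row is assumed to be sparse whenever its second field contains a colon.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class SparseRow
{
    // Expands one separated line into a full array of values.
    // Assumption: a row is sparse when its second field looks like "index:value";
    // in that case the first field is the total number of columns.
    public static float[] ToDense(string line, char separator = '\t')
    {
        string[] fields = line.Split(separator);
        bool isSparse = fields.Length > 1 && fields[1].Contains(":");

        if (!isSparse)
        {
            // Dense row: every field is a value, in column order.
            var dense = new float[fields.Length];
            for (int i = 0; i < fields.Length; i++)
            {
                dense[i] = float.Parse(fields[i], CultureInfo.InvariantCulture);
            }
            return dense;
        }

        // Sparse row: columns that are not listed keep the default value 0.
        int length = int.Parse(fields[0], CultureInfo.InvariantCulture);
        var result = new float[length];
        for (int i = 1; i < fields.Length; i++)
        {
            string[] pair = fields[i].Split(':');
            int index = int.Parse(pair[0], CultureInfo.InvariantCulture);
            result[index] = float.Parse(pair[1], CultureInfo.InvariantCulture);
        }
        return result;
    }
}
```

Calling SparseRow.ToDense on the tab-separated row that starts with 20 would yield 20 values with index 0 (IsCS) left at 0, which is why a reader that does not expect this layout may see 20 where it expects a 0/1 label.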
One more benefit of implementing this would be better column types. I hope that AutoML can use the header with column type information from the files.
Just today I added two more text columns to my data and expected AutoML to recognize 19 float and 2 text columns, but instead AutoML treats all columns as strings and tries to apply OneHotEncoding and OneHotHashEncoding to float numbers 🙈
@sergey-tihon: Yes, certainly file additional issues as you run across bugs or ways to improve the product.
ML.NET supports at least two types of IDataView serialization out of the box: text and binary files. So I can use either of the two to prepare my data set for AutoML.
But when I try to use a serialized file as input for AutoML (both the CLI and GUI versions), it is unable to parse it.
Binary format
Using the binary format, I see the following error:
Text format
With --verbosity diag, it gets stuck on the line:
With default verbosity, it returns a type mismatch error:
But the data file looks correct (it was serialized by ML.NET). This is the header and the first lines of the dataset: