dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Can't use LoadFromEnumerable in F# #6997

Open thomasd3 opened 4 months ago

thomasd3 commented 4 months ago

Using a BinaryClassifier with AutoML in F#, I have my data structured as this type:

myData: float32 array list

So each row is a float32 array, and I have a list of rows. The first column is the label; all other columns are the features.

Using context.Data.LoadFromEnumerable () does not work on this data type. While the list implements IEnumerable, I can't use that function.

For now, I'm using something very ugly: I write the data to a CSV file and load it back from disk.

The row is defined as:

    [<CLIMutable>]
    type DataRow =
        {
            [<ColumnName "Label"; LoadColumn(0)>]
            Label: bool
            [<ColumnName "Features"; LoadColumn(
                [|
                    001; 002; 003; 004; 005; 006; 007; 008; 009
                    011; 012; 013; 014; 015; 016; 017; 018; 019
                    021; 022; 023; 024; 025; 026; 027; 028; 029
                    031; 032; 033; 034; 035; 036; 037; 038; 039
                    041; 042; 043; 044; 045; 046; 047; 048; 049
                    051; 052; 053; 054; 055; 056; 057; 058; 059
                    061; 062; 063; 064; 065; 066; 067; 068; 069
                    071; 072; 073; 074; 075; 076; 077; 078; 079
                    081; 082; 083; 084; 085; 086; 087; 088; 089
                    091; 092; 093; 094; 095; 096; 097; 098; 099
                    101; 102; 103; 104; 105; 106; 107; 108; 109
                    111; 112; 113; 114; 115; 116; 117; 118; 119
                    121; 122; 123; 124; 125; 126; 127; 128; 129
                    131; 132; 133; 134; 135; 136; 137; 138; 139
                    141; 142; 143; 144; 145; 146; 147; 148; 149
                    151; 152; 153; 154; 155; 156; 157; 158; 159
                    161; 162; 163; 164; 165; 166; 167; 168; 169
                    171; 172; 173; 174; 175; 176; 177; 178; 179
                    181; 182; 183; 184; 185; 186; 187; 188; 189
                    191; 192; 193; 194; 195; 196; 197; 198; 199
                    201; 202; 203; 204; 205; 206; 207; 208; 209
                    211; 212; 213; 214; 215; 216; 217; 218; 219
                    221; 222; 223; 224; 225; 226; 227; 228; 229
                    231; 232; 233; 234; 235; 236; 237; 238; 239
                    241; 242; 243; 244; 245; 246; 247; 248; 249
                |]
            )>]
            Features: float32 array
        }

which makes very little sense...

and the loading code is even worse:

    let trainModel (l: float32 array list) =

        // create the MLContext
        let context = MLContext()

        // add logging
        context.Log.Add (fun m -> if m.Kind <> ChannelMessageKind.Trace then info m.RawMessage)

        // build a filename
        let filename = nanoGuid()

        // build the data as text
        let text =
            l
            |> List.map (fun r -> r |> Array.map string |> String.concat ",")
            |> String.concat "\n"

        // write it to a file
        File.WriteAllText(filename, text)

        // load the data
        let loadedData = context.Data.LoadFromTextFile<DataRow> (filename, hasHeader = false, separatorChar = ',')

        // shuffle the data
        let shuffledData = context.Data.ShuffleRows (loadedData)

How can I use LoadFromEnumerable on a list to avoid this?

michaelgsharp commented 4 months ago

@luisquintanilla you are most familiar with F#, any ideas?

luisquintanilla commented 4 months ago

Okay. I think I was able to validate.

> #r "nuget:Microsoft.ML";;
[Loading C:\Users\luquinta\.packagemanagement\nuget\Projects\12360--b816f9ce-1002-44bf-bd7f-bd9f8ee041a6\Project.fsproj.fsx]
module FSI_0002.Project.fsproj

> open Microsoft.ML;;
> let data = [[|1f;2f;3f|];[|4f;5f;6f|]];;
val data: float32 array list = [[|1.0f; 2.0f; 3.0f|]; [|4.0f; 5.0f; 6.0f|]]

> let ctx = new MLContext();;
val ctx: MLContext

> let dv = ctx.Data.LoadFromEnumerable data;;
System.ArgumentOutOfRangeException: Could not determine an IDataView type and registered custom types for member SyncRoot (Parameter 'rawType')
   at Microsoft.ML.Data.InternalSchemaDefinition.GetVectorAndItemType(String name, Type rawType, IEnumerable`1 attributes, Boolean& isVector, Type& itemType)
   at Microsoft.ML.Data.InternalSchemaDefinition.GetVectorAndItemType(MemberInfo memberInfo, Boolean& isVector, Type& itemType)
   at Microsoft.ML.Data.SchemaDefinition.Create(Type userType, Direction direction)
   at Microsoft.ML.Data.InternalSchemaDefinition.Create(Type userType, Direction direction)
   at Microsoft.ML.Data.DataViewConstructionUtils.CreateFromEnumerable[TRow](IHostEnvironment env, IEnumerable`1 data, SchemaDefinition schemaDefinition)
   at Microsoft.ML.DataOperationsCatalog.LoadFromEnumerable[TRow](IEnumerable`1 data, SchemaDefinition schemaDefinition)
   at <StartupCode$FSI_0006>.$FSI_0006.main@() in C:\Users\luquinta\stdin:line 5
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
Stopped due to error

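This is probably happening because ML.NET uses the property / member names of the row type as the column names. A bare float32 array has no named data members, so ML.NET appears to end up reflecting over the array's own properties (such as SyncRoot in the error above), which it can't map to a column type.

If you're using anonymous records, you can do this and it works: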

> let dataB = [{|x=[|1f;2f;3f|]|};{|x=[|4f;5f;6f|]|}];;
val dataB: {| x: float32 array |} list =
  [{ x = [|1.0f; 2.0f; 3.0f|] }; { x = [|4.0f; 5.0f; 6.0f|] }]

> let dv = ctx.Data.LoadFromEnumerable dataB;;
val dv: IDataView

Your method works as well, @thomasd3, if you're using records. However, you could use this overload so you don't have to specify all the columns:

https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.data.loadcolumnattribute.-ctor?view=ml-dotnet#microsoft-ml-data-loadcolumnattribute-ctor(system-int32-system-int32)

Alternatively, you could also use slicing notation for LoadColumn

LoadColumn(cols[|1..|]

thomasd3 commented 4 months ago

@luisquintanilla, I assemble several float32 arrays into one and depending on the model being trained, I have different lengths (the code is quite generic).

In this case:

type DataRow =
    {
        [<ColumnName "Label"; LoadColumn(0)>]
        Label: bool
        [<ColumnName "Features"; LoadColumn(Features[|1..|])
        Features: float32 array
    }

how can I set up LoadColumn for a variable number of columns?

Additionally, I tried this:

            let aa =
                request.Data
                |> Json.deserialize<float32 array list>
                |> List.map (fun x -> {| Label = x[0]; Features = x[1..] |})

            let loadedData = context.Data.LoadFromEnumerable aa

so, it will load the data properly here; but then, on the training code:

            // shuffle the data
            let shuffledData = context.Data.ShuffleRows (loadedData)

            // cache the data
            let data = context.Data.Cache(shuffledData)

            // define the pipeline
            let settings = BinaryExperimentSettings()
            settings.MaxExperimentTimeInSeconds <- uint request.TimeAllowed.TotalSeconds

            // create the experiment
            let experiment = context.Auto().CreateBinaryClassificationExperiment(settings)

            // train the model
            let result = experiment.Execute (data, labelColumnName = "Label")

but then I get this error:

Schema mismatch for feature column 'Features': expected Vector, got VarVector (Parameter 'inputSchema')

If I try to make the Vector of fixed length, using:

                |> List.map (fun x -> {| Label = x[0]; Features = Vector<float32>(x[1..]) |})

then I get:

Could not determine an IDataView type and registered custom types for member Features (Parameter 'rawType')

when loading the data.

luisquintanilla commented 4 months ago

I think this is happening because you need to add the VectorType attribute to your Features column. In it, you specify the number of columns.

https://learn.microsoft.com/dotnet/machine-learning/how-to-guides/load-data-ml-net#annotating-the-data-model-with-column-attributes

type DataRow =
    {
        [<ColumnName "Label"; LoadColumn(0)>]
        Label: bool
        [<ColumnName "Features"; LoadColumn(Features[|1..|];VectorType(20))
        Features: float32 array
    }

In this case, I put 20, but you should set that to however many feature columns you have.
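
Since VectorType takes a compile-time constant, it doesn't help when the feature count is only known at run time. One possible approach, sketched here under assumptions, is to build a SchemaDefinition and override the Features column type before calling LoadFromEnumerable. The `Row` record, the `loadRows` helper, and the `> 0.5f` label conversion are hypothetical names and choices made for illustration; they are not from the code above.

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

// Hypothetical row type: a Boolean label plus a feature vector.
[<CLIMutable>]
type Row =
    { Label: bool
      Features: float32[] }

// Sketch: load rows whose feature count is only known at run time.
// Assumes a non-empty list in which every array has the same length,
// with the label in position 0.
let loadRows (context: MLContext) (rows: float32[] list) : IDataView =
    let featureCount = (List.head rows).Length - 1

    // Start from the schema inferred from the record, then replace the
    // variable-size vector type with a fixed-size one.
    let schema = SchemaDefinition.Create(typeof<Row>)
    schema.["Features"].ColumnType <-
        (VectorDataViewType(NumberDataViewType.Single, featureCount) :> DataViewType)

    // Assumption: labels are stored as 0.0f / 1.0f in the first column.
    let typed =
        rows |> List.map (fun r -> { Label = r.[0] > 0.5f; Features = r.[1..] })

    context.Data.LoadFromEnumerable(typed, schema)
```

With a fixed-size vector type in the schema, the "expected Vector, got VarVector" mismatch should no longer apply, although this has not been checked against the AutoML experiment in this thread.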

thomasd3 commented 4 months ago

In this case, it won't compile:

[image]

luisquintanilla commented 4 months ago

What's Features in your example? Is it an array containing the indices you want to load?

thomasd3 commented 4 months ago

It’s a float32 array with all the training data for that row

luisquintanilla commented 4 months ago

Okay. I think that might be the issue. The argument to LoadColumn needs to be an array of column indices, not the data itself.

You can also use the overload LoadColumn(int start, int end), where start is the index of the column where your data begins and end is the index of the last column you want to read.
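
As a rough sketch of what that looks like on the record, assuming 20 feature columns in file columns 1 through 20 (the count is only an example, and `loadCsv` is a hypothetical helper name):

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

[<CLIMutable>]
type DataRow =
    {
        // Column 0 of the file holds the label.
        [<ColumnName "Label"; LoadColumn(0)>]
        Label: bool
        // Columns 1..20 (inclusive) are read into one fixed-size vector.
        // 20 is an assumed feature count; both numbers must be constants.
        [<ColumnName "Features"; LoadColumn(1, 20); VectorType(20)>]
        Features: float32 array
    }

// Hypothetical helper: load a headerless, comma-separated file.
let loadCsv (context: MLContext) (path: string) : IDataView =
    context.Data.LoadFromTextFile<DataRow>(path, hasHeader = false, separatorChar = ',')
```

Because attribute arguments have to be constants, this only fits when the number of feature columns is fixed per record type.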

thomasd3 commented 4 months ago

Ah yes, that compiles! I didn't know that overload was there. Thanks a lot!

As a quick side question, can I also load a model from memory? I'm currently loading it from a db (I've stored the zip file) and I need to save it on disk to load it:

    let loadModelAsync (postgresConnectionString: string) (coreName: string) (coreVersion: int) (ticker: Ticker) (intervals: string) (modelName: string) =
        asyncResultOption {

            // get the model from the database
            let! model = MLModelDatabase.getModelAsync postgresConnectionString coreName coreVersion ticker intervals modelName

            // build a filename
            let modelDataFilename = nanoGuid()

            // write the model to a file
            do! File.WriteAllBytesAsync(modelDataFilename, model.Model)

            // create the MLContext
            let context = MLContext()

            // load the model
            let model, _ = context.Model.Load modelDataFilename

            // erase the file
            File.Delete modelDataFilename

            // return the model
            return model
        }

Which is a similar problem to when I was doing the training.

I guess the 'LoadWithDataLoader' is probably what is worth investigating:

[image]
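
For loading from memory, a minimal sketch, assuming the stored bytes are the same zip that context.Model.Save produced: context.Model.Load also has an overload that takes a Stream, so the bytes can be wrapped in a MemoryStream instead of being written to disk. `loadModelFromBytes` is a hypothetical helper name.

```fsharp
open System.IO
open Microsoft.ML

// Sketch: load a model directly from the bytes fetched from the database.
// Returns the transformer together with the input schema stored in the model.
let loadModelFromBytes (context: MLContext) (modelBytes: byte[]) : ITransformer * DataViewSchema =
    // MemoryStream is seekable, which reading the zip archive relies on.
    use stream = new MemoryStream(modelBytes)
    // The trailing out parameter (the input schema) comes back as the
    // second element of the tuple in F#.
    let model, inputSchema = context.Model.Load(stream)
    model, inputSchema
```

In the function above, this would replace the write / load / delete steps with something like `let model, _ = loadModelFromBytes context model.Model`.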