dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.04k stars 1.89k forks source link

SchemaDefinition.Create picks up internal fields instead of just public only - affects F# #6209

Open fwaris opened 2 years ago

fwaris commented 2 years ago

This used to work before but now I cannot use CreateFromEnumerable in F# now.

In F#, we define mutable classes by annotating F# 'record' types with CLIMutable:

[<CLIMutable>]
type D = 
   { 
         Data :  float[]
   }

F# compiler generates IL that looks as follows:

public class D  [ interface list  ] 
{
        internal Data@ = Double[];

    public Double[] Data
    {
        get
        {
            return Data@;
        }
        set
        {
            Data@ = value;
        }
    }

     [ more methods including a default constructor ]
}

The SchemaDefinition.Create method picks up both "Data@" and "Data" as fields required by the schema. It should only pickup the public fields.

Please add unit tests for F# as well so these issues are caught earlier.

@dsyme

fwaris commented 2 years ago

I just found out that one workaround for the above issue is to set "ignoreMissingColumns=true" when calling CreateEnumerable(...)

michaelgsharp commented 2 years ago

Glad you found a workaround, but yeah that isn't the behavior we want I don't think.

When did you notice this change? We haven't made a recent changes to SchemaDefinition.Create that I am aware of. I wonder if it has to do with a .NET version change? What version are you using now? Could you also try an older version?

fwaris commented 2 years ago

I think the change was F# tooling rather than ML.Net (rethinking the issue). Some newer version of F# (maybe 5.x) changed the IL generation for mutable records, surfacing this issue.

Totally agree that only public fields should be considered for serialization/deserialization.

Note also that, as-is, when creating a DataView (LoadFromEnumerable…) , the xxx@ fields also get added into the dataset, bloating its size.

Sent from Mailhttps://go.microsoft.com/fwlink/?LinkId=550986 for Windows

From: Michael @.> Sent: Monday, June 13, 2022 1:51 PM To: @.> Cc: @.>; @.> Subject: Re: [dotnet/machinelearning] SchemaDefinition.Create picks up internal fields instead of just public only - affects F# (Issue #6209)

Glad you found a workaround, but yeah that isn't the behavior we want I don't think.

When did you notice this change? We haven't made a recent changes to SchemaDefinition.Create that I am aware of. I wonder if it has to do with a .NET version change? What version are you using now? Could you also try an older version?

— Reply to this email directly, view it on GitHubhttps://github.com/dotnet/machinelearning/issues/6209#issuecomment-1154210278, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AACGZMFVW53NGRAHXS6NG2TVO5YKFANCNFSM5XKT6S2Q. You are receiving this because you authored the thread.Message ID: @.***>

michaelgsharp commented 2 years ago

Since this is a change by the F# tooling there is nothing that we can do about it unfortunately. Closing this for now, but if things change in the future please create a new issue and we will take another look.

dsyme commented 2 years ago

@michaelgsharp Just to check....

There are some situations where F# needs to emit public fields - notably when emitting code into multiple assemblies in F# Interactive.

My understanding is that serialization of fields should ignore

This is what System.Json and Newtonsoft.Json both do. Does SchemaDefinition do that? It would be great if it does.

See also https://github.com/dotnet/fsharp/pull/13494 which is related

@fwaris Do you know if you hit this issue when using F# Interactive?

Thanks Don

michaelgsharp commented 2 years ago

@dsyme let me reopen the issue for now while I look into that. I am not actually sure off of the top of my head.

Szer commented 4 months ago

Just in case, it is still an issue as all ML.NET guides advice to use records and they are breaking in F# interactive environment. One have to know about hacky workaround passing --multiemit- to FSI which is not a common knowledge

fwaris commented 4 months ago

Check out MLUtils that has some utility code to help use ML.Net from F#. You can use cleanSchema to remove '@' fields from the default schema

#r "nuget: MLUtils"
open MLUtils
type NRow  = { Field1:int; ...}
let xs : NRow list = []
let dv = ctx.Data.LoadFromEnumerable(xs,Schema.cleanSchema typeof<NRow>)

Going the other way - from IDataView to F# record - ignoreMissingColumns flag works.

 ctx.Data.CreateEnumerable<NRow>(dv,false,ignoreMissingColumns=true)
fwaris commented 4 months ago

@michaelgsharp Just to check....

There are some situations where F# needs to emit public fields - notably when emitting code into multiple assemblies in F# Interactive.

My understanding is that serialization of fields should ignore

  • internal fields
  • private fields
  • public fields marked "CompilerGenerated".

This is what System.Json and Newtonsoft.Json both do. Does SchemaDefinition do that? It would be great if it does.

See also dotnet/fsharp#13494 which is related

@fwaris Do you know if you hit this issue when using F# Interactive?

Thanks Don

@dsyme, sorry, did not see this until now. There is a workaround. See the last few messages in this conversation for some notes on the MLUtils package.