dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Onnx input and output data type #6469

Closed triandco closed 1 year ago

triandco commented 1 year ago

Is your feature request related to a problem? Please describe. I was trying to follow a guide on how to use an ONNX model with .NET. I find it difficult to understand how to translate the input and output data types into C#. The types as displayed in Netron look like Python, but not quite. From the example I understand that something like int32[n,1] would be a single int32 value; however, in one of my models I found the type float32[batch,sequence,768], which is harder to translate.

Describe the solution you'd like Is there any further documentation on what these types are?

Describe alternatives you've considered I opened an issue on Netron's repo asking if there is any documentation on the type and he suggested that I post an issue on MS side.

I appreciate that there is already an open issue about improving this documentation. However, is there any quick pointer on this particular issue?

Context The model I am trying to run is an ONNX export of msmarco-distilbert-base-tas-b from Hugging Face.

Much appreciated

triandco commented 1 year ago

Related to this, I have created a repository documenting what I was trying to do. Unfortunately, there is an error that stops me from running the model.

Unhandled exception. System.ArgumentOutOfRangeException: Could not determine an IDataView type and registered custom types for member InputIds (Parameter 'rawType')
   at Microsoft.ML.Data.InternalSchemaDefinition.GetVectorAndItemType(String name, Type rawType, IEnumerable`1 attributes, Boolean& isVector, Type& itemType)
   at Microsoft.ML.Data.InternalSchemaDefinition.GetVectorAndItemType(MemberInfo memberInfo, Boolean& isVector, Type& itemType)
   at Microsoft.ML.Data.SchemaDefinition.Create(Type userType, Direction direction)
   at Microsoft.ML.Data.InternalSchemaDefinition.Create(Type userType, Direction direction)
   at Microsoft.ML.Data.DataViewConstructionUtils.CreateFromEnumerable[TRow](IHostEnvironment env, IEnumerable`1 data, SchemaDefinition schemaDefinition)
   at Microsoft.ML.DataOperationsCatalog.LoadFromEnumerable[TRow](IEnumerable`1 data, SchemaDefinition schemaDefinition)
   at Library.Test.get_prediction_pipeline(String file_path, MLContext mlContext) in D:\Developer\triandco\blau\prototypes\sbert-dotnet\src\App\Lib.fs:line 33
   at Library.Test.run(String file_path) in D:\Developer\triandco\blau\prototypes\sbert-dotnet\src\App\Lib.fs:line 40
   at <StartupCode$App>.$Program.main@() in D:\Developer\triandco\blau\prototypes\sbert-dotnet\src\App\Program.fs:line 5

I'm still unsure whether this is caused by my lack of understanding or is actually a bug.

Here are my current input and output models:


type OnnxInput() =
  [< ColumnName("input_ids") >]
  member val InputIds: int64 seq seq = [[]] with get, set

  [< ColumnName("attention_mask")>]
  member val AttentionMask: int64 seq seq = [[]] with get, set

type OnnxOutput() =
  [< ColumnName("last_hidden_state") >]
  member val LastHiddenState: float32 seq seq = [[]] with get, set

I have also tried the OnnxSequenceType attribute, only to get the same error message.

type OnnxInput() =
  [< ColumnName("input_ids"); OnnxSequenceType(typedefof<int64 seq>) >]
  member val InputIds: int64 seq seq = [[]] with get, set

  [< ColumnName("attention_mask"); OnnxSequenceType(typedefof<int64 seq>)>]
  member val AttentionMask: int64 seq seq = [[]] with get, set

type OnnxOutput() =
  [< ColumnName("last_hidden_state"); OnnxSequenceType(typedefof<float32 seq>) >]
  member val LastHiddenState: float32 seq seq = [[]] with get, set

michaelgsharp commented 1 year ago

@luisquintanilla I know we have already discussed making this whole process more intuitive. Any quick pointers to help here though? You are much more familiar with F# than I am.

luisquintanilla commented 1 year ago

Hi @triandco

Thanks for your question. There are a few issues at hand here:

  1. ML.NET expects tensors (N-dimensional arrays) to be represented as one-dimensional. For example, I would change the definition of InputIds to:

    member val InputIds: int64 seq = [] with get, set
  2. ML.NET works with Single values, so you might want to perform some mapping:

    [< ColumnName("input_ids"); OnnxMapType(typedefof<Int64>, typedefof<Single>);
       OnnxSequenceType(typedefof<Single>) >]
  3. ML.NET supports only one unknown dimension. For example, batch and sequence are both unknown dimensions for input_ids. You can tell because they have a variable name instead of a number. While you can set one of the dimensions to -1 to indicate that it is unknown, you need to define the rest of the dimensions. While not the same, here is a sample that does that with the BiDAF ONNX model.
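
To make point 1 concrete, here is a minimal sketch of what "one-dimensional" means for a shape such as float32[batch,sequence,768]. It assumes row-major layout (the last dimension varies fastest), which is how ONNX tensors are normally stored. The helper names flatten, offset and unflatten are made up for illustration and are not part of ML.NET.

```fsharp
/// Flattens a [sequence][hidden] table into one row-major array.
/// (Hypothetical helper, not an ML.NET API.)
let flatten (rows: float32[][]) : float32[] =
    Array.concat rows

/// Position of element (b, s, h) inside the flat buffer of a
/// row-major [batch, sequence, hidden] tensor.
let offset (sequence: int) (hidden: int) (b: int) (s: int) (h: int) : int =
    (b * sequence + s) * hidden + h

/// Splits a flat buffer back into rows of `hidden` values each,
/// e.g. one 768-value embedding per token.
let unflatten (hidden: int) (flat: float32[]) : float32[][] =
    Array.chunkBySize hidden flat
```

For a single sentence (batch = 1), `unflatten 768 lastHiddenState` would give one 768-value row per token, where lastHiddenState stands for the flat output array.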

Hope this helps.

Another unsolicited tip: you can use records in F#.

[<CLIMutable>]
type OnnxInput =
    {
        [<ColumnName("input_ids")>] InputIds : int64 seq
        //...
    }

yli223 commented 1 year ago

I am facing the same issue here. My data type is int64[batch,sequence], and I still don't have a clue how to make it work. Has anyone figured it out?

Thanks in advance for any help!

triandco commented 1 year ago

@yli223 I had some luck with the tips from @luisquintanilla 🙇‍♂️. They were very helpful for understanding the types. Thank you @luisquintanilla. However, I gave up in the end because I kept getting blocked by other issues.

In the end, I decided to use an InferenceSession to run the model. You can see my code here

It doesn't have all the type safety of the above implementation, but it works. 😅
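
The linked code is not reproduced in this thread. As a rough sketch of the InferenceSession route, assuming the Microsoft.ML.OnnxRuntime NuGet package and the input and output names from the model above (input_ids, attention_mask, last_hidden_state), something like the following should be close. The function name runModel and the fixed batch size of 1 are illustrative choices, not the actual implementation.

```fsharp
open System
open System.Linq
open Microsoft.ML.OnnxRuntime
open Microsoft.ML.OnnxRuntime.Tensors

/// Runs one already-tokenized sentence through the model and returns
/// last_hidden_state as a flat, row-major float32 array.
/// (Hypothetical helper; tokenization is assumed to happen elsewhere.)
let runModel (modelPath: string) (inputIds: int64[]) (attentionMask: int64[]) : float32[] =
    use session = new InferenceSession(modelPath)

    // Shape [batch = 1, sequence]: both dynamic dimensions get concrete sizes here.
    let dims = [| 1; inputIds.Length |]
    let toTensor (values: int64[]) =
        DenseTensor<int64>(Memory<int64>(values), ReadOnlySpan<int>(dims))

    let inputs =
        [ NamedOnnxValue.CreateFromTensor<int64>("input_ids", toTensor inputIds)
          NamedOnnxValue.CreateFromTensor<int64>("attention_mask", toTensor attentionMask) ]

    use results = session.Run(inputs)

    // Copy the values out before `results` is disposed.
    let hidden =
        results.First(fun r -> r.Name = "last_hidden_state").AsTensor<float32>()
    Seq.toArray hidden
```

With batch = 1 the returned array holds sequence × 768 values for this model, one 768-value block per token.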

yli223 commented 1 year ago

@triandco Thank you!