Open SkinnyMan32 opened 5 years ago
What am i doing wrong?
I believe you are not doing anything wrong, but unfortunately I also believe that it's currently not possible to do what you want (i.e. use a loaded CustomMappingTransformer whose InputSchemaDefinition originally had a ColumnName that was different from its MemberName).
The problem is that when a CustomMappingTransformer is saved, only its contractName actually gets saved; the InputSchemaDefinition and AddedSchema aren't saved (see code). So all the information in those SchemaDefinitions is lost and, in your particular case, when the CustomMappingTransformer is loaded back, the ColumnName of the InputSchemaDefinition defaults to the MemberName "Words" instead of the custom "Tokens" name you used when calling your StemText() extension method.
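To make the failure mode concrete, here is a minimal sketch of how a remapped input column is typically set up. The original StemmerInput, StemmerOutput and StemText() are not shown in this thread, so the types, the factory class, the "StemText" contract name and the placeholder mapping below are all assumptions for illustration only.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

// Hypothetical input type: the member is called Words.
public class StemmerInput
{
    public string[] Words { get; set; }
}

// Hypothetical output type.
public class StemmerOutput
{
    public string[] Stemmed { get; set; }
}

// A factory registered under a contract name is what lets a saved
// custom mapping be reconstructed at load time.
[CustomMappingFactoryAttribute("StemText")]
public class StemmerFactory : CustomMappingFactory<StemmerInput, StemmerOutput>
{
    public override Action<StemmerInput, StemmerOutput> GetMapping()
        // Placeholder mapping: copies tokens through instead of stemming them.
        => (input, output) => output.Stemmed = input.Words;
}

public static class SchemaRemapSketch
{
    // Builds a schema in which the Words member reads from a column named "Tokens".
    public static SchemaDefinition BuildInputSchema()
    {
        var inputSchema = SchemaDefinition.Create(typeof(StemmerInput));
        inputSchema["Words"].ColumnName = "Tokens";
        return inputSchema;
    }

    // The remapped schema exists only on this in-memory estimator.
    // Per the explanation above, only the contract name is written on save,
    // so a loaded model looks for a column named "Words" again.
    public static CustomMappingEstimator<StemmerInput, StemmerOutput> Build(MLContext ml)
    {
        return ml.Transforms.CustomMapping(
            new StemmerFactory().GetMapping(),
            contractName: "StemText",
            inputSchemaDefinition: BuildInputSchema());
    }
}
```

Fitting this estimator on data with a "Tokens" column should work in memory. The exception described in this issue would only be expected after the fitted model is saved and loaded again (with the factory's assembly registered through ml.ComponentCatalog.RegisterAssembly).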
Moreover, the current implementation of SchemaDefinition doesn't include methods to save objects of that class, and there are currently no plans to implement them. So I am sad to say there's no immediate solution to the bug you've found.
On the other hand, as you probably know, there are workarounds (I will still explain them here, in case somebody else finds them useful in the future). The only way to make this work is to have a MemberName that matches the ColumnName from the beginning, so you could:
Either rename the Words member in StemmerInput to Tokens
Or change the "Tokens" strings in your pipeline definition to "Words":
var pipeline = ml.Transforms.Text.NormalizeText("Normalized", "Description")
    .Append(ml.Transforms.Text.TokenizeIntoWords("Words", "Normalized"))
    .Append(ml.Transforms.StemText("Stemmed", "Words"));
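For the first workaround (renaming the member), a minimal sketch could look like the following. The real StemmerInput is not shown in this thread, so the string[] member type is an assumption.

```csharp
// Hypothetical shape of StemmerInput after the rename.
// The member name now matches the "Tokens" column directly, so no
// SchemaDefinition remapping is needed and nothing is lost on save/load.
public class StemmerInput
{
    public string[] Tokens { get; set; }
}
```

With this version the pipeline can keep using "Tokens" as the column name, and the CustomMapping call no longer needs an inputSchemaDefinition argument.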
So even though workarounds exist for this issue, the underlying problem of not serializing the SchemaDefinition objects remains, and it's a bug that should be fixed eventually.
Furthermore, if PR #4676 gets merged to add the StatefulCustomMappingTransformer, then this issue might also affect that.
I am also adding the "Breaking API change" label to this, in case this issue of serializing SchemaDefinitions isn't fixed before a new major ML.NET release gets planned. It might be worth exploring the possibility of removing the SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null parameters from the CustomMapping method (link), so that users don't create these SchemaDefinitions and run into this issue down the road.
System information
Issue
I am trying to use CustomMapping with specified column names. It works fine, but after saving and loading the model I get an exception: System.ArgumentOutOfRangeException: "Could not find column 'Words'". 'Words' is the original name of the property, not the name I specified.
What am I doing wrong?
I can fix it by using only the original names, but that is inconvenient in some cases.
Source code