TIBCOSoftware / catalystml

CatalystML is an open source specification for real-time feature processing, purpose built to transform data for machine learning models.
BSD 3-Clause "New" or "Revised" License
2 stars 3 forks source link

casting between types #9

Closed abramvandergeest closed 5 years ago

abramvandergeest commented 5 years ago

I was talking with @mellistibco about casting types and controlling the types of output. The two things we were thinking about: 1) it makes sense to have operators that cast between types 2) With the input and output types being explicitly named it makes sense to have auto casting built into the output of the implementation

@fm-tibco @skothari-tibco what do you guys think?

skothari-tibco commented 5 years ago

Don't we already do 'coerce' (which is, type casting) in metadata of operations ? How different will the type casting that you describe, would be? Example of coerce in operation https://github.com/skothari-tibco/operation/blob/master/metadata.go#L24 .

mellistibco commented 5 years ago

@skothari-tibco thats specific to the implementation, as this is a specification, we'll need to capture a generic mechanism to achieve things like type coercion.

abramvandergeest commented 5 years ago

@skothari-tibco - So the concern is that tensorflow requires specific types to be read into its models (float32 versus float 64 will through an error, the types are determined when you build the model) so being able to set the type in the output (i.e. cast the type) would allow us to make sure the output type is what is needed for the model. And this is important since preprocessing into a ML framework is one of the main use-cases for CML.

abramvandergeest commented 5 years ago

So currently we have it in the spec that is type of final output does not match the type of the data it should raise an error. Should this be changed to "cast to type"?

skothari-tibco commented 5 years ago

So as per small talk between me and @abramvandergeest , we have to decide about how much details and abstraction we wanna have in our specification. I think the specification should be more like a blueprint of what we wanna do and not worry too much about types of input/output, implementation should handle that. Whereas, bram suggestion is that we have more of the use case geared specification, where we specify about the types and so on. @mellistibco @fm-tibco any thoughts on this ?

mellistibco commented 5 years ago

I think the spec should dictate that the implementation will output the specified data type from the output definition in the JSON. I don't think there is a need for the person writing the stages to explicitly add cast operations. However, I don't see anything wrong with having cast operations available in the event that the user wants to do something specific when defining the various stages.

This does bring up something interesting, however:

"output": {
    "type": "map",
    "data":{"feat1":"$finalout['0_0']"}
    }

How should this be handled? I don't think the implementation has enough detail here to ensure that each element in the map is of a specific type, in which case the user would need to execute a cast operation on finalout...

Maybe we should add the subtypes of the map? I suppose it could get more complicated if the map is made up of multiple different data types...

"output": {
    "type": {
"rootTyp": "map",
"childTyp": "float32"
},
    "data":{"feat1":"$finalout['0_0']"}
    }
mellistibco commented 5 years ago

Perhaps the easiest thing to do for the initial version is to just throw an error if the final output does not match the specified type, and expect a cast operation to be added in the event the values of a map need to be cast to a specific type (float32, for example)?

skothari-tibco commented 5 years ago

May be something like

{
   "type":"map[string]any",
   "data" : "{\"feat1\":\"$finalout['0_0']\"}"
}

I'm aware that this is very go-like. May be we can also support both map/dict "dict[string]any" (which I think should cover go, python, java ) ? (@abramvandergeest would this create confusion in python as a single dict instance can hold any type).

abramvandergeest commented 5 years ago

I actually was thinking the same thing as @skothari-tibco for the type definition. any shouldn't be a problem with python, map and lists in python can be any type (though numpy/pandas limits arrays/cols to a single type)

mellistibco commented 5 years ago

So perhaps some logic like this:

If the type is a map[]any AND the user expects the data types of each map to actually be different, then they'd need to call casting operations for each specific value above. For example:

"data":{"feat1":"$finalout['0_0']", "feat2":"$something['0_1']"}

and feat2 should be a float64 while feat1 is a float32. This scenario will need to be explicitly handled by the person building the series of stages, etc (in reality, probably a very uncommon scenario).

If we have a type of map[]float64 then the implementation (and the spec should require this) should ensure that the value of each element is cast to float64.

Does that seem reasonable?

skothari-tibco commented 5 years ago

Yes, that seems reasonable.

abramvandergeest commented 5 years ago

In that case I should probably put "any" as a data type with "ONLY FOR OUTPUT" associated with it. I think I can write something like that into the output

abramvandergeest commented 5 years ago

Matt and I talk a little bit more about this. For first pass lets leave it with the error of types don't match. However, in the future, it would be good to cast to the type defined in the output. If the cast is stretching (say string to float) give a warning.