Open will-m-buchanan opened 4 years ago
Hi @gillbuchanan, thanks for opening this issue.
Have a look at the UnaryTransformer implementations in apache/spark, like this one:
https://github.com/apache/spark/blob/76e5294bb65f991988b73c8e6541b4b06d095127/mllib/src/main/scala/org/apache/spark/ml/feature/PolynomialExpansion.scala#L41
As you can see, they override the uid during object instantiation via override val uid. You might also want to implement setters like this one: https://github.com/apache/spark/blob/76e5294bb65f991988b73c8e6541b4b06d095127/mllib/src/main/scala/org/apache/spark/ml/feature/PolynomialExpansion.scala#L65
so that it won't call the parent class setter. Let me know how that goes.

Hey @gillbuchanan,
Thanks for the question. As Luca said, the uid for transformers does need to be unique, or else you run into serialization issues, as you've discovered.
Can I please ask a follow-up question: at what stage do you run into this issue? Is there perhaps a quick, straightforward example that reproduces it?
I tried just changing the uid field to not be static, so that each instance gets its own uid (as it should be anyway), but this causes problems when trying to set Params using transformer.set(myParam, value):
Thanks, Anca
I'm attempting to rebuild our ML pipeline using MLeap to serve our models. Our transformers are written in Java and mostly extend Spark's UnaryTransformer class. They all also start with something like

Because uid is static, the variable gets created the first time a PatternNormalizerTransformer (or whatever transformer) is instantiated, then all subsequent instances use the same value. This means if a pipeline has more than one PatternNormalizerTransformer they'll actually both have the same uid. This hasn't proven to be a problem in the past because when Spark serializes its stages it prepends the stage number to the beginning of the uid. For example, in the stages directory of one of our serialized Spark models we might have stage subdirectories like

Despite both PatternNormalizerTransformers having the same uid ("PatternNormalizerTransformer_011f91059577"), there's no conflict because of the stage number. MLeap does not seem to prepend this number when serializing though, so I'm running into "file already exists: PatternNormalizerTransformer_011f91059577" type errors when trying to serialize a model that has more than one of a particular transformer type.*

I tried just changing the uid field to not be static, so that each instance gets its own uid (as it should be anyway), but this causes problems when trying to set Params using transformer.set(myParam, value):

The param name "null__inputCol" hints as to what is going on. Usually, the param is named "{uid}__inputCol". The fact that here it's null suggests that the inputCol (and presumably outputCol) Params are created as part of the UnaryTransformer parent class before uid has been set. At that point the Param is thought to belong to a transformer with uid "null". Then once the PatternNormalizerTransformer is instantiated and given an actual uid, that uid no longer matches the "null" value associated with the Param and throws an error.

So I'm currently stuck. My hope is there is a way to define how MLeap names its stage directories during serialization, i.e. prepending stage index numbers to the uid. Is this possible? Alternatively, are there any other solutions to this issue?
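The initialization-order problem described above can be reproduced without Spark. The sketch below is plain Java with hypothetical stand-in classes (FakeParam, FakeUnaryTransformer, EagerUidTransformer, LazyUidTransformer are not Spark or MLeap types). It assumes, as the post suggests, that the parent constructor creates its params by asking the subclass for its uid, and that a param is printed as the owner uid, a double underscore, and the param name. It also shows one possible workaround, which may or may not fit a real transformer: create the uid lazily inside uid(), so that it already exists when the parent constructor asks for it.

```java
import java.util.UUID;

class UidInitOrderDemo {
    public static void main(String[] args) {
        EagerUidTransformer eager = new EagerUidTransformer();
        System.out.println(eager.inputCol);             // prints "null__inputCol"
        System.out.println(eager.owns(eager.inputCol)); // prints "false"

        LazyUidTransformer lazy = new LazyUidTransformer();
        System.out.println(lazy.inputCol);              // uid is random, so the output varies
        System.out.println(lazy.owns(lazy.inputCol));   // prints "true"
    }
}

/** Stand-in for a Spark Param: it only records the owner's uid at creation time. */
final class FakeParam {
    final String parentUid;
    final String name;

    FakeParam(String parentUid, String name) {
        this.parentUid = parentUid;
        this.name = name;
    }

    @Override
    public String toString() {
        return parentUid + "__" + name; // assumed naming format
    }
}

/** Stand-in for UnaryTransformer: the parent constructor creates the params. */
abstract class FakeUnaryTransformer {
    final FakeParam inputCol;

    FakeUnaryTransformer() {
        // Runs before any subclass field initializer has executed.
        this.inputCol = new FakeParam(uid(), "inputCol");
    }

    abstract String uid();

    /** Mimics an ownership check: the param must have been created under this uid. */
    boolean owns(FakeParam p) {
        String mine = uid();
        return mine != null && mine.equals(p.parentUid);
    }

    /** Hypothetical uid generator: prefix plus the last 12 characters of a random UUID. */
    static String randomUid(String prefix) {
        String s = UUID.randomUUID().toString();
        return prefix + "_" + s.substring(s.length() - 12);
    }
}

/** Non-static field with an initializer: assigned only after the parent constructor returns. */
final class EagerUidTransformer extends FakeUnaryTransformer {
    private final String uid = randomUid("PatternNormalizerTransformer");

    @Override
    String uid() {
        return uid; // still null while the parent constructor is running
    }
}

/** Lazy uid: the field has no initializer, so the value set during super() is kept. */
final class LazyUidTransformer extends FakeUnaryTransformer {
    private String uid; // deliberately no initializer, which would overwrite the lazy value

    @Override
    String uid() {
        if (uid == null) {
            uid = randomUid("PatternNormalizerTransformer");
        }
        return uid;
    }
}
```

The important detail in the lazy variant is that the field is declared without an initializer. Java runs field initializers after the superclass constructor, so an explicit initializer (even "= null") would replace the uid that was generated while the parent was creating its params.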
* I thought of creating a transformer that has multiple input/output columns, to avoid having to have more than one instance of a particular transformer in a pipeline, but as these transformers often take different sets of parameters (for example, the PatternNormalizerTransformer takes a regex pattern parameter and a case-sensitive bool parameter), they would require their own transformers
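To make the directory-name collision from the issue body concrete, here is a minimal plain-Java sketch. It is not MLeap or Spark code: indexedDirName and hasDuplicates are hypothetical helpers, and the zero-padded "index_uid" format is an assumption based on the description that the stage number is prepended to the uid. It only illustrates why two stages sharing a uid collide when the bare uid is used as the directory name, and why an index prefix avoids that.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

class StageDirNamingDemo {
    /** Hypothetical naming scheme: zero-padded stage index, an underscore, then the uid. */
    static String indexedDirName(int stageIdx, int numStages, String uid) {
        int digits = Integer.toString(numStages).length();
        return String.format("%0" + digits + "d_%s", stageIdx, uid);
    }

    /** True if any name occurs more than once, i.e. two stages would share a directory. */
    static boolean hasDuplicates(List<String> names) {
        return new HashSet<>(names).size() != names.size();
    }

    public static void main(String[] args) {
        // Two stages of the same type that share one uid, as with a static uid field.
        List<String> uids = List.of(
                "PatternNormalizerTransformer_011f91059577",
                "PatternNormalizerTransformer_011f91059577");

        List<String> indexed = new ArrayList<>();
        for (int i = 0; i < uids.size(); i++) {
            indexed.add(indexedDirName(i, uids.size(), uids.get(i)));
        }

        System.out.println(hasDuplicates(uids));    // prints "true": bare uids collide
        System.out.println(hasDuplicates(indexed)); // prints "false": the index disambiguates
        System.out.println(indexed.get(0));         // prints "0_PatternNormalizerTransformer_011f91059577"
    }
}
```

Giving every instance its own uid removes the collision at the source, so a prefix scheme like this would only be needed if the uids cannot be made unique.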