combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

Serialization filename conflicts when transformers have the same UID #663

Open · will-m-buchanan opened this issue 4 years ago

will-m-buchanan commented 4 years ago

I'm attempting to rebuild our ML pipeline using MLeap to serve our models. Our transformers are written in Java and mostly extend Spark's UnaryTransformer class. They all also start with something like

// static: initialized once when the class loads, then shared by every instance
private static String uid = Identifiable$.MODULE$.randomUID("PatternNormalizerTransformer");

Because uid is static, it is initialized only once, when the class is first loaded, and every instance created afterwards shares the same value. This means that if a pipeline has more than one PatternNormalizerTransformer, they will all have the same uid. This hasn't proven to be a problem in the past because when Spark serializes its stages it prepends the stage number to the uid. For example, in the stages directory of one of our serialized Spark models we might have stage subdirectories like

00_PatternNormalizerTransformer_011f91059577
01_PatternNormalizerTransformer_011f91059577
02_DowncaseTransformer_0f24dea8e78f
03_NumberNormalizerTransformer_30ce0fc90f82
etc.

Despite both PatternNormalizerTransformer instances having the same uid, "PatternNormalizerTransformer_011f91059577", there's no conflict, because the stage number makes each directory name unique. MLeap does not seem to prepend this number when serializing, though, so I'm running into "file already exists: PatternNormalizerTransformer_011f91059577"-type errors when trying to serialize a model that has more than one instance of a particular transformer type.*
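To make the sharing concrete, here is a minimal sketch (assuming our transformer exposes the static field via Spark's standard uid() accessor):

// Both instances read the same static field, so their uids are identical.
PatternNormalizerTransformer first = new PatternNormalizerTransformer();
PatternNormalizerTransformer second = new PatternNormalizerTransformer();
System.out.println(first.uid());                      // PatternNormalizerTransformer_011f91059577
System.out.println(first.uid().equals(second.uid())); // true: MLeap names the stage file from this uid alone, hence the collision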

I tried just changing the uid field to not be static, so that each instance gets its own uid (as it should be anyway), but this causes problems when trying to set Params using transformer.set(myParam, value):

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Param null__inputCol does not belong to PatternNormalizerTransformer_a2b7256a4abb.

The param name "null__inputCol" hints at what is going on. Normally, the param is named "{uid}__inputCol". The fact that here the prefix is null suggests that the inputCol (and presumably outputCol) Params are created inside the UnaryTransformer parent class before uid has been set. At that point each Param records that it belongs to a transformer with uid "null". Once the PatternNormalizerTransformer finishes constructing and gets its actual uid, that uid no longer matches the "null" value recorded on the Param, and the ownership check fails with the error above.
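That is consistent with Java's initialization order: a superclass constructor runs before the subclass's instance field initializers, so an overridden uid() called during parent construction still sees null. A stripped-down sketch of the mechanism, with hypothetical Parent/Child classes standing in for UnaryTransformer and the concrete transformer:

class Parent {
    final String paramOwner;
    Parent() {
        // Runs before Child's field initializers: the overridden uid() returns
        // null here, just as inputCol/outputCol capture "null" as their owner.
        paramOwner = String.valueOf(uid());
    }
    String uid() { return null; }
}

class Child extends Parent {
    // Assigned only after Parent's constructor has already finished.
    private final String uid = "Child_a2b7256a4abb";
    @Override
    String uid() { return uid; }
}

// new Child().paramOwner is "null", while new Child().uid() is "Child_a2b7256a4abb":
// the same mismatch behind "Param null__inputCol does not belong to PatternNormalizerTransformer_a2b7256a4abb".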

So I'm currently stuck. My hope is that there is a way to control how MLeap names its stage directories during serialization, i.e. to prepend stage index numbers to the uid. Is this possible? Alternatively, are there other solutions to this issue?


* I thought of creating a transformer with multiple input/output columns, to avoid needing more than one instance of a particular transformer in a pipeline, but because these transformers often take different sets of parameters (for example, the PatternNormalizerTransformer takes a regex pattern parameter and a case-sensitivity boolean parameter), each parameter set would still require its own transformer instance.

lucagiovagnoli commented 4 years ago

Hi @gillbuchanan, thanks for opening this issue.

  1. It'd be useful if you could provide some links in your description so that I can quickly reference the original code (I know there's a UnaryTransformer in Spark, but it'd be great to link to the exact lines you're referencing here).
  2. I think I more or less understand what is going on. Intuitively, the uid, being a unique ID, should definitely not be shared among different objects, so your first instinct to make it non-static was right. To understand the exceptions you're facing I'd need more references to the code, but I'm confident they can be solved. My suggestion is to look at one of the official implementations of UnaryTransformer in apache/spark, such as this one: https://github.com/apache/spark/blob/76e5294bb65f991988b73c8e6541b4b06d095127/mllib/src/main/scala/org/apache/spark/ml/feature/PolynomialExpansion.scala#L41 As you can see, it overrides the uid at construction time via override val uid. You might also want to implement setters like this one: https://github.com/apache/spark/blob/76e5294bb65f991988b73c8e6541b4b06d095127/mllib/src/main/scala/org/apache/spark/ml/feature/PolynomialExpansion.scala#L65 so that the parent class setter isn't called. Let me know how that goes. (See the sketch after this list for one way to adapt the pattern to Java.)
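Since override val uid is a Scala constructor-parameter trick with no direct Java equivalent, one way to approximate it in Java is to initialize the uid lazily inside an overridden uid(), so that even the very first call, made while the parent class builds its Params, already returns the final per-instance value. A rough, untested sketch under that assumption, using a simplified String-to-String transformer:

import org.apache.spark.ml.UnaryTransformer;
import org.apache.spark.ml.util.Identifiable$;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

public class PatternNormalizerTransformer
        extends UnaryTransformer<String, String, PatternNormalizerTransformer> {

    // Per-instance uid, initialized lazily via uid() rather than here, because
    // the parent class reads uid() before this field's initializer would run.
    private String uid;

    @Override
    public String uid() {
        if (uid == null) {
            uid = Identifiable$.MODULE$.randomUID("PatternNormalizerTransformer");
        }
        return uid;
    }

    @Override
    public scala.Function1<String, String> createTransformFunc() {
        return new scala.runtime.AbstractFunction1<String, String>() {
            @Override
            public String apply(String value) {
                return value; // placeholder for the actual normalization logic
            }
        };
    }

    @Override
    public DataType outputDataType() {
        return DataTypes.StringType;
    }
}

With this pattern, uid() is already non-null when HasInputCol/HasOutputCol create their Params, so each Param records the right owner, and each instance gets a distinct uid, which should also remove the "file already exists" collisions on the MLeap side.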
ancasarb commented 3 years ago

Hey @gillbuchanan,

Thanks for the question. It is indeed true, as Luca said, that the uid for transformers needs to be unique, or else you run into serialization issues, as you've discovered.

Can I ask a follow-up question: at what stage do you run into this issue? Is there perhaps a quick, straightforward example that reproduces the issue you're having?

> I tried just changing the uid field to not be static, so that each instance gets its own uid (as it should be anyway), but this causes problems when trying to set Params using transformer.set(myParam, value):

Thanks, Anca