OryxProject / oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
http://oryx.io
Apache License 2.0

Shading of PMML breaks apps that extend MLUpdate #336

Open srowen opened 7 years ago

srowen commented 7 years ago

Quoting from the mailing list:

    I ran into a problem when running the batch layer.
    I wrote a batch layer, LRScalaUpdate, in Scala. It extends MLUpdate and overrides the buildModel() and evaluate() methods. I get an exception when running the batch layer.
    I'm wondering why it calls MLUpdate.buildModel instead of my LRScalaUpdate.buildModel.
    Can you give me some suggestions? Thank you.

17/07/12 14:55:06 INFO cluster.YarnClusterScheduler: Removed TaskSet 7.0, whose tasks have all completed, from pool 
17/07/12 14:55:06 INFO scheduler.DAGScheduler: ResultStage 7 (isEmpty at MLUpdate.java:360) finished in 0.093 s
17/07/12 14:55:06 INFO scheduler.DAGScheduler: Job 7 finished: isEmpty at MLUpdate.java:360, took 0.109474 s
Exception in thread "streaming-job-executor-0" java.lang.AbstractMethodError: com.cloudera.oryx.ml.MLUpdate.buildModel(Lorg/apache/spark/api/java/JavaSparkContext;Lorg/apache/spark/api/java/JavaRDD;Ljava/util/List;Lorg/apache/hadoop/fs/Path;)Loryx/org/dmg/pmml/PMML;
    at com.cloudera.oryx.ml.MLUpdate.buildAndEval(MLUpdate.java:314)
    at com.cloudera.oryx.ml.MLUpdate.lambda$findBestCandidatePath$0(MLUpdate.java:259)
    at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
    at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
...

Oryx shades the PMML classes it uses to avoid a classpath conflict with Spark. That's fine as long as the usage is internal to Oryx.

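Except for one key thing I overlooked: MLUpdate actually forms a sort of API outside of the api package, and it uses one PMML class in its signature. After shading, MLUpdate.buildModel returns oryx.org.dmg.pmml.PMML (see the descriptor in the error above), so a subclass compiled against the unshaded org.dmg.pmml.PMML declares a method with a different signature and doesn't actually override it. Hence the AbstractMethodError.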
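
To make the mismatch concrete, here is a minimal sketch that works only on JVM method descriptor strings. The class and helper names (`DescriptorMismatch`, `relocate`, `oryxRelocations`) are hypothetical, and the plain string replacement is a simplified stand-in for what relocation does to bytecode; it doesn't load any Oryx, Spark or PMML class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: works on descriptor strings, not on real classes.
class DescriptorMismatch {

    // Descriptor of buildModel as a client compiled against unshaded PMML would declare it.
    static final String CLIENT_DESCRIPTOR =
        "(Lorg/apache/spark/api/java/JavaSparkContext;"
        + "Lorg/apache/spark/api/java/JavaRDD;"
        + "Ljava/util/List;"
        + "Lorg/apache/hadoop/fs/Path;)"
        + "Lorg/dmg/pmml/PMML;";

    // Simplified stand-in for package relocation: rewrites internal class-name prefixes.
    static String relocate(String descriptor, Map<String, String> relocations) {
        String result = descriptor;
        for (Map.Entry<String, String> e : relocations.entrySet()) {
            result = result.replace("L" + e.getKey() + "/", "L" + e.getValue() + "/");
        }
        return result;
    }

    // The two package relocations discussed in this issue.
    static Map<String, String> oryxRelocations() {
        Map<String, String> relocations = new LinkedHashMap<>();
        relocations.put("org/jpmml", "oryx/org/jpmml");
        relocations.put("org/dmg", "oryx/org/dmg");
        return relocations;
    }

    public static void main(String[] args) {
        String shaded = relocate(CLIENT_DESCRIPTOR, oryxRelocations());
        System.out.println("client: " + CLIENT_DESCRIPTOR);
        System.out.println("shaded: " + shaded);
        // The JVM matches overrides by name plus descriptor, so these must be equal.
        System.out.println("same method? " + CLIENT_DESCRIPTOR.equals(shaded));
    }
}
```

The relocated descriptor ends in `Loryx/org/dmg/pmml/PMML;`, as in the stack trace, while the client's ends in `Lorg/dmg/pmml/PMML;`, so the last line prints `same method? false`.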

srowen commented 7 years ago

Currently, this can be worked around by shading JPMML in the same way in the client app:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.0.0</version>
    <executions>
        <execution>
            <id>shade</id>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <artifactSet>
                    <includes>
                        <include>your.group:*</include>
                    </includes>
                </artifactSet>
                <relocations>
                    <relocation>
                        <pattern>org.jpmml</pattern>
                        <shadedPattern>oryx.org.jpmml</shadedPattern>
                        <includes>
                            <include>org.jpmml.**</include>
                        </includes>
                    </relocation>
                    <relocation>
                        <pattern>org.dmg</pattern>
                        <shadedPattern>oryx.org.dmg</shadedPattern>
                        <includes>
                            <include>org.dmg.**</include>
                        </includes>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>

srowen commented 7 years ago

Every solution I can think of ends up requiring an API change to MLUpdate. It isn't a formal API, but it is something people might want to extend. Then again, anyone who tries to extend it today will find that it doesn't work anyway.

The modified API would be a little clunky, making people pass Strings instead of PMML objects.
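
A rough sketch of what a String-based variant could look like, to show the clunkiness. Everything here is hypothetical: `StringBasedUpdate`, `buildAndCheck` and `ConstantModelUpdate` are simplified stand-ins, not the real MLUpdate signature, and the returned XML is only a PMML-like placeholder.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: simplified stand-ins, not the real MLUpdate API.
class StringApiSketch {

    abstract static class StringBasedUpdate {
        // Returns the model as serialized PMML XML, so no PMML class appears in the signature.
        abstract String buildModel(List<String> trainData, List<?> hyperParameters);

        // Framework side: a real implementation would parse the XML with its own
        // (shaded) PMML classes; this sketch only checks that something PMML-like came back.
        final String buildAndCheck(List<String> trainData, List<?> hyperParameters) {
            String pmmlXml = buildModel(trainData, hyperParameters);
            if (pmmlXml == null || !pmmlXml.contains("<PMML")) {
                throw new IllegalStateException("buildModel must return a PMML document");
            }
            return pmmlXml;
        }
    }

    // Example client subclass; it never references a PMML type in its signature.
    static class ConstantModelUpdate extends StringBasedUpdate {
        @Override
        String buildModel(List<String> trainData, List<?> hyperParameters) {
            // Placeholder document, not a valid PMML model.
            return "<PMML><Header description=\"rows=" + trainData.size() + "\"/></PMML>";
        }
    }

    public static void main(String[] args) {
        StringBasedUpdate update = new ConstantModelUpdate();
        System.out.println(update.buildAndCheck(Arrays.asList("a,1", "b,2"), Collections.emptyList()));
    }
}
```

Because only `String` crosses the boundary, shading on either side can't change the method signature. The cost is that clients have to serialize the model themselves and lose type checking on it.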

The workaround above isn't that bad and can be documented in the example project. Also, Spark 2.3 will shade its JPMML dependency, which would let Oryx un-shade its own, perhaps in Oryx 2.6.

For now I favor just documenting the workaround and removing all the shading later. That is probably the least disruptive change.