IBMStreams / streamsx.sparkMLLib

Toolkit for real-time scoring using Apache Spark MLLib library
http://ibmstreams.github.io/streamsx.sparkMLLib
Apache License 2.0

proposal for extension #2

Closed apyasic closed 4 years ago

apyasic commented 9 years ago

The toolkit looks great. Great Job, @ankitpas!

The approach proposed by @ankitpas, in which a Streams operator communicates with a corresponding Spark job, is very interesting and absolutely makes sense. As an extension to the existing toolkit, we wanted to propose an alternative approach, similar to the one in the SPSS toolkit, in which a model is "published" by Spark to Streams and then used by a corresponding Streams operator for real-time data scoring. The advantage of this approach is that it doesn't require a Spark runtime to be installed on, or accessible from, the Streams machine. We already have an initial prototype working and wanted to discuss with you different options for integrating it with the existing implementation:

  1. Extend the existing operators to support two model execution modes: spark and streams. The first mode is the one implemented by @ankitpas, in which the model is executed by a Spark job. The second mode is the one we're about to contribute, in which the model is executed directly by the corresponding Streams operator. The advantage of this approach is that a toolkit user would have the same set of powerful operators and could easily switch between the two modes (see the sketch after this list). The disadvantage is that the operator implementation would become much more complex.
  2. Add new operator(s) that run the Spark model. We would then have to decide which namespace to use for this group of operators; probably something like com.ibm.streamsx.sparkmllib.scoring?
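
A purely illustrative sketch of option 1, assuming a hypothetical executionMode operator parameter; none of these class, enum, or method names exist in the toolkit today:

```java
// Hypothetical single-operator design offering both proposed execution modes.
public class ScoringOperatorSketch {
    public enum ExecutionMode { SPARK, STREAMS }

    // Assumed to be set from an operator parameter.
    private ExecutionMode mode = ExecutionMode.SPARK;

    public double score(double[] features) {
        if (mode == ExecutionMode.SPARK) {
            // Current behaviour: score through the SparkContext-backed model.
            return scoreViaSparkModel(features);
        }
        // Proposed addition: score a model "published" to Streams,
        // with no Spark runtime required on the Streams host.
        return scoreLocally(features);
    }

    private double scoreViaSparkModel(double[] features) { return 0.0; /* placeholder */ }
    private double scoreLocally(double[] features) { return 0.0; /* placeholder */ }
}
```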
ddebrunner commented 9 years ago

Do the operators really use a Spark job to execute the scoring? I thought the operator just executed the model code directly.

ankitpas commented 9 years ago

Currently, each operator initializes a SparkContext, which makes it behave as a Spark "driver". Hence, there is one Spark job per operator, under which the scoring occurs. By default, the job runs locally in a single thread, but you can optionally pass a Spark master location to run it on a cluster.
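
For reference, a minimal sketch (not the toolkit's actual code) of how an operator acting as a Spark driver could set up its context, assuming a hypothetical sparkMaster operator parameter:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkContextSetup {
    public static JavaSparkContext createContext(String sparkMaster) {
        // Default to a single local thread unless a master URL is given,
        // e.g. "spark://host:7077" for a remote cluster.
        String master = (sparkMaster != null) ? sparkMaster : "local";
        SparkConf conf = new SparkConf()
                .setAppName("sparkMLLibScoringOperator")
                .setMaster(master);
        return new JavaSparkContext(conf);
    }
}
```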

ddebrunner commented 9 years ago

@ankitpas is probably best placed to say, but I don't think any execution is sent over to the remote cluster even if a remote SparkContext is used. The SparkContext is just used to set up the objects, and the scoring executes directly against the model. If the tuple were scored using a Spark job, wouldn't an RDD have to be created?

I think you are proposing the same concept, so I'm not sure we need two sets of operators, but maybe there's a different implementation approach which could be taken.

ankitpas commented 9 years ago

Actually, some models, such as the collaborative filtering one, store their internal state as RDDs, so when we ask for scoring, the action runs against these internal-state RDDs. In most cases, RDDs are also created during the model load process. If you pass a remote Spark master, the operation may be performed on one or more remote worker machines known to the master node, and the results are passed back to the driver (which in our case is the operator).
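
As a hedged illustration of the collaborative filtering case: MatrixFactorizationModel keeps its factor matrices as RDDs, so both loading and prediction touch RDDs. The model path and IDs below are placeholders, and this is not the toolkit's operator code:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;

public class AlsScoringSketch {
    public static double score(JavaSparkContext jsc, String modelPath,
                               int userId, int productId) {
        // Loading restores the saved factor RDDs (userFeatures, productFeatures).
        MatrixFactorizationModel model =
                MatrixFactorizationModel.load(jsc.sc(), modelPath);
        // predict() runs against those internal RDDs; with a remote master the
        // work may execute on worker nodes, with the result returned to the driver.
        return model.predict(userId, productId);
    }
}
```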

mikespicer commented 9 years ago

@apyasic Can you expand on what you mean by the model being executed by a Streams operator? Are you proposing that we do our own implementation of each of the Spark MLlib algorithms? Would this use the Spark code, or would we be creating our own version that would have to be maintained in sync with the Spark code?

joergboe commented 4 years ago

No further demand seen.