Do the operators really use a Spark job to execute the scoring? I thought the operator just executed the model code directly.
Currently, each operator initializes a SparkContext, which makes it behave as a Spark "driver". Hence, there is one Spark job for every operator, under which scoring occurs. By default, the job runs locally in a single thread, but you can optionally pass a Spark master URL to run it on a cluster.
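For illustration, a minimal sketch of what that operator-side initialization might look like (the names and structure here are my own, not the toolkit's actual code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper mirroring the behaviour described above: run locally in a
// single thread unless a Spark master URL (e.g. "spark://host:7077") is supplied.
object ScoringContext {
  def create(master: Option[String]): SparkContext = {
    val conf = new SparkConf()
      .setAppName("streamsx-sparkmllib-operator")
      .setMaster(master.getOrElse("local")) // "local" = single-threaded execution in the driver
    new SparkContext(conf)
  }
}
```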
@ankitpas is probably best placed to say, but I don't think any execution is sent over to the remote cluster even if a remote SparkContext is used. The SparkContext is just used to set up the objects; the scoring executes directly against the model. If the tuple were scored using a Spark job, wouldn't an RDD have to be created?
I think you are proposing the same concept, so I'm not sure we need two sets of operators, but maybe there's a different implementation approach which could be taken.
Actually, some models, like the collaborative filtering one, store their internal state as RDDs, so when we ask for scoring, the action runs against these internal state RDDs. In most cases, RDDs are also created during the model load process. If you pass a remote Spark master, the operation may be performed on one or more remote worker machines known to the master node, and the results are passed back to the driver (which in our case is the operator).
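As a rough illustration of that point, take the collaborative-filtering case: loading a MatrixFactorizationModel materializes its feature matrices as RDDs, and each predict() call is an action against them, so with a remote master the work can run on the cluster's workers (the model path below is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

val sc = new SparkContext(
  new SparkConf().setAppName("cf-scoring").setMaster("local")) // or a remote spark:// master

// Loading the model creates RDDs for its internal state (userFeatures, productFeatures).
val model = MatrixFactorizationModel.load(sc, "/path/to/saved/cf-model")

// Scoring runs as an action against those internal RDDs; with a remote master,
// the computation happens on the workers and the result comes back to the driver.
val score: Double = model.predict(42, 17) // (userId, productId)
```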
@apyasic Can you expand on what you mean by the model being executed by a Streams operator? Are you proposing that we do our own implementation for each of the Spark MLlib algorithms? Would this use the Spark code, or would we be creating our own version which would have to be maintained in sync with the Spark code?
No further demand seen.
The toolkit looks great. Great Job, @ankitpas!
The approach proposed by @ankitpas, where the Streams operator communicates with a corresponding Spark job, is very interesting and absolutely makes sense. As an extension to the existing toolkit, we wanted to propose an alternative approach, similar to the one in the SPSS toolkit, where a model is "published" by Spark to Streams and used by the corresponding Streams operator for real-time data scoring. The advantage of this approach is that it doesn't require a Spark runtime to be installed on, or accessible from, the Streams machine. We already have an initial prototype working, and we wanted to discuss with you different options on how to integrate it with the existing implementation: