Refactor `do_process` and `task`

wmalpica commented 3 years ago

Current state

The do_process function interface currently takes std::shared_ptr<ral::cache::CacheMachine> output as a parameter, and a task also takes std::shared_ptr<ral::cache::CacheMachine> output in its constructor. Originally, the idea was that a task would know by itself what to do with an output, so that this logic would not have to live inside the do_process function of every kernel. But in several places this idea is not followed, either because it can't be or because other APIs do not permit it. For example:

Kernels that have more than one output. JoinPartitionKernel, for example, has two outputs and therefore two output CacheMachines, so it cannot just use the single output provided to the task or to the do_process function to do the right thing with each result.
distribution_kernels use functions like scatter and broadcast, which also take an output CacheMachine. Right now these functions use the output cache defined by the kernel, which is not necessarily the one passed into the do_process function.

What we want to do

This feature request is primarily about standardizing how task outputs are handled and what is done with them.

In particular, the do_process function should return the outputs of the task, and then the task run function or the process function in the kernel interface class would take those outputs and put them in a CacheMachine.
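A minimal sketch of that split, under stated assumptions: ToyCacheData, ToyCacheMachine, put, pull, and concat_kernel are hypothetical stand-ins invented for illustration and are not the real ral::cache API. The point is only that do_process takes no output cache and returns its result, while the shared process function decides where the result goes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for ral::cache::CacheData (the real type is much richer).
struct ToyCacheData {
    std::size_t num_rows;
};

// Hypothetical stand-in for ral::cache::CacheMachine: a simple FIFO holder.
class ToyCacheMachine {
public:
    void put(std::unique_ptr<ToyCacheData> data) {
        assert(data != nullptr);
        items_.push_back(std::move(data));
    }
    std::unique_ptr<ToyCacheData> pull() {
        assert(!items_.empty());
        auto front = std::move(items_.front());
        items_.erase(items_.begin());
        return front;
    }
    std::size_t size() const { return items_.size(); }

private:
    std::vector<std::unique_ptr<ToyCacheData>> items_;
};

class kernel {
public:
    explicit kernel(std::shared_ptr<ToyCacheMachine> output) : output_(std::move(output)) {}
    virtual ~kernel() = default;

    // Shared driver: runs the kernel-specific work, then stores what it returned.
    void process(std::vector<std::unique_ptr<ToyCacheData>> inputs) {
        std::unique_ptr<ToyCacheData> result = do_process(std::move(inputs));
        output_->put(std::move(result));
    }

protected:
    // Pure processing: no output cache parameter, the result is returned.
    virtual std::unique_ptr<ToyCacheData> do_process(
        std::vector<std::unique_ptr<ToyCacheData>> inputs) = 0;

private:
    std::shared_ptr<ToyCacheMachine> output_;
};

// Toy kernel that "concatenates" its inputs by summing their row counts.
class concat_kernel : public kernel {
public:
    using kernel::kernel;

protected:
    std::unique_ptr<ToyCacheData> do_process(
        std::vector<std::unique_ptr<ToyCacheData>> inputs) override {
        auto out = std::make_unique<ToyCacheData>(ToyCacheData{0});
        for (const auto& in : inputs) out->num_rows += in->num_rows;
        return out;
    }
};
```

With this shape, concat_kernel never sees a cache. Calling process on it with inputs of 3 and 4 rows leaves a single 7-row entry in the output cache.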

Currently a task returns a status and, in the case of a failure, optionally some data. This should change so that it can always return the output data. Some do_process functions will need to return a CacheData, while others would rather return a BlazingTable. Either they all return a CacheData, in which case the ones that produce a BlazingTable would first place it in a CacheData, or the task_result must be able to contain either type and know what to do with it. Additionally, the task_result would have to be able to return more than one output to support kernels that have more than one output CacheMachine. When more than one output is returned, each output needs to be identifiable so that the caller knows what to do with it.
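One possible shape for such a task_result, sketched under assumptions: it requires C++17 for std::variant, and ToyTable, ToyCacheData, task_output, cache_id, to_cache_data, and store_outputs are all hypothetical names, not existing BlazingSQL types. The status values are likewise illustrative. It shows the second option above (a result that can hold either type), with each output tagged by the cache it is meant for.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins for ral::frame::BlazingTable and ral::cache::CacheData.
struct ToyTable { std::size_t num_rows; };
struct ToyCacheData { std::size_t num_rows; };

enum class task_status { SUCCESS, RETRY, FAIL };

// An output is either already a CacheData or still a raw table.
using task_payload =
    std::variant<std::unique_ptr<ToyCacheData>, std::unique_ptr<ToyTable>>;

struct task_output {
    std::string cache_id;  // identifies which output cache this payload is for
    task_payload payload;
};

struct task_result {
    task_status status = task_status::SUCCESS;
    std::string what;                  // error description on failure
    std::vector<task_output> outputs;  // zero, one, or many outputs
};

// Normalizes either payload kind to CacheData, wrapping tables when needed.
std::unique_ptr<ToyCacheData> to_cache_data(task_payload payload) {
    if (auto* table = std::get_if<std::unique_ptr<ToyTable>>(&payload)) {
        assert(*table != nullptr);
        return std::make_unique<ToyCacheData>(ToyCacheData{(*table)->num_rows});
    }
    return std::move(std::get<std::unique_ptr<ToyCacheData>>(payload));
}

using ToyCache = std::vector<std::unique_ptr<ToyCacheData>>;

// Caller-side routing: the task only returns data, this decides where it goes.
void store_outputs(task_result result, std::map<std::string, ToyCache>& caches) {
    if (result.status != task_status::SUCCESS) return;
    for (auto& out : result.outputs) {
        caches[out.cache_id].push_back(to_cache_data(std::move(out.payload)));
    }
}
```

A kernel with two outputs would push two task_output entries with different cache_id values. The alternative option, where everything returns CacheData, would drop the variant and keep only the cache_id tag.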

Why

These changes would bring the architecture closer to what was intended, which is to have a task simply do processing and nothing else related to caching.
It would allow us in the future to do things with task outputs other than putting them into an output cache. For example, chaining tasks, where the output of one task becomes the input of another without a CacheMachine in the middle.
It would also let us compare the size of the inputs against the size of the outputs, for logging or estimation purposes.
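The last two points can be sketched together. This is a toy illustration only: toy_table, task_fn, size_record, run_chain, keep_even, and double_values are hypothetical names, and a table is modeled as a plain vector of integers. Because each task returns its output, the caller can feed it straight into the next task and record input and output sizes along the way.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using toy_table = std::vector<int>;                   // toy model of a table
using task_fn = std::function<toy_table(toy_table)>;  // a task that returns its output

struct size_record {
    std::size_t input_rows;
    std::size_t output_rows;
};

// Runs tasks back to back with no cache in between, logging sizes per task.
toy_table run_chain(toy_table data,
                    const std::vector<task_fn>& tasks,
                    std::vector<size_record>& log) {
    for (const auto& task : tasks) {
        assert(static_cast<bool>(task));  // reject empty std::function objects
        const std::size_t input_rows = data.size();
        data = task(std::move(data));
        log.push_back({input_rows, data.size()});
    }
    return data;
}

// Example task: a filter that shrinks its input.
toy_table keep_even(toy_table in) {
    toy_table out;
    for (int v : in) {
        if (v % 2 == 0) out.push_back(v);
    }
    return out;
}

// Example task: a projection that keeps the row count unchanged.
toy_table double_values(toy_table in) {
    for (int& v : in) v *= 2;
    return in;
}
```

Chaining keep_even and double_values over four rows yields one log entry per task, which is the kind of input versus output size data that could feed logging or estimation.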

felipeblazing commented 3 years ago

If we do this, we should separate the logic of the Cache, which holds CacheData, from the part of the code that actually moves things between cache layers. That would let us isolate the holding of information, which is local to the cache, from the decision about where that information is going to reside, which is query-wide.

felipeblazing commented 3 years ago

By the way, process isn't a place that currently supports doing this, because there is one process function for ALL kernels. Perhaps we should define some kind of function that handles this on every kernel, something like a kernel do_output function.
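A rough sketch of what such a hook could look like, with every name assumed: named_output, finish_task, cached_count, and join_partition_like_kernel are hypothetical, and caches are modeled as plain vectors. The shared code path stays identical for all kernels, and only do_output is overridden where a kernel has more than one output.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ral::cache::CacheData.
struct ToyCacheData { std::size_t num_rows; };
using ToyCache = std::vector<std::unique_ptr<ToyCacheData>>;

// One task output plus a tag naming the cache it is meant for.
struct named_output {
    std::string cache_id;
    std::unique_ptr<ToyCacheData> data;
};

class kernel {
public:
    virtual ~kernel() = default;

    // Stands in for the tail of the single process function shared by all
    // kernels: it hands every returned output to the per-kernel hook.
    void finish_task(std::vector<named_output> results) {
        for (auto& r : results) do_output(std::move(r));
    }

    std::size_t cached_count(const std::string& id) const {
        auto it = caches_.find(id);
        return it == caches_.end() ? 0 : it->second.size();
    }

protected:
    // Default hook: single-output kernels ignore the tag and use one cache.
    virtual void do_output(named_output out) {
        assert(out.data != nullptr);
        caches_["output"].push_back(std::move(out.data));
    }

    std::map<std::string, ToyCache> caches_;
};

// A kernel with two outputs overrides only the hook, not the shared driver.
class join_partition_like_kernel : public kernel {
protected:
    void do_output(named_output out) override {
        assert(out.cache_id == "output_a" || out.cache_id == "output_b");
        caches_[out.cache_id].push_back(std::move(out.data));
    }
};
```

Keeping the routing in a virtual hook means the one process function does not need to know how many output caches a given kernel has.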

BlazingDB / blazingsql

Refactor `do_process` and `task` #1457