BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0

Refactor `do_process` and `task` #1457

Open wmalpica opened 3 years ago

wmalpica commented 3 years ago

Current state

The do_process function interface currently takes std::shared_ptr<ral::cache::CacheMachine> output as an input, and a task additionally takes a std::shared_ptr<ral::cache::CacheMachine> output in its constructor. Originally, the idea was that a task would know by itself what to do with an output, so that this logic would not necessarily have to live inside the do_process function of every kernel. But in several places this idea is not followed, either because it can't be followed or because other APIs would not permit it. For example:

What we want to do

The main goal of this feature request is to better standardize the pattern for how task outputs are handled and where they go.

In particular, we should make the do_process function return the outputs of the task. The task's run function, or the process function in the kernel interface class, would then take those outputs and put them in a CacheMachine.

Currently a task returns a status and, in the case of a failure, optionally some data. This should change so that it can always return the output data. Some do_process functions will need to return a CacheData, while others would rather return a BlazingTable. We have two options: either they all return a CacheData, in which case the ones that produce a BlazingTable would first place it in a CacheData, or the task_result can contain either type and knows what to do with each. Additionally, the task_result has to be able to return more than one output, to support kernels that have more than one output CacheMachine. And if a task returns more than one output, it needs a way to identify what to do with each of them.
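To make the second option concrete, here is a minimal sketch of a task_result that can carry either kind of payload and more than one output. Everything in it is an assumption made for illustration: the BlazingTable, CacheData and CacheMachine types are simplified stand-ins rather than the real ral:: classes, and task_output, cache_index, to_cache_data and route_outputs are hypothetical names that do not exist in the codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Simplified stand-ins for ral::frame::BlazingTable and ral::cache::CacheData.
struct BlazingTable {
    std::size_t num_rows;
    explicit BlazingTable(std::size_t rows) : num_rows(rows) {}
};
struct CacheData {
    std::size_t num_rows;
    explicit CacheData(std::size_t rows) : num_rows(rows) {}
};

// Stand-in for ral::cache::CacheMachine: it only stores CacheData.
class CacheMachine {
public:
    void add(std::unique_ptr<CacheData> data) { items_.push_back(std::move(data)); }
    std::size_t size() const { return items_.size(); }
private:
    std::vector<std::unique_ptr<CacheData>> items_;
};

enum class task_status { SUCCESS, RETRY, FAIL };

// Hypothetical: one output plus the index of the CacheMachine it belongs to.
struct task_output {
    std::size_t cache_index;
    std::variant<std::unique_ptr<BlazingTable>, std::unique_ptr<CacheData>> payload;
};

// Hypothetical: the result always carries outputs, not only failure data.
struct task_result {
    task_status status;
    std::string what;
    std::vector<task_output> outputs;
};

// A BlazingTable is wrapped in a CacheData; a CacheData passes through as is.
inline std::unique_ptr<CacheData> to_cache_data(std::unique_ptr<BlazingTable> table) {
    return std::make_unique<CacheData>(table->num_rows);
}
inline std::unique_ptr<CacheData> to_cache_data(std::unique_ptr<CacheData> data) {
    return data;
}

// What the task's run function could do with the result of do_process.
inline void route_outputs(task_result result,
                          std::vector<std::shared_ptr<CacheMachine>>& caches) {
    if (result.status != task_status::SUCCESS) return;  // retry/failure handled elsewhere
    for (auto& out : result.outputs) {
        assert(out.cache_index < caches.size());
        auto data = std::visit(
            [](auto& payload) { return to_cache_data(std::move(payload)); },
            out.payload);
        caches[out.cache_index]->add(std::move(data));
    }
}
```

With this shape, a kernel with two output CacheMachines would return two task_output entries with cache_index 0 and 1, and do_process itself would never touch a CacheMachine. The first option (always return CacheData) would simply drop the variant and move the to_cache_data call into the kernels.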

Why

felipeblazing commented 3 years ago

If we do this, we should separate the logic of the Cache, which holds CacheData, from the part of the code that actually moves things between cache layers. That would let us isolate the holding of information, which is local to the cache, from the decision of where that information is going to reside, which is query-wide.
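As a rough illustration of that split, the sketch below keeps a container that only holds data apart from a query-wide object that decides which layer the data should live in. All names here (CacheTier, CacheStore, PlacementPolicy, place_and_store) and the byte-budget rule are hypothetical, chosen only to show the separation; they are not existing BlazingSQL classes or the real placement logic.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>

enum class CacheTier { GPU, CPU, DISK };

// Simplified stand-in for ral::cache::CacheData.
struct CacheData {
    std::size_t size_bytes;
    CacheTier tier = CacheTier::GPU;
    explicit CacheData(std::size_t bytes) : size_bytes(bytes) {}
};

// Local concern: holds CacheData and hands it back in FIFO order.
// It knows nothing about memory budgets or other caches.
class CacheStore {
public:
    void put(std::unique_ptr<CacheData> data) { queue_.push_back(std::move(data)); }
    std::unique_ptr<CacheData> pull() {
        if (queue_.empty()) return nullptr;
        auto data = std::move(queue_.front());
        queue_.pop_front();
        return data;
    }
private:
    std::deque<std::unique_ptr<CacheData>> queue_;
};

// Query-wide concern: one instance is shared by every cache in the query
// and decides where new data should reside (assumed rule: fill GPU, then CPU).
class PlacementPolicy {
public:
    PlacementPolicy(std::size_t gpu_budget, std::size_t cpu_budget)
        : gpu_budget_(gpu_budget), cpu_budget_(cpu_budget) {}

    CacheTier choose(std::size_t size_bytes) {
        if (gpu_used_ + size_bytes <= gpu_budget_) {
            gpu_used_ += size_bytes;
            return CacheTier::GPU;
        }
        if (cpu_used_ + size_bytes <= cpu_budget_) {
            cpu_used_ += size_bytes;
            return CacheTier::CPU;
        }
        return CacheTier::DISK;
    }
private:
    std::size_t gpu_budget_, cpu_budget_;
    std::size_t gpu_used_ = 0, cpu_used_ = 0;
};

// The mover: asks the policy where the data goes, then stores it.
// A real implementation would copy or spill the buffers at this point.
inline void place_and_store(std::unique_ptr<CacheData> data,
                            PlacementPolicy& policy, CacheStore& store) {
    data->tier = policy.choose(data->size_bytes);
    store.put(std::move(data));
}
```

Because the policy is shared, two different CacheStore instances compete for the same budgets, which is the query-wide view the comment above is asking for.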

felipeblazing commented 3 years ago

By the way, process isn't a place that currently supports doing this, because there is one process function for ALL kernels. Perhaps we should define some kind of function that handles this on every kernel, something like a kernel do_output function.
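One way such a hook could look is sketched below: process stays shared by all kernels, while do_output is a virtual function with a default that a multi-output kernel can override. This is only a sketch under assumptions: the kernel base class, CacheData and CacheMachine are heavily simplified stand-ins, the do_process signature is not the real one, and passthrough_kernel and split_kernel are made-up kernels for the example.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-ins for ral::cache::CacheData and ral::cache::CacheMachine.
struct CacheData {
    std::size_t num_rows;
    explicit CacheData(std::size_t rows) : num_rows(rows) {}
};
class CacheMachine {
public:
    void add(std::unique_ptr<CacheData> data) { items_.push_back(std::move(data)); }
    std::size_t size() const { return items_.size(); }
private:
    std::vector<std::unique_ptr<CacheData>> items_;
};

using data_list = std::vector<std::unique_ptr<CacheData>>;

class kernel {
public:
    explicit kernel(std::vector<std::shared_ptr<CacheMachine>> output_caches)
        : output_caches_(std::move(output_caches)) {}
    virtual ~kernel() = default;

    // Shared by all kernels: compute, then let the kernel place its outputs.
    void process(data_list inputs) { do_output(do_process(std::move(inputs))); }

protected:
    // Pure computation: returns outputs instead of writing to a CacheMachine.
    virtual data_list do_process(data_list inputs) = 0;

    // Hypothetical hook. Default: everything goes to the first output cache.
    virtual void do_output(data_list outputs) {
        for (auto& out : outputs) output_caches_.at(0)->add(std::move(out));
    }

    std::vector<std::shared_ptr<CacheMachine>> output_caches_;
};

// Single-output kernel: the default do_output is enough.
class passthrough_kernel : public kernel {
public:
    using kernel::kernel;
protected:
    data_list do_process(data_list inputs) override { return inputs; }
};

// Multi-output kernel: overrides do_output to spread results round-robin.
// Assumes at least one output cache was supplied.
class split_kernel : public kernel {
public:
    using kernel::kernel;
protected:
    data_list do_process(data_list inputs) override { return inputs; }
    void do_output(data_list outputs) override {
        for (std::size_t i = 0; i < outputs.size(); ++i) {
            output_caches_.at(i % output_caches_.size())->add(std::move(outputs[i]));
        }
    }
};
```

The single process function does not need to know how many output caches a kernel has; only kernels with unusual routing pay the cost of overriding do_output.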