Choose your execution engine (in-memory/in-database)

amphi-ai / amphi-etl

Visual Data Transformation with Python Code Generation. Low-Code Python-based ETL.

https://amphi.ai

Other

908 stars 44 forks source link

Choose your execution engine (in-memory/in-database) #15

Open simonaubertbd opened 5 months ago

simonaubertbd commented 5 months ago

Hello,

A dreamt feature would be the ability to choose the execution engine. Some dataprep/etl tools have it (like Alteryx or Dataiku). That means two possibilities :

in-memory when the tool retrieve the data on its machine and treats along the differents steps and write. -in-database where data stay in the database and the tool only send sql queries to proceed the different steps. It allows very big volume in a very small time, and does not depend of the resource of the etl machine.

This is hard to develop of course but I think this has definitely this place on the roadmap.

Best regards,

Simon

tgourdel commented 5 months ago

Thanks, definitely on my radar! I'm particularly looking at https://github.com/ibis-project/ibis to enable this.

simonaubertbd commented 2 months ago

@tgourdel From what I know there are two major philosophies: -distinguished tools between the "default" in-memory engine and the "in-database" engine (Alteryx Designer on Premise... but all in-in engine have the same tools, with a few exceptions to write code in pyspark, this kind of stuff) -same tools and a menu to choose the execution engine, either on each tool (Dataiku), either globally (Alteryx Cloud)

If the second one seems promising ("hey, it's transparent for user, he just have to switch the engine" while must of the functions just exist with the default engine), I prefer the first solution because it's more understandable for the user, easier for maintenance, allow mix of in-memory and in-database tools with "interface" between.

For reference, the dedicated in-database tools existing on Alteryx, the data stream in and out allows the bridge between the engines part.

Best regards,

Simon

simonaubertbd commented 2 months ago

@tgourdel Also : this tools are just here to build a query that is passed on the write-data in-db, the browse and the data stream out so it's not that hard to develop

E.g : this workflow

will create a table with this query

 SELECT "Couleur", COUNT(*) AS "Count" FROM
          (SELECT "Couleur", "Nom_Produit" FROM
                (SELECT * FROM 
                        (select "bd_market"."articles".* from "bd_market"."articles")
                AS "a" WHERE "Classe" IS NOT NULL)
           AS "a") 
AS "a" GROUP BY "Couleur"

As you can see, it's nested for each tool, so pretty easy to do. It may not be optimized but as a first approach, it works well.