climsoft / climsoft-web

Climsoft web application

Choosing an ETL/ELT Framework for Our Modules #53

Open Patowhiz opened 3 weeks ago

Patowhiz commented 3 weeks ago

Overview

Our project includes modules that will require common Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) operations. We can either use the Arquero JavaScript library or rely on languages with powerful ETL/ELT tooling, such as SQL, Python, or R.

I propose we rely on SQL for our ETL/ELT operations for the following reasons:

  1. Platform Agnosticism: SQL is supported by numerous powerful analytical engines, ensuring our solution remains platform-agnostic.
  2. Ease of Use: SQL is easier to learn and use than more programmatic approaches. This allows more team members, including those with less coding experience, to contribute effectively (see the sketch after this list).
  3. Performance and Efficiency: Analytical engines are optimized for handling large datasets, and their cores are implemented in lower-level languages, which gives us strong performance and efficiency without extra effort on our side.
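
To make the ease-of-use point concrete, here is a minimal sketch of what a typical transform step could look like when expressed declaratively in SQL (held as a string so it can later be executed through whichever engine we choose). The table and column names (`raw_observations`, `daily_summary`, `station_id`, etc.) are hypothetical placeholders, not our actual schema:

```typescript
// Hypothetical ETL transform: roll up raw observations into daily summaries.
// All table and column names below are illustrative placeholders.
const dailySummarySql = `
  CREATE OR REPLACE TABLE daily_summary AS
  SELECT
    station_id,
    element_id,
    CAST(obs_datetime AS DATE) AS obs_date,
    AVG(value)                 AS mean_value,
    MIN(value)                 AS min_value,
    MAX(value)                 AS max_value,
    COUNT(*)                   AS n_obs
  FROM raw_observations
  GROUP BY station_id, element_id, CAST(obs_datetime AS DATE);
`;
```

The whole step is a single declarative statement, which is the kind of logic a team member can read and review without knowing the surrounding application code.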

After evaluating multiple analytical engines such as Apache Spark, Presto, and DuckDB, I recommend using DuckDB for most of our ETL/ELT operations. DuckDB is an in-process OLAP database that excels in a single-node environment, offering high performance for our current user needs. DuckDB also has well-supported bindings for Node.js, Python, and R, which means we can integrate it into our API now and, in future, into analytics and visualisation processes that may use Python or R.
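As a rough illustration of how this could plug into our Node.js API, here is a minimal sketch using the `duckdb` npm package (the exact binding we adopt is still an open choice). The CSV file name and the query are placeholders for whatever extract our modules actually produce:

```typescript
import duckdb from 'duckdb';

// In-process database: ':memory:' keeps everything in RAM; pass a file path to persist.
const db = new duckdb.Database(':memory:');

// Hypothetical extract-and-aggregate step: read a raw CSV export and summarise it in SQL.
// 'raw_observations.csv' and the column names are illustrative placeholders.
const sql = `
  SELECT station_id,
         CAST(obs_datetime AS DATE) AS obs_date,
         AVG(value) AS mean_value
  FROM read_csv_auto('raw_observations.csv')
  GROUP BY station_id, CAST(obs_datetime AS DATE)
`;

db.all(sql, (err, rows) => {
  if (err) {
    console.error('ETL step failed:', err);
    return;
  }
  console.log(`Aggregated ${rows.length} station-day rows`);
});
```

Because the logic lives in the SQL string, the same statement could later be run from Python or R bindings, or ported to another SQL engine, without rewriting the transform itself.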

In the future, if our users require analyses that can only be performed efficiently in a distributed environment, we can transition to Apache Spark, which also provides robust SQL-based ETL/ELT support in distributed settings.

Request for Comments

I invite feedback on this proposal. Your insights and suggestions will be invaluable.