Define Data Lake Structure

caiocampoos commented 7 months ago

In our project we want to receive data as is and be able to run many kinds of analysis on the same data. To that end, it would be interesting to approach this project as a combination of Data Lake + Analytics Dashboard.

In this issue we will discuss and define the following:

Data Lake structure
Tech to use (how to build our data lake)
Interface (how to interact with the data)
Visualization (how to visualize our data)

The focus is to define the broad scope of the project, and later go into more detail in separate issues for each topic.

caiocampoos commented 7 months ago

Data Lake References

https://aws.amazon.com/what-is/data-lake/

calilisantos commented 7 months ago

Great. Some suggestions for the bullet points:

Data Lake structure: Based on general guidance on the subject, some questions that could help define this topic are:

Topic	Decisions
Storage	1. Cloud or on-premises? 2. What are the access and governance assumptions? 3. What is the estimated capacity needed (current and expected)?
Compute	1. What data volume is expected (daily, monthly, ...)? 2. Is machine learning integration with the company's products expected (recommendations, AI-based alerts, ...)? 3. What is the team's maturity with data products and projects? 4. Is it possible to allocate a budget for the data infrastructure?

Tech to use: Some suggestions based on the points above:

Solution	Function	Open source	On-premises	Cloud
Delta Lake	Lake architecture focused on processing and storage optimization	Yes	Yes	Integrations with almost all providers
Apache Iceberg	Lake architecture focused on processing and query optimization and cloud integration	Yes	Yes	Yes
Apache Hudi	Lake structure focused on stream processing	Yes	Yes	Yes
Cloud solutions	The main providers have their own solutions and integrations with all of the above options	No	No, but with many integrations and complementary data products	Well...
Databricks	An environment for exploring the data lake, with integrations and solutions for data engineering, data science, and analytics. Applies the Delta Lake architecture. Can be expensive	No	No, but with many integrations	Yes
Snowflake	The main Databricks competitor, with some advantages for analytics solutions	No	No, but with more integrations than Databricks for this environment	Yes

Interface: Python is the reference for data. Period. All the previous solutions can be used from Python, but Java, R, and Scala are other good interfaces. All of them share frameworks like Apache Spark. The data volume in particular, and the tech infrastructure, could guide this choice.
Visualization: The previous answers could address this, but the Python package Streamlit has been maturing for this purpose, with the advantages of being open source. Other popular solutions include Qlik, Tableau, Power BI, and Looker.

caiocampoos commented 7 months ago

Cloud or on-premises?

What are the access and governance assumptions?

What is the estimated capacity needed (current and expected)?

For now storage is not really important. The idea is that, whatever the solution, we can plug any data source into it, so for development we can use a big mock file. Governance is not important for now either. As for capacity, we can think of something that ranges from low to mid. We don't need to worry about huge scale, but it would be nice to have something that can scale to bigger volumes.
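To make the "plug in any data source" idea concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration, not a decided design: the `DataSource` interface, the `MockFileSource` class, the JSON Lines mock format, and the `records.jsonl` file name are all hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Protocol


class DataSource(Protocol):
    """Hypothetical interface: anything that can yield raw records."""

    def read_records(self) -> Iterator[Dict[str, Any]]:
        ...


class MockFileSource:
    """Reads records from a JSON Lines mock file (one JSON object per line)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def read_records(self) -> Iterator[Dict[str, Any]]:
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)


def ingest_raw(source: DataSource, raw_dir: Path) -> int:
    """Copy records unchanged into a raw directory; return how many were written."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with (raw_dir / "records.jsonl").open("w", encoding="utf-8") as out:
        for record in source.read_records():
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Because `ingest_raw` only depends on `read_records()`, a database or API source could later replace the mock file without changing the ingestion code.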

caiocampoos commented 7 months ago

Compute 1. What data volume is expected (daily, monthly, ...)?

Is machine learning integration with the company's products expected (recommendations, AI-based alerts, ...)?

What is the team's maturity with data products and projects?

Is it possible to allocate a budget for the data infrastructure?

We can investigate some ML integration. Although it is not the main focus of the project, I see no issue with including something if it does not go out of scope.

There is no team; we are developing to learn, so we are assuming any seniority level.

We should always focus on self-hosted, open source, and free solutions. The main goal of this project is to produce a reproducible and free option that is as close as possible to a fully managed Data Lake and Analytics Platform. Following the principle of bring your own database, it should be able to plug in any data source at no cost.

caiocampoos commented 7 months ago

Interface: Python is the reference for data. Period. All the previous solutions can be used from Python, but Java, R, and Scala are other good interfaces. All of them share frameworks like Apache Spark. The data volume in particular, and the tech infrastructure, could guide this choice.

Visualization: The previous answers could address this, but the Python package Streamlit has been maturing for this purpose, with the advantages of being open source. Other popular solutions include Qlik, Tableau, Power BI, and Looker.

We are probably betting on Python for the short term, but we will also expose a public API so we can not only process but also consume data after processing. The idea is to have a really accessible and customizable interface for unstructured data.
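As a rough illustration of consuming data after processing, the sketch below serves already-processed records as JSON over HTTP using only the standard library. The `/datasets/<name>` route, the `PROCESSED` in-memory store, and the `make_server` helper are hypothetical names chosen for this example; the actual API shape and framework are still open.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for data that has already been processed.
PROCESSED: Dict[str, List[Dict[str, Any]]] = {
    "events": [{"id": 1, "kind": "click"}, {"id": 2, "kind": "view"}],
}


class DatasetHandler(BaseHTTPRequestHandler):
    """Serves GET /datasets/<name> as JSON; every other path gets a 404."""

    def do_GET(self) -> None:
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "datasets" and parts[1] in PROCESSED:
            self._send_json(200, PROCESSED[parts[1]])
        else:
            self._send_json(404, {"error": "not found"})

    def _send_json(self, status: int, payload: Any) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: Any) -> None:
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port: int = 0) -> ThreadingHTTPServer:
    """Bind to localhost only; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), DatasetHandler)
```

Calling `make_server(8000).serve_forever()` would expose the mock dataset at `http://127.0.0.1:8000/datasets/events`. A dashboard or any other client could read from the same endpoint, which is the point of separating processing from consumption.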

caiocampoos / data-playground

Define Data Lake Structure #2
