caiocampoos / data-playground

The objective of this project is to document the development of a full data analysis platform. We will document code, docs, discussions, best practices, and overall decision making as we go.

Define Data Lake Structure #2

Open caiocampoos opened 2 months ago

caiocampoos commented 2 months ago

In our project we want to receive data as is, and be able to run many kinds of analysis from the same data. For that it would be interesting to tackle this project as a combination of Data Lake + Analytics Dashboard.

On this issue we will discuss and define the following:

The focus is to define the broad scope of the project, and later go into more detail in separate issues for each topic.
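As a rough illustration of "receive data as is", here is a minimal sketch of a raw landing step that stores each payload byte-for-byte, so any later analysis can start from the same untouched data. The `raw/<source>/<date>/` layout, the `land_raw` name, and the content-hash file name are assumptions made for this example, not decisions taken in this issue.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, payload: bytes, day: date) -> Path:
    """Store a payload unchanged under raw/<source>/<YYYY-MM-DD>/ (hypothetical layout)."""
    # Name the file after a content hash so landing the same payload twice is idempotent.
    digest = hashlib.sha256(payload).hexdigest()[:16]
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{digest}.bin"
    # No parsing or cleaning at ingestion time: the bytes are written as received.
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        landed = land_raw(Path(tmp), "mock_source", b'{"id": 1}', date(2024, 1, 1))
        print(landed.relative_to(tmp))
```

Keeping ingestion free of any transformation means each kind of analysis can apply its own parsing later without losing information from the original input.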

caiocampoos commented 2 months ago

Data Lake References

https://aws.amazon.com/what-is/data-lake/

calilisantos commented 2 months ago

Great. Some suggestions for the bullet points:

| Mark | Decisions |
| --- | --- |
| Storage | 1. Provide in the cloud or on premise? |
| | 2. What are the access and governance assumptions? |
| | 3. What's the estimated capacity needed? (current and expected) |
| Compute | 1. What data volume is expected? (daily/monthly, ...) |
| | 2. Is machine learning integration with the company's products expected? (Recommendations, AI-based alerts, ...) |
| | 3. What's the team's maturity in data products and projects? |
| | 4. Is it possible to set a budget for the data infrastructure capabilities? |

| Solution | Function | Open Source | On-premise based | Cloud based |
| --- | --- | --- | --- | --- |
| Delta Lake | Lake architecture focused on processing and storage optimization | Yes | Yes | With integrations in almost all |
| Apache Iceberg | Lake architecture focused on processing and querying optimization and cloud integration | Yes | Yes | Yes |
| Apache Hudi | Lake structure focused on stream processing | Yes | Yes | Yes |
| Cloud solutions | The main providers have their own solutions and integrations with all of the above options | No | No, but with many integrations and complementary data products | Well... |
| Databricks | The environment to explore a data lake, with integrations and solutions for data engineering, science and analytics. Applies the Delta Lake architecture. Could be expensive | No | No, but with many integrations | Yes |
| Snowflake | The main Databricks competitor, with some advantages for analysis solutions | No | No, but with more integrations than Databricks for this environment | Yes |

caiocampoos commented 2 months ago
  1. Provide in the cloud or on premise?
  2. What are the access and governance assumptions?
  3. What's the estimated capacity needed? (current and expected)

For now storage is not really important. The idea is that, for any solution, we can plug any data source into it, so for development we can use a big mock file. Governance is not important for now either. For capacity, we can think of something that ranges from low to mid. We don't need to worry about huge scale, but it would be nice to have something that can scale to bigger volumes.
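A big mock file for development could be generated with something like the sketch below. The column names (`event_id`, `user_id`, `amount`), the value ranges, and the `write_mock_csv` helper are all made up for illustration; the real schema is still undefined in this issue.

```python
import csv
import random
import tempfile
from pathlib import Path


def write_mock_csv(path: Path, rows: int, seed: int = 0) -> None:
    """Write `rows` fake events to a CSV file (hypothetical schema)."""
    # A fixed seed makes the mock file reproducible across machines and runs.
    rng = random.Random(seed)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["event_id", "user_id", "amount"])
        for event_id in range(rows):
            writer.writerow([event_id, rng.randint(1, 1000), round(rng.uniform(0, 500), 2)])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mock_path = Path(tmp) / "mock_events.csv"
        write_mock_csv(mock_path, rows=100_000)
        print(mock_path.stat().st_size, "bytes written")
```

Raising `rows` is a cheap way to check how a candidate solution behaves as the volume moves from low to mid.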

caiocampoos commented 2 months ago

Compute

  1. What data volume is expected? (daily/monthly, ...)
  2. Is machine learning integration with the company's products expected? (Recommendations, AI-based alerts, ...)
  3. What's the team's maturity in data products and projects?
  4. Is it possible to set a budget for the data infrastructure capabilities?

We can investigate some ML integration. Although it is not the main focus of the project, I see no issue with including something as long as it does not go out of scope.

There is no team. We are developing to learn, so we assume any seniority level.

We should always focus on self-hosted, open source, and free solutions. The main goal of this project is to produce a reproducible and free option that is as close as possible to a fully managed Data Lake and Analytics Platform. Following the principle of bring your own database, it should be able to plug into any data source at no cost.
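One possible way to read "bring your own database" is a small source interface that every backend implements, so the rest of the platform never depends on a specific store. The sketch below uses only the Python standard library; the `DataSource` protocol and the `CsvTextSource`, `SqliteSource`, and `count_records` names are hypothetical and only illustrate the idea.

```python
import csv
import io
import sqlite3
from typing import Iterator, Protocol


class DataSource(Protocol):
    """Anything that can yield records as plain dictionaries (hypothetical interface)."""

    def records(self) -> Iterator[dict]: ...


class CsvTextSource:
    """Reads records from CSV text, for example the contents of a mock file."""

    def __init__(self, text: str) -> None:
        self.text = text

    def records(self) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(self.text))


class SqliteSource:
    """Reads records from the result of a SQL query on a SQLite connection."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        self.conn = conn
        self.query = query

    def records(self) -> Iterator[dict]:
        cursor = self.conn.execute(self.query)
        columns = [column[0] for column in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))


def count_records(source: DataSource) -> int:
    """Example consumer: works with any source that follows the protocol."""
    return sum(1 for _ in source.records())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
    conn.execute("INSERT INTO events VALUES (1, 'click')")
    print(count_records(SqliteSource(conn, "SELECT id, kind FROM events")))
    print(count_records(CsvTextSource("id,kind\n1,click\n2,view\n")))
```

Under this assumption, supporting a new database would only mean writing one more class with a `records` method, at no extra cost to the analysis code.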

caiocampoos commented 2 months ago
  • Interface: Python is the reference for data. Period. All the previous solutions can be used from Python, but Java, R, and Scala are other good interfaces. All of them have frameworks like Apache Spark in common. But the volume especially, and the tech infrastructure, could guide this point.
  • Visualization: The previous answers could address this, but the Python package Streamlit has been gaining maturity for this purpose, with the advantages of being open source. Other popular solutions are Qlik, Tableau, Power BI, Looker, and others.

We are probably betting on Python for the short term, but we also want to expose a public API so we can not only process but also consume data after processing. The idea is to have a really accessible and customizable interface for unstructured data.
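To make the "consume data after processing" part concrete, here is a minimal sketch of a read-only JSON API built with Python's standard `http.server` module. The `/datasets` routes, the `route` function, and the `PROCESSED` dictionary with its sample values are assumptions for illustration only; a real project would likely pick a web framework and a proper storage layer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Hypothetical processed results, keyed by dataset name (mock values).
PROCESSED = {"daily_totals": [{"day": "2024-01-01", "total": 42}]}


def route(path: str) -> Tuple[int, dict]:
    """Map a request path to (status code, JSON-serializable body)."""
    if path == "/datasets":
        return 200, {"datasets": sorted(PROCESSED)}
    prefix = "/datasets/"
    if path.startswith(prefix) and path[len(prefix):] in PROCESSED:
        return 200, {"rows": PROCESSED[path[len(prefix):]]}
    return 404, {"error": "not found"}


class ApiHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around `route`, so the routing logic stays easy to test."""

    def do_GET(self) -> None:
        status, body = route(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# To serve locally (blocks until interrupted):
#   HTTPServer(("127.0.0.1", 8000), ApiHandler).serve_forever()
```

Keeping the routing in a plain function separate from the HTTP handler makes it possible to swap the transport later without touching how datasets are looked up.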