Open caiocampoos opened 7 months ago
Great! Some suggestions for the bullet points:
| Mark | Decisions |
|---|---|
| Storage | 1. Provide in cloud or on premise? 2. What are the access and governance assumptions? 3. What's the estimated capacity needed (current and expected)? |
| Compute | 1. What's the expected data volume (daily/monthly, ...)? 2. Is machine learning integration expected with the company's products (recommendations, AI-based alerts, ...)? 3. What's the team's maturity in data products and projects? 4. Is it possible to assign a budget to the data infrastructure? |
| Solution | Function | Open Source | On-premise based | Cloud based |
|---|---|---|---|---|
| Delta Lake | Lake architecture focused on processing and storage optimization | Yes | Yes | With integrations in almost all |
| Apache Iceberg | Lake architecture focused on processing and query optimization, and on cloud integration | Yes | Yes | Yes |
| Apache Hudi | Lake structure focused on streaming processing | Yes | Yes | Yes |
| Cloud solutions | The main providers have their own solutions and integrations with all of the above options | No | No, but with many integrations and complementary data products | Well... |
| Databricks | An environment to explore the data lake, with integrations and solutions for data engineering, science and analytics. Applies the Delta Lake architecture. Can be expensive | No | No, but with many integrations | Yes |
| Snowflake | Databricks' main competitor, with some advantages for analytics solutions | No | No, but with more integrations than Databricks for this environment | Yes |
Interface: Python is the reference for data, period. All the previous solutions are Python-based, but Java, R and Scala are other good interfaces. All of them have frameworks like Apache Spark in common. But data volume especially, and the tech infrastructure, could guide this decision.
Visualization:
The previous answers could address this, but the Python package Streamlit has been gaining maturity for this purpose, with the advantages of open source.
Other popular solutions are Qlik, Tableau, Power BI, Looker, and more.
- Provide in cloud or on premise?
- What are the access and governance assumptions?
- What's the estimated capacity needed (current and expected)?
For now storage is not really important; the idea is that, for any solution, we can plug any data source into it, so for development we can use a big mock file. Governance is also not important right now, and for capacity we can think of something ranging from low to mid. We don't need to worry about huge scale, but it is nice to have something that can scale to bigger volumes.
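Since development will start from a big mock file rather than real sources, a small generator is enough to produce one. This is a minimal sketch, stdlib only; the schema (`event_id`, `user_id`, `event_type`, `payload`) is entirely hypothetical, since the platform is meant to accept data as-is:

```python
import csv
import json
import random


def make_mock_source(path: str, rows: int = 1000, seed: int = 42) -> int:
    """Write a mock CSV 'data source' for local development.

    The columns below are made up for illustration -- any schema works,
    because ingestion should not enforce one. Returns the row count.
    """
    random.seed(seed)  # deterministic output, handy for repeatable tests
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["event_id", "user_id", "event_type", "payload"])
        for i in range(rows):
            writer.writerow([
                i,
                random.randint(1, 100),
                random.choice(["click", "view", "purchase"]),
                json.dumps({"value": round(random.random(), 4)}),
            ])
    return rows
```

Scaling the `rows` argument up also gives a cheap way to probe the low-to-mid capacity range mentioned above.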
Compute 1. What's the expected data volume (daily/monthly, ...)?
- Is machine learning integration expected with the company's products (recommendations, AI-based alerts, ...)?
- What's the team's maturity in data products and projects?
- Is it possible to assign a budget to the data infrastructure?
We can investigate some ML integration; although it is not the main focus of the project, I see no issue with including something as long as it does not go out of scope.
There is no team; we are developing to learn, so we are not assuming any particular seniority level.
We should always focus on self-hosted, open-source and free solutions. The main goal of this project is to produce a reproducible and free option that is as close as possible to a fully managed Data Lake and Analytics Platform, following the principle of bring your own database: it should be able to plug in any data source at no cost.
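One way to make "plug in any data source" concrete is a small adapter contract: anything that can yield records as plain dicts can be landed in the lake. This is a sketch of one possible design, not a settled interface; the names `DataSource`, `CSVSource` and `ingest` are hypothetical:

```python
import csv
from typing import Iterator, Protocol


class DataSource(Protocol):
    """Hypothetical adapter contract: any object yielding dict records
    can be plugged in, regardless of where the data actually lives."""

    def records(self) -> Iterator[dict]: ...


class CSVSource:
    """Example adapter wrapping a local CSV file, such as a dev mock file."""

    def __init__(self, path: str):
        self.path = path

    def records(self) -> Iterator[dict]:
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)


def ingest(source: DataSource) -> list[dict]:
    """Land records as-is: no schema enforcement at ingestion time."""
    return list(source.records())
```

A Postgres or API-backed adapter would implement the same `records()` method, which keeps the "bring your own database" idea testable with nothing but an in-memory stub.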
- Interface: Python is the reference for data, period. All the previous solutions are Python-based, but Java, R and Scala are other good interfaces. All of them have frameworks like Apache Spark in common. But data volume especially, and the tech infrastructure, could guide this decision.
- Visualization: The previous answers could address this, but the Python package Streamlit has been gaining maturity for this purpose, with the advantages of open source. Other popular solutions are Qlik, Tableau, Power BI, Looker, and more.
We are probably betting on Python for the short term, but we also want to expose a public API so we can not only process but also consume data after processing. The idea is to have a really accessible and customizable interface for unstructured data.
In our project we want to receive data as-is and be able to run many kinds of analysis on the same data. For that it would be interesting to approach this project as a combination of Data Lake + Analytics Dashboard.
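To make the "public API for consuming processed data" idea tangible, here is a minimal read-only sketch using only the standard library. Everything here is an assumption, not a decision: the `/records` route, the `PROCESSED` in-memory stand-in (a real deployment would read from the lake's storage), and the string-equality filtering are all placeholders:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# In-memory stand-in for the processed layer; hypothetical.
PROCESSED: list[dict] = []


def query_records(records: list[dict], filters: dict[str, str]) -> list[dict]:
    """Return records matching every key=value filter (string comparison)."""
    return [
        r for r in records
        if all(str(r.get(k)) == v for k, v in filters.items())
    ]


class ReadAPI(BaseHTTPRequestHandler):
    """Minimal read-only endpoint: GET /records?key=value returns JSON."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/records":
            self.send_error(404)
            return
        filters = {k: v[0] for k, v in parse_qs(url.query).items()}
        body = json.dumps(query_records(PROCESSED, filters)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocking dev server; a real API would use a proper framework."""
    HTTPServer(("", port), ReadAPI).serve_forever()
```

Keeping the filtering in a pure function like `query_records` means the consumption logic stays testable even if the HTTP layer is later swapped for something like FastAPI.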
On this issue we will discuss and define the following:
The focus is to define the broad scope of the project, and later go into more detailed issues on each topic.