Set up environments - Githubissues

VinhDevNguyen commented 4 months ago

Plan we do:

[x] Set up SFTP on the laptop server. Remember to connect to OpenVPN before connecting over SSH:
```
ssh padapew@10.0.0.2
```
Password: !HelloPenis123321
[ ] Write Docker Compose files and test-run them on the desktop server:
- [x] Airflow and PostgreSQL
- [x] PySpark (@VinhDevNguyen)
  - [x] Spark cluster
  - ~~[x] Jupyter notebook + PySpark connected to the Spark cluster -> #4~~ -> Use code-server instead; see this comment: https://github.com/VinhDevNguyen/end2end_datapipeline_project/issues/4#issuecomment-2227387009
- [x] Kafka with monitoring tools (@ShinVu - once done, please break down what you have done)
  - [x] jmx-exporter
  - [x] Prometheus
  - [x] Grafana
- [ ] #3 #7
- [x] Delta Lake (@VinhDevNguyen)
[x] Set up a GitHub runner for CI/CD https://github.com/VinhDevNguyen/costco_project/settings/actions/runners/2
[x] Set up webhooks to send notifications to Discord
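
For the Discord notification item, here is a minimal sketch of what a custom sender could look like, assuming we post plain-text messages ourselves rather than relying on a built-in integration. The webhook URL is a placeholder, and `build_payload` / `send_notification` are hypothetical helper names, not something already in the repo.

```python
import json
import urllib.request

# Placeholder only: the real URL comes from the Discord channel's webhook settings.
WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"


def build_payload(event: str, repo: str, detail: str) -> bytes:
    """Encode a short notification as the JSON body a Discord webhook accepts."""
    message = f"[{repo}] {event}: {detail}"
    # Discord limits message content to 2000 characters, so truncate defensively.
    return json.dumps({"content": message[:2000]}).encode("utf-8")


def send_notification(url: str, payload: bytes) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent avoids the default urllib one being rejected.
            "User-Agent": "pipeline-notifier",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A CI step could then call `send_notification(WEBHOOK_URL, build_payload("CI failed", "end2end_datapipeline_project", "see the runner logs"))` once the placeholder URL is replaced with the real one (ideally read from a secret, not committed).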

Additional:

Set up Kafka + Spark
Schedule notebooks with Airflow

VinhDevNguyen commented 4 months ago

@ShinVu I need you to explain how to set up Docker for the monitoring tools (jmx-exporter, Grafana, Prometheus) and how to use them

ShinVu commented 4 months ago

Should we add a database to serve as the on-premises database? As the document states:

- Build an on-premises database to populate the selected dataset, which satisfies:
  - Some data can be snapshotted and ingested daily.
  - Some data needs to be ingested via a near real-time mechanism (e.g.: every 2-3 hours...).
- Build an ETL pipeline (using Databricks, ADF...) to ingest data from the database to Azure Data Lake Gen 2 (ADLS), following the medallion architecture.
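
To make the two ingestion cadences concrete, here is a rough sketch of how the requirement could be modeled, assuming cron-style schedules and one folder per medallion layer. The table names, the storage account in the usage example, and the helper names (`IngestionJob`, `make_job`, `layer_path`) are all hypothetical and only for illustration.

```python
from dataclasses import dataclass

# Layer names follow the usual medallion convention (bronze -> silver -> gold).
MEDALLION_LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class IngestionJob:
    table: str  # hypothetical source table name
    mode: str   # "snapshot" (daily) or "incremental" (near real-time)
    cron: str   # standard 5-field cron expression


def make_job(table: str, near_real_time: bool) -> IngestionJob:
    """Pick a schedule that matches the cadence described in the requirement."""
    if near_real_time:
        # Every 3 hours, on the hour (upper end of the "every 2-3 hours" example).
        return IngestionJob(table, "incremental", "0 */3 * * *")
    # Once a day at midnight for snapshot-style tables.
    return IngestionJob(table, "snapshot", "0 0 * * *")


def layer_path(container_url: str, layer: str, table: str) -> str:
    """Build the target folder for a table inside one medallion layer."""
    if layer not in MEDALLION_LAYERS:
        raise ValueError(f"unknown medallion layer: {layer}")
    return f"{container_url.rstrip('/')}/{layer}/{table}"
```

For example, `layer_path("abfss://lake@example.dfs.core.windows.net", "bronze", "orders")` gives the raw landing folder for a made-up `orders` table, and the `cron` strings can be passed to whichever scheduler we settle on.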

VinhDevNguyen commented 4 months ago

Should we add a database to serve as the on-premises database? As the document states:

- Build an on-premises database to populate the selected dataset, which satisfies:
  - Some data can be snapshotted and ingested daily.
  - Some data needs to be ingested via a near real-time mechanism (e.g.: every 2-3 hours...).
- Build an ETL pipeline (using Databricks, ADF...) to ingest data from the database to Azure Data Lake Gen 2 (ADLS), following the medallion architecture.

@TranBinhLuatUIT @IAMTOIR Do we need an on-premises database like PostgreSQL or any other database? Right now, we have PostgreSQL installed with Airflow.

VinhDevNguyen / end2end_datapipeline_project

Set up environments #1