Open zepor opened 3 months ago
The task involves implementing ETL (Extract, Transform, Load) processes using Azure Data Factory to handle NFL data. The solution requires designing ETL scripts to extract data from data providers, transform it into suitable formats, and load it into an Azure SQL Database. Additionally, ETL service principal credentials need to be securely stored in GitHub Secrets.
backend-container/src/etl/__init__.py
This file initializes the ETL module, imports necessary components, and defines a simple ETL pipeline function.
```python
from .extract import extract_data
from .transformations import transform_data
from .load import load_data
from .config import Config
from .credentials import get_service_principal_credentials

config = Config()
credentials = get_service_principal_credentials()

def run_etl_pipeline():
    # extract_data expects the provider config (which carries the API key); see extract.py.
    provider = config.DATA_PROVIDERS['sports_radar']
    raw_data = extract_data(provider['base_url'], provider)
    transformed_data = transform_data(raw_data)
    load_data(transformed_data, config.AZURE_SQL_DATABASE)
```
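A hypothetical caller might look like the sketch below; the `src.etl` import path is an assumption about how the package is exposed, not taken from the repository layout.

```python
# Hypothetical entry point; the src.etl import path is an assumption.
from src.etl import run_etl_pipeline

if __name__ == "__main__":
    run_etl_pipeline()
```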
backend-container/src/etl/config.py
This file defines configuration settings for the ETL processes, including database connection settings, API endpoints, and ETL process settings.
```python
import os

class Config:
    AZURE_SQL_DATABASE = {
        'server': os.getenv('AZURE_SQL_SERVER', 'your_server.database.windows.net'),
        'database': os.getenv('AZURE_SQL_DATABASE', 'your_database'),
        'username': os.getenv('AZURE_SQL_USERNAME', 'your_username'),
        'password': os.getenv('AZURE_SQL_PASSWORD', 'your_password'),
        'driver': '{ODBC Driver 17 for SQL Server}'
    }
    DATA_PROVIDERS = {
        'sports_radar': {
            'base_url': os.getenv('SPORTS_RADAR_BASE_URL', 'https://api.sportradar.us'),
            'api_key': os.getenv('SPORTS_RADAR_API_KEY', 'your_api_key')
        }
    }
    ETL_SETTINGS = {
        'batch_size': int(os.getenv('ETL_BATCH_SIZE', 1000)),
        'retry_attempts': int(os.getenv('ETL_RETRY_ATTEMPTS', 3)),
        'retry_delay': int(os.getenv('ETL_RETRY_DELAY', 5))
    }
    ETL_SERVICE_PRINCIPAL_ID = os.getenv('ETL_SERVICE_PRINCIPAL_ID', 'your_service_principal_id')
    ETL_SERVICE_PRINCIPAL_SECRET = os.getenv('ETL_SERVICE_PRINCIPAL_SECRET', 'your_service_principal_secret')
```
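`ETL_SETTINGS` defines retry parameters that none of the modules below consume yet. A minimal sketch of how they could wrap an extraction call is shown here; the `with_retries` helper name is an assumption, not part of the proposed files.

```python
import time
import logging

from .config import Config

logger = logging.getLogger(__name__)

def with_retries(func, *args, settings=None, **kwargs):
    """Hypothetical helper: retry a callable using Config.ETL_SETTINGS."""
    settings = settings or Config.ETL_SETTINGS
    attempts = settings['retry_attempts']
    delay = settings['retry_delay']
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # in practice, catch narrower errors such as requests.RequestException
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Usage would be, for example, `raw_data = with_retries(extract_data, provider['base_url'], provider)`.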
backend-container/src/etl/etl_process.py
This file implements the ETL process, encapsulating extraction, transformation, and loading steps.
```python
import logging
from .extract import extract_data
from .transformations import transform_data
from .load import load_data
from .config import Config
from .credentials import get_service_principal_credentials

class ETLProcess:
    def __init__(self):
        self.config = Config()
        self.credentials = get_service_principal_credentials()
        self.logger = logging.getLogger('ETLProcess')
        logging.basicConfig(level=logging.INFO)

    def run(self):
        try:
            self.logger.info("Starting ETL process")
            # extract_data expects the provider config (which carries the API key); see extract.py.
            provider = self.config.DATA_PROVIDERS['sports_radar']
            raw_data = extract_data(provider['base_url'], provider)
            transformed_data = transform_data(raw_data)
            load_data(transformed_data, self.config.AZURE_SQL_DATABASE)
            self.logger.info("ETL process completed successfully")
        except Exception as e:
            self.logger.error(f"ETL process failed: {e}")
            raise

if __name__ == "__main__":
    etl_process = ETLProcess()
    etl_process.run()
```
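A minimal pytest-style check of the orchestration could look like the sketch below. The `src.etl` import path and the patch targets are assumptions about the package layout.

```python
from unittest.mock import patch

import pandas as pd

from src.etl.etl_process import ETLProcess  # import path assumed

def test_run_invokes_all_stages():
    frame = pd.DataFrame({'column1': [1], 'column2': [2], 'column3': [3]})
    with patch('src.etl.etl_process.get_service_principal_credentials', return_value=object()), \
         patch('src.etl.etl_process.extract_data', return_value=frame) as extract, \
         patch('src.etl.etl_process.transform_data', return_value=frame) as transform, \
         patch('src.etl.etl_process.load_data') as load:
        ETLProcess().run()
    extract.assert_called_once()
    transform.assert_called_once_with(frame)
    load.assert_called_once()
```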
backend-container/src/etl/pipelines.py
This file defines and updates ETL pipelines to align with the updated SportsRadar API documentation.
```python
import logging
from .extract import extract_data
from .transformations import transform_data
from .load import load_data
from .config import Config
from .credentials import get_service_principal_credentials

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def etl_pipeline():
    try:
        logger.info("Starting data extraction...")
        # extract_data expects the provider config (which carries the API key); see extract.py.
        provider = Config.DATA_PROVIDERS['sports_radar']
        raw_data = extract_data(provider['base_url'], provider)
        logger.info("Data extraction completed.")

        logger.info("Starting data transformation...")
        transformed_data = transform_data(raw_data)
        logger.info("Data transformation completed.")

        logger.info("Starting data loading...")
        load_data(transformed_data, Config.AZURE_SQL_DATABASE)
        logger.info("Data loading completed.")
    except Exception as e:
        logger.error(f"ETL pipeline failed: {e}")
        raise

if __name__ == "__main__":
    etl_pipeline()
```
backend-container/src/etl/extract.py
This file implements the data extraction logic.
```python
import requests
import pandas as pd
from .config import Config

def extract_data(api_url, credentials):
    headers = {
        'Authorization': f'Bearer {credentials["api_key"]}'
    }
    response = requests.get(api_url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return pd.DataFrame(data)
    else:
        response.raise_for_status()

def extract_nfl_data():
    nfl_endpoint = f"{Config.DATA_PROVIDERS['sports_radar']['base_url']}/nfl"
    return extract_data(nfl_endpoint, Config.DATA_PROVIDERS['sports_radar'])
```
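Some provider endpoints expect the key as a query parameter rather than a Bearer header. A hedged variant is sketched below; the `api_key` parameter name is an assumption and should be checked against the SportsRadar API reference.

```python
import requests
import pandas as pd

def extract_data_with_query_key(api_url, provider):
    """Variant of extract_data for endpoints that take the key as a query parameter.

    The 'api_key' parameter name is an assumption; verify it in the provider docs.
    """
    response = requests.get(api_url, params={'api_key': provider['api_key']}, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())
```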
backend-container/src/etl/credentials.py
This file retrieves and uses ETL service principal credentials from GitHub Secrets.
```python
import os
from azure.identity import ClientSecretCredential
from azure.keyvault.secrets import SecretClient

def get_github_secret(secret_name):
    secret_value = os.getenv(secret_name)
    if not secret_value:
        raise ValueError(f"GitHub Secret {secret_name} not found in environment variables.")
    return secret_value

def get_service_principal_credentials():
    client_id = get_github_secret('ETL_SERVICE_PRINCIPAL_ID')
    client_secret = get_github_secret('ETL_SERVICE_PRINCIPAL_SECRET')
    tenant_id = get_github_secret('AZURE_TENANT_ID')
    credential = ClientSecretCredential(
        tenant_id=tenant_id,
        client_id=client_id,
        client_secret=client_secret
    )
    return credential

def get_secret_from_key_vault(vault_url, secret_name):
    credential = get_service_principal_credentials()
    client = SecretClient(vault_url=vault_url, credential=credential)
    secret = client.get_secret(secret_name)
    return secret.value

if __name__ == "__main__":
    try:
        key_vault_url = "https://<your-key-vault-name>.vault.azure.net/"
        secret_name = "your-secret-name"
        secret_value = get_secret_from_key_vault(key_vault_url, secret_name)
        print(f"Retrieved secret: {secret_value}")
    except Exception as e:
        print(f"Error retrieving secret: {e}")
```
backend-container/src/etl/load.py
This file implements the data loading logic.
```python
import pyodbc
import logging
from .config import Config

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_data(data, db_config):
    conn = None  # ensure the name exists even if the connection attempt fails
    try:
        conn = pyodbc.connect(
            f"DRIVER={{ODBC Driver 17 for SQL Server}};"
            f"SERVER={db_config['server']};"
            f"DATABASE={db_config['database']};"
            f"UID={db_config['username']};"
            f"PWD={db_config['password']}"
        )
        cursor = conn.cursor()
        for index, row in data.iterrows():
            cursor.execute(
                "INSERT INTO your_table_name (column1, column2, column3) VALUES (?, ?, ?)",
                row['column1'], row['column2'], row['column3']
            )
        conn.commit()
        logger.info("Data loaded successfully into Azure SQL Database.")
    except Exception as e:
        logger.error(f"Error loading data into Azure SQL Database: {e}")
        raise
    finally:
        if conn:
            conn.close()
            logger.info("Database connection closed.")

if __name__ == "__main__":
    transformed_data = ...  # Replace with actual transformed data
    load_data(transformed_data, Config.AZURE_SQL_DATABASE)
```
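Row-by-row inserts are slow for large frames. A hedged batched variant using pyodbc's `executemany` and `Config.ETL_SETTINGS['batch_size']` is sketched below; the table and column names are the same placeholders as above.

```python
import pyodbc

from .config import Config

def load_data_batched(data, db_config):
    """Insert the DataFrame in batches of ETL_SETTINGS['batch_size'] rows."""
    batch_size = Config.ETL_SETTINGS['batch_size']
    rows = list(data[['column1', 'column2', 'column3']].itertuples(index=False, name=None))
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 17 for SQL Server}};"
        f"SERVER={db_config['server']};"
        f"DATABASE={db_config['database']};"
        f"UID={db_config['username']};"
        f"PWD={db_config['password']}"
    )
    try:
        cursor = conn.cursor()
        cursor.fast_executemany = True  # send each parameter batch in a single round trip
        for start in range(0, len(rows), batch_size):
            cursor.executemany(
                "INSERT INTO your_table_name (column1, column2, column3) VALUES (?, ?, ?)",
                rows[start:start + batch_size]
            )
        conn.commit()
    finally:
        conn.close()
```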
backend-container/src/etl/transformations.py
This file implements the data transformation logic.
```python
import pandas as pd

def clean_data(df):
    df.fillna(method='ffill', inplace=True)
    df['date'] = pd.to_datetime(df['date'])
    df.drop_duplicates(inplace=True)
    return df

def enrich_data(df):
    df['new_field'] = df['existing_field'] * 2
    return df

def transform_data(raw_data):
    cleaned_data = clean_data(raw_data)
    enriched_data = enrich_data(cleaned_data)
    return enriched_data

if __name__ == "__main__":
    raw_data = pd.DataFrame({
        'date': ['2021-01-01', '2021-01-02', None],
        'existing_field': [10, 20, 30]
    })
    transformed_data = transform_data(raw_data)
    print(transformed_data)
```
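Since `load_data` expects specific columns, a small validation step between transform and load can fail fast on schema drift. This is a sketch; the `REQUIRED_COLUMNS` list is an assumption standing in for the real target schema.

```python
import pandas as pd

# Columns the loader expects; adjust to the real target schema (assumption).
REQUIRED_COLUMNS = ('column1', 'column2', 'column3')

def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Raise early if the transformed frame is missing columns the loader needs."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Transformed data is missing required columns: {missing}")
    return df
```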
backend-container/src/etl/documentation.md
This file provides comprehensive documentation for the ETL processes.
# ETL Process Documentation
## 1. ETL Process Diagrams
### Overview Diagram
![ETL Overview](images/etl_overview.png)
### Detailed Diagrams
#### Extraction
![Extraction Process](images/extraction_process.png)
#### Transformation
![Transformation Process](images/transformation_process.png)
#### Loading
![Loading Process](images/loading_process.png)
## 2. Data Flow Charts
### Extraction Flow
```mermaid
graph TD;
    A[Data Provider] --> B[Extract Data];
    B --> C[Raw Data Storage];
```

### Transformation Flow
```mermaid
graph TD;
    C[Raw Data Storage] --> D[Transform Data];
    D --> E[Transformed Data Storage];
```

### Loading Flow
```mermaid
graph TD;
    E[Transformed Data Storage] --> F[Load Data];
    F --> G[Azure SQL Database];
```
## 3. Sample ETL Scripts

### extract.py
```python
import requests

def extract_data(api_url):
    response = requests.get(api_url)
    data = response.json()
    return data
```

### transformations.py
```python
def transform_data(raw_data):
    transformed_data = [process_record(record) for record in raw_data]
    return transformed_data

def process_record(record):
    # Apply field-level transformations here; returns the record unchanged as a placeholder.
    return record
```

### load.py
```python
import pyodbc

def load_data(transformed_data, connection_string):
    conn = pyodbc.connect(connection_string)
    cursor = conn.cursor()
    for record in transformed_data:
        cursor.execute("INSERT INTO table_name (column1, column2, column3) VALUES (?, ?, ?)", record)
    conn.commit()
    conn.close()
```
## 4. Storing Credentials in GitHub Secrets

Store the following secrets in GitHub Secrets:

- `ETL_SERVICE_PRINCIPAL_ID`
- `ETL_SERVICE_PRINCIPAL_SECRET`

### credentials.py
```python
import os

ETL_SERVICE_PRINCIPAL_ID = os.getenv('ETL_SERVICE_PRINCIPAL_ID')
ETL_SERVICE_PRINCIPAL_SECRET = os.getenv('ETL_SERVICE_PRINCIPAL_SECRET')
```
This documentation provides a clear and comprehensive guide for implementing and understanding the ETL processes.
[Click here to create a Pull Request with the proposed solution](https://api.codeautopilot.com/create-pr?task=29600&hash=f66e0ecd902ff3055ab21f701e91ee0982dc29def9aa17e2f769672468ad5f2a)
**Files used for this task:**
<details>
<summary>Changes on backend-container/src/etl/__init__.py</summary>
To initialize the ETL module, you need to create an `__init__.py` file in the `backend-container/src/etl/` directory. This file will serve as the initializer for the ETL module, allowing you to import and use the various components of the ETL process (extraction, transformation, loading, etc.) in a structured manner.
Here's a practical proposal for the `__init__.py` file:
1. **Import Necessary Modules**: Import the essential modules that will be used across the ETL process.
2. **Define Initialization Logic**: If there are any global variables or configurations that need to be set up when the module is initialized, define them here.
### Implementation
```python
# backend-container/src/etl/__init__.py

# Import necessary modules for the ETL process
from .extract import extract_data
from .transformations import transform_data
from .load import load_data
from .config import ETLConfig
from .credentials import get_service_principal_credentials

# Initialize ETL configuration
config = ETLConfig()

# Retrieve ETL service principal credentials
credentials = get_service_principal_credentials()

# Define a simple ETL pipeline function for demonstration
def run_etl_pipeline():
    # Extract data
    raw_data = extract_data(config, credentials)
    # Transform data
    transformed_data = transform_data(raw_data)
    # Load data
    load_data(transformed_data, config, credentials)

# If needed, you can add more initialization logic here
```
The file imports the core ETL components (`extract_data`, `transform_data`, `load_data`, `ETLConfig`, and `get_service_principal_credentials`), which are essential for the ETL process. A `run_etl_pipeline` function is defined to demonstrate how the ETL process can be executed in sequence: extraction, transformation, and loading. This setup ensures that the ETL module is properly initialized and ready to be used in other parts of the application. If additional initialization steps or global settings are required, they can be added to this file.
</details>
Hello,
I would like to express my willingness to contribute a fix for this bug.
Thank you for the opportunity!
Best regards, Pragy Shukla
Go ahead and branch a fix!
Hi, can you please point me to what needs to be done here?
You can use the branch attached to this issue.
Implement ETL Processes
Description: Implement ETL processes tailored to NFL data.
Tasks:
- Design ETL scripts to extract data from the data providers, transform it into suitable formats, and load it into the Azure SQL Database.
- Store the ETL service principal credentials (`ETL_SERVICE_PRINCIPAL_ID`, `ETL_SERVICE_PRINCIPAL_SECRET`) in GitHub Secrets.

Milestone: ETL processes implemented and data flowing into the warehouse.