The Data Collection Service is the DATA472 Central Collection Team's project for collecting the individual-project data of other students. It is built on Apache Airflow and automates the collection and processing of that data. The project includes multiple DAGs (Directed Acyclic Graphs) and processors that collect data from different sources and store it in a database.
Data-collection-service/
│
├── config/
│ └── config.py # Configuration settings
│
├── dags/
│ ├── are154.py # Individual DAG files
│ ├── dus15.py
│ ├── hpa117.py
│ ├── hra80.py
│ ├── hwa205.py
│ ├── jjm148.py
│ ├── owners_create.py
│ ├── owners_update.py
│ ├── pvv13.py
│ ├── rna104.py
│ ├── ruben.py
│ ├── sss135.py
│ ├── svi40.py
│ ├── tya51.py
│
├── processors/
│ ├── __init__.py # Initializes the processors package
│ ├── are154_processor.py # Individual processor files
│ ├── dus15_processor.py
│ ├── hpa117_processor.py
│ ├── hra80_processor.py
│ ├── hwa205_processor.py
│ ├── jjm148_processor.py
│ ├── owner_processor.py
│ ├── pvv13_processor.py
│ ├── rna104_processor.py
│ ├── ruben_processor.py
│ ├── prisoner_processor.py
│ ├── tya51_processor.py
│
├── webserver_config.py # Web server configuration
├── airflow.cfg # Airflow configuration file
└── README.md # Project documentation
Clone the project repository:
git clone git@github.com:Data472-Individual-Project-Pipeline/Data-collection-service.git
cd Data-collection-service
Install dependencies:
We highly recommend installing the dependencies inside a virtual environment:
pip install -r requirements.txt
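Creating and using a virtual environment with Python's built-in venv module might look like this:

```shell
python3 -m venv .venv            # create an isolated environment
source .venv/bin/activate        # activate it (on Windows: .venv\Scripts\activate)
pip install -r requirements.txt  # install the project's dependencies into it
```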
Configure Airflow:
Edit the airflow.cfg file and adjust settings as needed.
Update the config/config.py file with appropriate configuration values.
Initialize the Airflow database:
airflow db init
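The contents of config/config.py are not shown in this README. As an illustration only, it might hold database connection settings such as the following (every name here is an assumption, not the repository's actual keys):

```python
# config/config.py -- hypothetical sketch; the real configuration keys may differ.
import os

# Database connection settings read by the processors (names assumed).
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_PORT = int(os.getenv("DB_PORT", "5432"))
DB_NAME = os.getenv("DB_NAME", "data472")
DB_USER = os.getenv("DB_USER", "airflow")
DB_PASSWORD = os.getenv("DB_PASSWORD", "")
```

Reading values from environment variables with sensible defaults keeps credentials out of version control.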
Create an Airflow user:
airflow users create \
--username admin \
--firstname FIRST_NAME \
--lastname LAST_NAME \
--role Admin \
--email admin@example.com
Alternatively, you can run Airflow with Docker Compose. Download the docker-compose.yaml file from the official Airflow repository, then start the services:
docker-compose up
Start the Airflow web server:
airflow webserver --port 8080
Start the Airflow scheduler:
airflow scheduler
Access the Airflow UI in your browser at http://localhost:8080.
Configure and enable the desired DAGs:
Here is an example DAG file, jjm148.py:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from processors.jjm148_processor import process_jjm148
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2024, 1, 1),
'retries': 1,
}
dag = DAG(
'jjm148',
default_args=default_args,
description='JJM148 data collection DAG',
schedule_interval='@daily',
)
t1 = PythonOperator(
task_id='process_jjm148',
python_callable=process_jjm148,
dag=dag,
)
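The DAG above delegates the actual work to processors/jjm148_processor.py, which is not shown here. A processor of this shape could look roughly like the following sketch; the endpoint URL, field names, and save_rows helper are all hypothetical, not the repository's actual code:

```python
# Illustrative processor sketch -- endpoint, schema, and storage are assumptions.
import json
from urllib.request import urlopen

SOURCE_URL = "https://example.com/jjm148/data.json"  # hypothetical endpoint

def transform(records):
    """Normalise raw records into the columns the collection table expects."""
    return [
        {"id": r["id"], "value": r["value"], "collected_at": r.get("timestamp")}
        for r in records
    ]

def save_rows(rows):
    """Persist rows; the real project writes them to its central database."""
    print(f"saved {len(rows)} rows")

def process_jjm148():
    """Fetch the source data, normalise it, and hand it off to storage."""
    with urlopen(SOURCE_URL) as resp:
        records = json.load(resp)
    save_rows(transform(records))
```

Keeping the fetch/transform/store steps separate makes each processor easy to test without touching the network or the database.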
We welcome contributions! Please read the following instructions to get started:
Fork the repository and create a feature branch (git checkout -b feature-branch).
Commit your changes (git commit -am 'Add new feature').
Push the branch (git push origin feature-branch).
Open a Pull Request.
If you have any questions, please feel free to contact the Central Collection Team or the project maintainer at aemooooon@gmail.com