A web API exposing a neural network that detects duplicate entities in knowledge graphs. It uses API key authentication and rate-limits requests based on client tier (FREEMIUM, PREMIUM).
Create a structured, organized layout for the machine learning (ML) portion of the project within the src directory. This structure should adhere to Cookiecutter Data Science standards, supporting modularity, scalability, and separation of concerns. The organization should include folders for data processing, modeling, and results, following best practices for ML workflows.
[!NOTE]
Adapt this structure to the project; we will probably not use every folder.
User Stories
As a Data Scientist, I want an organized project structure, so I can easily manage data, train models, and generate results.
As a Developer, I want the ML logic to be modular, so it can be reused and integrated efficiently with the API.
Details
Objective: Set up a comprehensive ML project layout within src, making it easy to manage data, train models, preprocess features, and produce reports.
Requirements:
[ ] Within src, create an ml folder following the structure below, organizing ML components.
[ ] Implement a Makefile with key commands for managing ML workflows.
[ ] Define essential files such as README.md, requirements.txt, pyproject.toml, and placeholder scripts for key tasks.
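The checklist above can be checked mechanically once the folder exists. The following is a minimal sketch, not part of the issue: the function name `missing_required` is hypothetical, and the file list is taken from the requirements (Makefile, README.md, requirements.txt, pyproject.toml).

```python
from pathlib import Path

# Top-level files the requirements name explicitly.
REQUIRED_FILES = ("Makefile", "README.md", "requirements.txt", "pyproject.toml")


def missing_required(ml_root: Path) -> list[str]:
    """Return the required files that are absent from ml_root, in checklist order."""
    return [name for name in REQUIRED_FILES if not (ml_root / name).is_file()]
```

Calling `missing_required(Path("src/ml"))` from the repository root returns an empty list when every required file is present.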
Required Directory and File Structure
src/
├── ml/ # Main ML project directory
│ ├── LICENSE # License for the ML project
│ ├── Makefile # Commands for ML workflows, e.g., `make data`, `make train`
│ ├── README.md # Overview of the ML project and setup instructions
│ │
│ ├── data/ # Data directory with raw, interim, processed datasets
│ │ ├── external/ # Data from third-party sources
│ │ ├── interim/ # Intermediate, transformed data
│ │ ├── processed/ # Final datasets ready for modeling
│ │ └── raw/ # Original datasets (unaltered)
│ │
│ ├── docs/ # Documentation, e.g., model descriptions, notes
│ │
│ ├── models/ # Trained models, serialized versions, and model summaries
│ │
│ ├── notebooks/ # Jupyter notebooks for EDA, model experimentation
│ │ └── 1.0-your_initials-description.ipynb
│ │
│ ├── pyproject.toml # Project configuration and dependencies for ML tools
│ │
│ ├── references/ # Manuals, data dictionaries, and relevant references
│ │
│ ├── reports/ # Analysis reports in HTML, PDF, etc.
│ │ └── figures/ # Figures for analysis and reports
│ │
│ ├── requirements.txt # Dependencies for the ML environment
│ │
│ └── {{ module_name }}/ # Core Python module for ML logic
│     ├── __init__.py # Initializes {{ module_name }} as a package
│     ├── config.py # ML-specific configurations and constants
│     ├── dataset.py # Script for data download, transformation, and loading
│     ├── features.py # Functions for feature engineering and selection
│     ├── modeling/ # Model training and prediction
│     │ ├── __init__.py
│     │ ├── train.py # Model training functions
│     │ └── predict.py # Model inference functions
│     └── plots.py # Visualization functions for model insights
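The tree above can be scaffolded with a short script rather than by hand. This is a sketch under stated assumptions: the function name `scaffold` and the default package name `ml_core` are hypothetical (the issue leaves `{{ module_name }}` unspecified), and every file is created as an empty placeholder.

```python
from pathlib import Path

# Hypothetical default; the issue does not fix a value for {{ module_name }}.
DEFAULT_MODULE_NAME = "ml_core"

# Directories from the tree, relative to src/ml.
DIRECTORIES = (
    "data/external",
    "data/interim",
    "data/processed",
    "data/raw",
    "docs",
    "models",
    "notebooks",
    "references",
    "reports/figures",
)

# Top-level files from the tree, relative to src/ml.
TOP_LEVEL_FILES = ("LICENSE", "Makefile", "README.md", "pyproject.toml", "requirements.txt")

# Files inside the core Python package, relative to src/ml/<module_name>.
MODULE_FILES = (
    "__init__.py",
    "config.py",
    "dataset.py",
    "features.py",
    "plots.py",
    "modeling/__init__.py",
    "modeling/train.py",
    "modeling/predict.py",
)


def scaffold(src_root: Path, module_name: str = DEFAULT_MODULE_NAME) -> Path:
    """Create the ml/ layout under src_root and return the path to ml/."""
    ml_root = src_root / "ml"
    for rel in DIRECTORIES:
        (ml_root / rel).mkdir(parents=True, exist_ok=True)
    for rel in TOP_LEVEL_FILES:
        # touch() leaves the content of an existing file unchanged.
        (ml_root / rel).touch()
    for rel in MODULE_FILES:
        target = ml_root / module_name / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    return ml_root
```

Running `scaffold(Path("src"))` from the repository root is safe to repeat: existing directories are kept and existing files are not overwritten. Folders the project does not need can simply be dropped from `DIRECTORIES`.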
Examples and Notes
Makefile:
README.md:
requirements.txt:
Edge Cases
data/raw/ if required.