CodeForPhilly / clean-and-green-philly

Dashboard to target Philly vacant properties for anti-gun violence interventions
https://www.cleanandgreenphilly.org/
MIT License

Task: Reconfigure the Docker setup to incorporate the timescaledb extension #997

Open · nlebovits opened this issue 1 day ago

nlebovits commented 1 day ago

Add TimescaleDB Extension and Configure Hypertables for Time Series Analysis

Describe the Task

Our current PostgreSQL instance is not optimized for time series data. Currently, the script dumps the existing Postgres schema into a backup schema named with the date it was created, and then loads the new data. Instead, we should switch to a single schema with the timescaledb extension enabled. We can then convert our main tables to hypertables, partitioned monthly, rather than creating a new backup schema on every run. Additionally, we want to implement compression policies for data older than one year to optimize storage.
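As a rough sketch (the table and column names here are taken from the draft script.py below and may change), the conversion could look something like this in SQL:

-- Enable the extension once per database (also done in init_pg.sql)
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Convert an existing table to a hypertable with monthly chunks,
-- migrating any rows already in it
SELECT create_hypertable(
    'vacant_properties_end',
    'create_date',
    chunk_time_interval => INTERVAL '1 month',
    migrate_data        => true
);

-- Compress chunks once they are more than a year old
ALTER TABLE vacant_properties_end SET (timescaledb.compress);
SELECT add_compression_policy('vacant_properties_end', INTERVAL '1 year');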

As an optional improvement, consider adding spatial indexing to tables containing geospatial data. If this is implemented, please document the process and any decisions made.
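For example, assuming the geometry column written by GeoPandas keeps its default name of geometry, a GiST index on the main table would look something like:

CREATE INDEX IF NOT EXISTS vacant_properties_end_geom_idx
    ON vacant_properties_end
    USING GIST (geometry);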

Another optional but beneficial addition is setting up pg_stat to monitor query performance and track table growth over time.
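Assuming this refers to the pg_stat_statements extension, it would need to be added to shared_preload_libraries alongside timescaledb in Dockerfile-pg; the queries below are one possible starting point:

-- Requires shared_preload_libraries = 'timescaledb,pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest queries by total execution time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Table sizes, as a rough proxy for growth over time
SELECT relname, pg_total_relation_size(relid) AS total_bytes
FROM pg_stat_user_tables
ORDER BY total_bytes DESC;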

Acceptance Criteria

Additional Context

Existing Work

I've already put some work into this. Here's what I have so far:

  1. My docker-compose.yml
services:
  vacant-lots-proj-experimental:
    build:
      context: .
      dockerfile: Dockerfile
    image: vacant-lots-proj-experimental:latest
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/app/service-account-key.json
      - CFP_MAPBOX_TOKEN_UPLOADER
      - VACANT_LOTS_DB
      - CLEAN_GREEN_GOOGLE_KEY
      - PYTHONUNBUFFERED=1
      - GOOGLE_CLOUD_BUCKET_NAME
      - GOOGLE_CLOUD_PROJECT
      - CAGP_SLACK_API_TOKEN
    volumes:
      - ./src:/usr/src/app
      - ~/.config/gcloud/application_default_credentials.json:/app/service-account-key.json
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    extra_hosts:
      - host.docker.internal:host-gateway
    network_mode: 'host'

  formatter-experimental:
    build:
      context: .
    volumes:
      - ./src:/usr/src/app
    command: sh -c "pip install ruff && ruff format --exclude '/usr/src/app/awkde/'"
    network_mode: 'host'

  linter-experimental:
    build:
      context: .
    volumes:
      - ./src:/usr/src/app
    command: sh -c "pip install ruff && ruff check --fix --exclude '/usr/src/app/awkde/'"
    network_mode: 'host'

  streetview-experimental:
    build:
      context: .
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/app/service-account-key.json
      - CLEAN_GREEN_GOOGLE_KEY
      - VACANT_LOTS_DB
      - GOOGLE_CLOUD_BUCKET_NAME
    volumes:
      - ./src:/usr/src/app
      - ~/.config/gcloud/application_default_credentials.json:/app/service-account-key.json
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    command: sh -c "pipenv run python streetview.py"
    extra_hosts:
      - host.docker.internal:host-gateway
    network_mode: 'host'

  postgres-experimental:
    container_name: cagp-postgres-experimental
    build:
      context: .
      dockerfile: Dockerfile-pg
    environment:
      - PGPORT=5433
      - POSTGRES_PASSWORD
    restart: always
    ports:
      - '5433:5433'
    volumes:
      - database_volume:/var/lib/postgresql/data
      - ./init_pg.sql:/docker-entrypoint-initdb.d/init_pg.sql
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    extra_hosts:
      - host.docker.internal:host-gateway

volumes:
  database_volume:
  2. My Dockerfile
# Use an official Python 3.11 runtime as a parent image
FROM python:3.11.4

# Set the working directory in the container
WORKDIR /usr/src/app

# Install system dependencies for GDAL and Tippecanoe
RUN apt-get update && apt-get install -y \
    libgdal-dev \
    gcc \
    git \
    build-essential \
    libsqlite3-dev \
    zlib1g-dev \
    lsb-release \
    && rm -rf /var/lib/apt/lists/*

# Install postgresql-client-16 for the psql and pg_dump executables used for backups.
# The client version should match the Postgres version installed in Dockerfile-pg.
RUN sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
RUN curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg
RUN apt-get update
RUN apt-get install -y postgresql-client-16

# Set GDAL environment variables
ENV GDAL_VERSION=3.6.2
ENV GDAL_CONFIG=/usr/bin/gdal-config

# Install Pipenv
RUN pip install pipenv

# Copy the Pipfile and Pipfile.lock from the src directory
COPY src/Pipfile src/Pipfile.lock ./

# Regenerate the lock file so it stays in sync with the Pipfile
# (must run after the Pipfile has been copied in)
RUN pipenv lock

# Install the dependencies from the lock file
RUN pipenv install --deploy --ignore-pipfile

# Clone and build Tippecanoe
RUN git clone https://github.com/felt/tippecanoe.git \
    && cd tippecanoe \
    && make \
    && make install

# Copy the src directory (including awkde)
COPY src/ .

# Install awkde within the Pipenv environment
# Assuming awkde is located in /usr/src/app/awkde
WORKDIR /usr/src/app/awkde
RUN pipenv run pip install .

# Change back to the original working directory
WORKDIR /usr/src/app

# Use Pipenv to run the script
# Adjust the path to your main Python script if needed
CMD ["pipenv", "run", "python", "./script.py"]
  3. My Dockerfile-pg
FROM postgres:16-bullseye

LABEL maintainer="PostGIS Project - https://postgis.net" \
      org.opencontainers.image.description="PostGIS with PostgreSQL 16 and TimescaleDB" \
      org.opencontainers.image.source="https://github.com/postgis/docker-postgis"

ENV POSTGIS_MAJOR=3

# Install dependencies and set up TimescaleDB repository
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
           gnupg \
           postgresql-common \
           apt-transport-https \
           lsb-release \
           wget \
           ca-certificates \
    && yes | /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh \
    && echo "deb https://packagecloud.io/timescale/timescaledb/debian/ $(lsb_release -c -s) main" > /etc/apt/sources.list.d/timescaledb.list \
    && wget --quiet -O - https://packagecloud.io/timescale/timescaledb/gpgkey | gpg --dearmor -o /etc/apt/trusted.gpg.d/timescaledb.gpg \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
           postgresql-16-postgis-3 \
           postgresql-16-postgis-3-scripts \
           timescaledb-2-postgresql-16 \
           postgresql-client-16 \
    && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /docker-entrypoint-initdb.d

# After the TimescaleDB installation, ensure the configuration is updated
RUN echo "shared_preload_libraries = 'timescaledb'" >> /usr/share/postgresql/postgresql.conf.sample
  4. My init_pg.sql
CREATE DATABASE vacantlotdb;
\c vacantlotdb;
CREATE EXTENSION postgis;
CREATE EXTENSION timescaledb;
  5. My new draft of script.py (see the note after this listing)
import os
import sys
from datetime import datetime

import pandas as pd
from config.psql import conn
from data_utils.access_process import access_process
from data_utils.city_owned_properties import city_owned_properties
from data_utils.community_gardens import community_gardens
from data_utils.conservatorship import conservatorship
from data_utils.contig_neighbors import contig_neighbors
from data_utils.deliquencies import deliquencies
from data_utils.dev_probability import dev_probability
from data_utils.drug_crimes import drug_crimes
from data_utils.gun_crimes import gun_crimes
from data_utils.imm_dang_buildings import imm_dang_buildings
from data_utils.l_and_i import l_and_i
from data_utils.nbhoods import nbhoods
from data_utils.negligent_devs import negligent_devs
from data_utils.opa_properties import opa_properties
from data_utils.owner_type import owner_type
from data_utils.park_priority import park_priority
from data_utils.phs_properties import phs_properties
from data_utils.ppr_properties import ppr_properties
from data_utils.priority_level import priority_level
from data_utils.rco_geoms import rco_geoms
from data_utils.tactical_urbanism import tactical_urbanism
from data_utils.tree_canopy import tree_canopy
from data_utils.unsafe_buildings import unsafe_buildings
from data_utils.vacant_properties import vacant_properties
from sqlalchemy import text

# Ensure the directory containing awkde is in the Python path
awkde_path = "/usr/src/app"
if awkde_path not in sys.path:
    sys.path.append(awkde_path)

services = [
    city_owned_properties,
    phs_properties,
    l_and_i,
    rco_geoms,
    tree_canopy,
    nbhoods,
    gun_crimes,
    drug_crimes,
    deliquencies,
    opa_properties,
    unsafe_buildings,
    imm_dang_buildings,
    tactical_urbanism,
    conservatorship,
    owner_type,
    community_gardens,
    park_priority,
    ppr_properties,
    contig_neighbors,
    dev_probability,
    negligent_devs,
]

# Load Vacant Property Data
dataset = vacant_properties()

# Load and join other datasets
for service in services:
    dataset = service(dataset)

# Additional processing and print statements
before_drop = dataset.gdf.shape[0]
dataset.gdf = dataset.gdf.drop_duplicates(subset="opa_id")
after_drop = dataset.gdf.shape[0]
print(
    f"Duplicate dataset rows dropped after initial services: {before_drop - after_drop}"
)

# Add Priority Level
dataset = priority_level(dataset)
print("Distribution of priority level:")
print(dataset.gdf["priority_level"].value_counts())

# Add Access Process
dataset = access_process(dataset)
print("Distribution of access process:")
print(dataset.gdf["access_process"].value_counts())

before_drop = dataset.gdf.shape[0]
dataset.gdf = dataset.gdf.drop_duplicates(subset="opa_id")
after_drop = dataset.gdf.shape[0]
print(f"Duplicate final dataset rows dropped: {before_drop - after_drop}")

# Add create_date column
today_date = datetime.now().strftime("%m-%d-%Y")
dataset.gdf["create_date"] = today_date
dataset.gdf["create_date"] = pd.to_datetime(dataset.gdf["create_date"])

os.makedirs("tmp", exist_ok=True)
dataset.gdf.to_file("tmp/full_dataset.geojson", driver="GeoJSON")
print("Final dataset saved to tmp/ folder.")

# back up old tiles file whether we are reloading data or not
# if backup is None:
#     backup = BackupArchiveDatabase()
# backup.backup_tiles_file()

# Commit the data to PostgreSQL
dataset.gdf.to_postgis("vacant_properties_end", conn, if_exists="replace", index=False)
conn.commit()
print("Data committed to PostgreSQL.")

# Check if the table is already a hypertable and create it if not
with conn.begin():  # Use conn directly if it's an existing Connection object
    result = conn.execute(
        text(
            "SELECT * FROM timescaledb_information.hypertables WHERE hypertable_name = 'vacant_properties_end';"
        )
    )
    if result.rowcount == 0:
        conn.execute(
            text(
                "SELECT create_hypertable('vacant_properties_end', 'create_date', migrate_data => true);"
            )
        )
        print("Table converted to a TimescaleDB hypertable.")
    else:
        print("Table is already a TimescaleDB hypertable.")

# Post to Mapbox
# dataset.build_and_publish(tiles_file_id_prefix)

# if we are reloading, run the diff report, then archive the backup and finally prune old archives
# if FORCE_RELOAD:
#     diff_report = DiffReport(timestamp_string=backup.timestamp_string)
#     diff_report.run()
#     backup.archive_backup_schema()
#     conn.commit()
#     backup.prune_old_archives()
#     conn.commit()

conn.close()

# except Exception as e:
#     error_message = f"Error in backend job: {str(e)}\n\n{traceback.format_exc()}"
#     send_error_to_slack(error_message)
#     raise  # Optionally re-raise the exception
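One note on the draft script: create_hypertable is called with its default chunk interval (seven days) rather than the monthly partitioning described above, and no compression policy is set yet. Assuming the table keeps the names used in the draft, the script would also need to run something along these lines (e.g. via conn.execute(text(...))):

-- Switch the hypertable to monthly chunks (applies to newly created chunks)
SELECT set_chunk_time_interval('vacant_properties_end', INTERVAL '1 month');

-- Compress chunks older than one year
ALTER TABLE vacant_properties_end SET (timescaledb.compress);
SELECT add_compression_policy('vacant_properties_end', INTERVAL '1 year', if_not_exists => true);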

Please create a draft PR that includes strong documentation, including:

rmartinsen commented 17 hours ago

I'd love to take a shot at this one and get involved with the project in general. After talking last night and looking through the issue, I'm starting to get a feel for it.

One caveat: I'll be out of town this weekend and won't be able to really sink my teeth into it until the latter part of next week. Will that be an issue?

nlebovits commented 16 hours ago

Not an issue at all, @rmartinsen! I've assigned the ticket to you.