AlertaDengue / AlertTools

GNU General Public License v3.0
1 stars 0 forks source link

Improve sql queries : case aggregation #11

Open claudia-codeco opened 4 months ago

claudia-codeco commented 4 months ago

Problem

Currently, line list of cases are queried and then aggregated. In order to include group_by in the sql query, we need to compute the epiweek.

Why do that

To optimize consultation, this is also useful for the webpage

current code (AlertTools::getCases.R)

     comando <- paste("SELECT * from \"Municipio\".\"Notificacao\" WHERE dt_digita <= '",lastday, 
                         "' AND dt_digita >= '",firstday, "' AND municipio_geocodigo IN (", sqlcity, 
                         ") AND cid10_codigo IN(", sqlcid,")", sep="")

        dd <- dbGetQuery(datasource,comando)

        dd$se_sin_pri <- epiweek(as.Date(dd$dt_sin_pri, format = "%Y-%m-%d"))
            dd <- dd %>% 
                  mutate(ano_sinpri = epiyear(dt_sin_pri),
                         SE = ano_sinpri*100+se_sin_pri)

 casos <- dd %>% 
            group_by(municipio_geocodigo, SE) %>%
            summarise(
                  casos = length(classi_fin),
                  cas_prov = sum(classi_fin != 5, na.rm = TRUE),
                  cas_lab = sum(classi_fin != 5 & criterio == 1 , na.rm = TRUE))
fccoelho commented 4 months ago

I suggest creating a view on the database like this (untested code):

CREATE VIEW casos as SELECT 
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND classi_fin <> '') AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita <= {lastday} AND dt_digita >={firstday} AND municipio_geocodigo IN ({sqlcity}) AND cid10_codigo IN({sqlcid});

@claudia-codeco, is this the query you need?

claudia-codeco commented 4 months ago

almost there. we need to group_by epiweek using dt_sin_pri

fccoelho commented 4 months ago

It seems that first creating this aggregation by municipio_geocódigo and epiweek as a separate view, named say weekly_cases_by_city will simplify the query above which can then select from this view, instead of "Municipio"."Notificação".

@luabida can you finalize this, please?

claudia-codeco commented 4 months ago

Yes, the final table should have the following columns

municipio_geocodigo year week (epiweek) (1 to 53) casos casos_provaveis casos_lab

fccoelho commented 4 months ago

So it seems that the view has to be something like this:

CREATE VIEW casos as SELECT 
    municipio_geocodigo,
    year,
    SE,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND classi_fin <> '') AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita <= {lastday} AND dt_digita >={firstday} AND municipio_geocodigo IN ({sqlcity}) AND cid10_codigo IN({sqlcid})
GROUP by municipio_geocodigo, year, SE;
luabida commented 3 months ago

@fccoelho @claudia-codeco this is the result from the query with minor adjustments in the column names:

SELECT
    municipio_geocodigo,
    ano_notif,
    se_notif,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5) AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita BETWEEN '2023-01-01' AND '2023-01-03' 
    AND municipio_geocodigo IN (3304557) 
    AND cid10_codigo IN ('A90')
GROUP by municipio_geocodigo, ano_notif, se_notif;

image

The only thing I didn't quite understand is the purpose of having the column epiweek in the Copernicus table, but the table on production has just completed the UPDATE in every row with the epidemiological week:

dengue=# select date, epiweek, geocodigo from weather.copernicus_brasil limit 1;
    date    | epiweek | geocodigo
------------+---------+-----------
 2016-12-13 | 201650  |   2209955
(1 row)
fccoelho commented 3 months ago

There is only one issue: we cannot use se_notif for the group by we need to use the epiweek of dt_sinpri.

So unless we find a way to calculate the epiweek using SQL, we may need to call a pl-python function in this view.

The definition of EpiWeek is this (according to PAHO):

The epidemiological week begins on Sunday and ends on Saturday. The first epidemiological week of the year ends on the first Saturday of January, provided that it falls at least four or more days into the month. Therefore, the first epidemiological week may actually begin in December of the previous year.
fccoelho commented 3 months ago

Adding the epiweek in copernicus table is important to allow for grouping by epiweek.

luabida commented 3 months ago

I tried to keep the exact logic as the epiweek does, the system parameter is not quite necessary, but it may be portable to windows if necessary too

SQL Function with plpython3u:

CREATE OR REPLACE FUNCTION extract_SE(date DATE)
RETURNS INT AS $$
from datetime import date as dt

def _system_adjustment(system: str) -> int:
    systems = ("iso", "cdc")  # Monday, Sunday
    return systems.index(system.lower())

def _year_start(year: int, system: str) -> int:
    adjustment = _system_adjustment(system)
    mid_weekday = 3 - adjustment  # Sun is 6 .. Mon is 0
    jan1 = dt(year, 1, 1)
    jan1_ordinal = jan1.toordinal()
    jan1_weekday = jan1.weekday()
    week1_start_ordinal = jan1_ordinal - jan1_weekday - adjustment
    if jan1_weekday > mid_weekday:
        week1_start_ordinal += 7
    return week1_start_ordinal

def fromdate(date: dt, system: str = "cdc") -> int:
    if isinstance(date, str):
        date = dt.fromisoformat(date)
    year = date.year
    date_ordinal = date.toordinal()
    year_start_ordinal = _year_start(year, system)
    week = (date_ordinal - year_start_ordinal) // 7
    if week < 0:
        year -= 1
        year_start_ordinal = _year_start(year, system)
        week = (date_ordinal - year_start_ordinal) // 7
    elif week >= 52:
        year_start_ordinal = _year_start(year + 1, system)
        if date_ordinal >= year_start_ordinal:
            year += 1
            week = 0
    week += 1
    return int(str(year) + f"{week:02d}")

return fromdate(date, "cdc")
$$ LANGUAGE plpython3u;

Updated query:

SELECT
    municipio_geocodigo,
    ano_notif,
    extract_SE(dt_sin_pri) as SE,
    count(classi_fin) FILTER(WHERE classi_fin IS NOT NULL) AS casos_suspeitos,
    count(classi_fin) FILTER(WHERE classi_fin <> 5) AS casos_provaveis,
    count(classi_fin) FILTER(WHERE classi_fin <> 5 AND criterio = 1) as casos_lab
FROM  "Municipio"."Notificacao"
WHERE
    dt_digita BETWEEN '2023-01-01' AND '2023-01-03'
    AND municipio_geocodigo IN (3304557)
    AND cid10_codigo IN ('A90')
GROUP by municipio_geocodigo, ano_notif, extract_SE(dt_sin_pri);

Result: image

fccoelho commented 3 months ago

We should add this python function to the setup script for the infodengue database so that we always have access to it. in case of a redeploy somewhere.