PecanProject / bety-qaqc

code for checking data and implementing integrity constraints
2 stars 2 forks source link

Fix erroneous or obsolete dateloc values in traits table #5

Open gsrohde opened 8 years ago

gsrohde commented 8 years ago

I'll use this issue to describe categories of problem entries and my proposal for fixing them.

gsrohde commented 8 years ago

Category 1: Bulk-uploaded traits [1353 out of 32504 rows (~4%)]

Description: The bulk upload wizard sets the date without setting the dateloc or timeloc values (this should be fixed). Moreover, it sets the date value by taking the date string given in the CSV file, interpreting this as midnight in Illinois time on that date, and then converting to the corresponding UTC time. It should be normalized to midnight site time, converted to UTC time.

Note: Some of these weren't necessarily created with the bulk upload wizard. But most or all were created automatically in some way because most occur in groups where the created_at value varies by only a fraction of a second.

Query for viewing problem rows:

SELECT
    t.id,
    date,
    sitename,
    s.city,
    state,
    s.country,
    checked,
    u.name as creator,
    t.created_at,
    t.updated_at
FROM
    traits t
LEFT JOIN sites s ON s.id = t.site_id
LEFT JOIN users u ON u.id = t.user_id
WHERE
    t.created_at IS NOT NULL
AND date IS NOT NULL
AND time IS NULL
AND date_year IS NULL
AND date_month IS NULL
AND date_day IS NULL
AND time_hour IS NULL
AND time_minute IS NULL
AND dateloc IS NULL
AND timeloc IS NULL
ORDER BY created_at, updated_at;

Solution: Once the time_zone column has been added to the sites table, an update statement can be run to correct the date column values and set the dateloc and timeloc column values to 5 and 9, respectively.

SQL for fixing:

UPDATE traits
SET (dateloc, timeloc, date) = (
    5,
    9,
    (
        (
            (date || ' UTC') :: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Chicago'
        ) || ' ' || (
            SELECT
                time_zone
            FROM
                sites s
            WHERE
                s.id = site_id
        )
    ) :: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'UTC'
)
    WHERE
        created_at IS NOT NULL
    AND date IS NOT NULL
    AND dateloc IS NULL
    AND time IS NULL
    AND date_year IS NULL
    AND date_month IS NULL
    AND date_day IS NULL
    AND time_hour IS NULL
);
gsrohde commented 8 years ago

Category 2: No date or time information at all [453 out of 32417 rows (~1.3%)]

Description: These are rows where date, time, date_year, date_month, date_day, time_hour, and time_minute are all null. Some of them may have values for dateloc.

Query for viewing problem rows:

SELECT
    t.id,
    dateloc,
    timeloc,
    s.city,
    state,
    s.country,
    c.author || ' ' || c.year as citation,
    u.name as creator,
    t.created_at,
    t.updated_at
FROM
    traits t
LEFT JOIN sites s ON s.id = t.site_id
LEFT JOIN citations c ON c.id = t.citation_id
LEFT JOIN users u ON u.id = t.user_id
WHERE
    date IS  NULL
AND time IS NULL
AND date_year IS NULL
AND date_month IS NULL
AND date_day IS NULL
AND time_hour IS NULL
AND time_minute IS NULL
AND (dateloc !=9 OR timeloc != 9);

Solution: Set dateloc and timeloc to 9 for all of these rows.

SQL statement for fixing:

UPDATE traits
SET (dateloc, timeloc) = (9, 9)
WHERE
        date IS NULL
    AND time IS NULL
    AND date_year IS NULL
    AND date_month IS NULL
    AND date_day IS NULL
    AND time_hour IS NULL
    AND time_minute IS NULL
    AND (dateloc !=9 OR timeloc != 9);

(Note: A trigger function on the traits table automatically updates the updated_at timestamp.)

robkooper commented 8 years ago

Check the id's returned to make sure they all fall in the EBI range.

dlebauer commented 8 years ago

Category 1: Add time zone field to sites. Adjust time zone.

Category 2: Set dateloc as appropriate.

No date or time information at all

That is okay. We don't always know this.

gsrohde commented 8 years ago

@robkooper EBI range is those less than one billion, correct? Right not, the highest trait id number is 53236.

robkooper commented 8 years ago

Correct.

From: Scott Rohde notifications@github.com Reply-To: PecanProject/bety-qaqc reply@reply.github.com Date: Monday, May 23, 2016 at 11:24 AM To: PecanProject/bety-qaqc bety-qaqc@noreply.github.com Cc: Rob Kooper kooper@illinois.edu, Mention mention@noreply.github.com Subject: Re: [PecanProject/bety-qaqc] Fix erroneous or obsolete dateloc values in traits table (#5)

@robkooperhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_robkooper&d=CwMCaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=GPpFhJp-aPNSB_NnySDOIo5hW6fVFqzencXZmCAUqJs&m=zH3HhgqAkP8B3xAR0Ak9irZiyNvBzLj6XdCyWbguy1U&s=kG9XBnkouDRoRdXAdBhJ3uaMgg1L3Hv67oGSxC_zA-s&e= EBI range is those less than one billion, correct? Right not, the highest trait id number is 53236.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_PecanProject_bety-2Dqaqc_issues_5-23issuecomment-2D221022910&d=CwMCaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=GPpFhJp-aPNSB_NnySDOIo5hW6fVFqzencXZmCAUqJs&m=zH3HhgqAkP8B3xAR0Ak9irZiyNvBzLj6XdCyWbguy1U&s=xSENMILjh8xdYxwMXJD0gRU4i5gqAU659GJtQv-Cp6A&e=