Open gsrohde opened 8 years ago
Category 1: Bulk-uploaded traits [1353 out of 32504 rows (~4%)]
Description: The bulk upload wizard sets the date without setting the dateloc or timeloc values (this should be fixed). Moreover, it sets the date value by taking the date string given in the CSV file, interpreting this as midnight in Illinois time on that date, and then converting to the corresponding UTC time. It should be normalized to midnight site time, converted to UTC time.
Note: Some of these weren't necessarily created with the bulk upload wizard. But most or all were created automatically in some way because most occur in groups where the created_at
value varies by only a fraction of a second.
Query for viewing problem rows:
SELECT
t.id,
date,
sitename,
s.city,
state,
s.country,
checked,
u.name as creator,
t.created_at,
t.updated_at
FROM
traits t
LEFT JOIN sites s ON s.id = t.site_id
LEFT JOIN users u ON u.id = t.user_id
WHERE
t.created_at IS NOT NULL
AND date IS NOT NULL
AND time IS NULL
AND date_year IS NULL
AND date_month IS NULL
AND date_day IS NULL
AND time_hour IS NULL
AND time_minute IS NULL
AND dateloc IS NULL
AND timeloc IS NULL
ORDER BY created_at, updated_at;
Solution: Once the time_zone
column has been added to the sites
table, an update statement can be run to correct the date column values and set the dateloc and timeloc column values to 5 and 9, respectively.
SQL for fixing:
UPDATE traits
SET (dateloc, timeloc, date) = (
5,
9,
(
(
(date || ' UTC') :: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Chicago'
) || ' ' || (
SELECT
time_zone
FROM
sites s
WHERE
s.id = site_id
)
) :: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'UTC'
)
WHERE
created_at IS NOT NULL
AND date IS NOT NULL
AND dateloc IS NULL
AND time IS NULL
AND date_year IS NULL
AND date_month IS NULL
AND date_day IS NULL
AND time_hour IS NULL
);
Category 2: No date or time information at all [453 out of 32417 rows (~1.3%)]
Description: These are rows where date
, time
, date_year
, date_month
, date_day
, time_hour
, and time_minute
are all null. Some of them may have values for dateloc
.
Query for viewing problem rows:
SELECT
t.id,
dateloc,
timeloc,
s.city,
state,
s.country,
c.author || ' ' || c.year as citation,
u.name as creator,
t.created_at,
t.updated_at
FROM
traits t
LEFT JOIN sites s ON s.id = t.site_id
LEFT JOIN citations c ON c.id = t.citation_id
LEFT JOIN users u ON u.id = t.user_id
WHERE
date IS NULL
AND time IS NULL
AND date_year IS NULL
AND date_month IS NULL
AND date_day IS NULL
AND time_hour IS NULL
AND time_minute IS NULL
AND (dateloc !=9 OR timeloc != 9);
Solution: Set dateloc
and timeloc
to 9 for all of these rows.
SQL statement for fixing:
UPDATE traits
SET (dateloc, timeloc) = (9, 9)
WHERE
date IS NULL
AND time IS NULL
AND date_year IS NULL
AND date_month IS NULL
AND date_day IS NULL
AND time_hour IS NULL
AND time_minute IS NULL
AND (dateloc !=9 OR timeloc != 9);
(Note: A trigger function on the traits table automatically updates the updated_at
timestamp.)
Check the id's returned to make sure they all fall in the EBI range.
Category 1: Add time zone field to sites. Adjust time zone.
Category 2: Set dateloc as appropriate.
No date or time information at all
That is okay. We don't always know this.
@robkooper EBI range is those less than one billion, correct? Right not, the highest trait id number is 53236.
Correct.
From: Scott Rohde notifications@github.com Reply-To: PecanProject/bety-qaqc reply@reply.github.com Date: Monday, May 23, 2016 at 11:24 AM To: PecanProject/bety-qaqc bety-qaqc@noreply.github.com Cc: Rob Kooper kooper@illinois.edu, Mention mention@noreply.github.com Subject: Re: [PecanProject/bety-qaqc] Fix erroneous or obsolete dateloc values in traits table (#5)
@robkooperhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_robkooper&d=CwMCaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=GPpFhJp-aPNSB_NnySDOIo5hW6fVFqzencXZmCAUqJs&m=zH3HhgqAkP8B3xAR0Ak9irZiyNvBzLj6XdCyWbguy1U&s=kG9XBnkouDRoRdXAdBhJ3uaMgg1L3Hv67oGSxC_zA-s&e= EBI range is those less than one billion, correct? Right not, the highest trait id number is 53236.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_PecanProject_bety-2Dqaqc_issues_5-23issuecomment-2D221022910&d=CwMCaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=GPpFhJp-aPNSB_NnySDOIo5hW6fVFqzencXZmCAUqJs&m=zH3HhgqAkP8B3xAR0Ak9irZiyNvBzLj6XdCyWbguy1U&s=xSENMILjh8xdYxwMXJD0gRU4i5gqAU659GJtQv-Cp6A&e=
I'll use this issue to describe categories of problem entries and my proposal for fixing them.