NewGraphEnvironment / fish_passage_skeena_2023_reporting

https://newgraphenvironment.github.io/fish_passage_skeena_2023_reporting
Creative Commons Zero v1.0 Universal
0 stars 2 forks source link

`date_time_start` changing when importing `form_pscis_2023.gpkg` #49

Closed lucy-schick closed 3 months ago

lucy-schick commented 5 months ago

Maybe i'm missing something super simple but i'm stumped

Issue: The date:time is changing in the date_time_start column when reading in the form_pscis_2023.gpkg from Q. Using site 8478 as an example, in Q date_time_start = 2023-09-19 13:30:02 (UTC) but when you read it in it changes to 2023-09-19 06:30:02, it basically added 5hr to the time. If the time is before 12 then it adds 7hrs (see site 8518...super weird. The issue seems to be occurring when reading in the gpkg. I can burn the gpkg to Q and the dates stay the same, but when reading it back in the times change. I am aware that in Q the timezone is set to UTC, which is not correct, it should be PDT (as seen in the 2022 skeena data).

Screen Shot 2024-03-25 at 3 37 37 PM Screen Shot 2024-03-25 at 3 41 31 PM

How to reproduce issue:

  1. read in the backed up form_pscis_2023.csv fromdata/backup/ which has the correct date:time
  2. read in the form_pscis_2023.gpkg
  3. compare date:times. The times in form_pscis_2023.gpkg have +7hrs added to them.

What I've tried:

Reprex:

library(tidyverse)
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(lubridate)

dir_project <- 'sern_skeena_2023'

form_pscis <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg'))
#> Reading layer `form_pscis_2023' from data source 
#>   `/Users/lucyschick/Projects/gis/sern_skeena_2023/data_field/2023/form_pscis_2023.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 59 features and 98 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 842859 ymin: 1008863 xmax: 958901.9 ymax: 1169499
#> Projected CRS: NAD83 / BC Albers

form_redo <- read_csv("~/Projects/repo/fish_passage_skeena_2023_reporting/data/backup/form_pscis_2023.csv") %>%
  arrange(site_id)
#> Rows: 59 Columns: 98
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (39): crew_members, my_priority, assessment_comment, condition_notes, c...
#> dbl  (39): site_id, moti_chris_culvert_id, moti_chris_culvert_id2, pscis_cro...
#> lgl  (18): moti_chris_culvert_id3, my_citation_key1, my_citation_key2, my_ci...
#> dttm  (1): date_time_start
#> date  (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#join correct date_time_start from the backup csv to form_pscis
redo <- form_pscis %>%
  dplyr::select(-date_time_start) %>% # Exclude the old column
  left_join(form_redo %>% dplyr::select(site_id, date_time_start), by = 'site_id') %>%
  relocate(date_time_start, .after = crew_members)

# this is the same code from pscis_tidy.R. I am running this because we edit the date_time_start column so this could be where the issue is coming from, but I don't think so.
redo <- redo %>%
  #split date time column into date and time
  dplyr::mutate(date_time_start = lubridate::ymd_hms(date_time_start),
                date = lubridate::date(date_time_start),
                time = hms::as_hms(date_time_start)) %>%
  # filter out to get only the records newly created
  filter(!is.na(date_time_start)) %>%
  mutate(
    site_id = case_when(is.na(pscis_crossing_id) ~ my_crossing_reference,
                        T ~ pscis_crossing_id)
  ) %>%
  # remove the form making site
  filter(site_id != '12345') %>%
  arrange(site_id)

#testing time zone
tz(redo$date_time_start)
#> [1] "UTC"

# clean up data fields to make copy and paste to prov template easier
redo <- redo %>%
  # some columns that have yes/no answers have NA values in mergin, need to change to No
  # need to add 'No' as default values to mergin
  mutate(across(contains('yes_no'), ~replace_na(.,'No'))) %>%
  # some numeric fields for CBS have NA values when a user input 0
  mutate(across(c(outlet_drop_meters, outlet_pool_depth_0_01m, culvert_slope_percent, stream_slope),
                ~case_when(crossing_type == 'Closed Bottom Structure' ~replace_na(.,0),
                           TRUE ~ .
                ))) %>%
  # change 'trib' to long version 'Tributary'
  mutate(stream_name = str_replace_all(stream_name, 'Trib ', 'Tributary ')) %>%
  # change 'Hwy' to 'Highway'
  mutate(road_name = str_replace_all(road_name, 'Hwy ', 'Highway '))

# add in which phase of assessment the site is in, did the reassessment sites by hand (because there was only 3-4) so thats not in here.
# This only works when only phase 1 sites have a priority ranking, could use some updating but works for now.
redo <- redo %>%
  mutate(source = case_when(
    my_priority == 'phase 2' ~ 'phase2',
    my_priority == 'high' ~ 'phase1',
    my_priority == 'medium' ~ 'phase1',
    my_priority == 'low' ~ 'phase1',
    is.na(my_priority) ~ 'phase1',
    T ~ source))

# burn cleaned copy to QGIS project gpkg
redo %>%
  sf::st_write(paste0('~/Projects/gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg'), append=F, delete_dsn=T)
#> Warning in clean_columns(as.data.frame(obj), factorsAsCharacter): Dropping
#> column(s) time of class(es) hms;difftime
#> Deleting source `/Users/lucyschick/Projects/gis/sern_skeena_2023/data_field/2023/form_pscis_2023.gpkg' using driver `GPKG'
#> Writing layer `form_pscis_2023' to data source 
#>   `/Users/lucyschick/Projects/gis/sern_skeena_2023/data_field/2023/form_pscis_2023.gpkg' using driver `GPKG'
#> Writing 59 features with 98 fields and geometry type Point.

Created on 2024-03-25 with reprex v2.1.0

NewGraphEnvironment commented 5 months ago

Looks like the core issue is that the time is in the right zone (PDT) in the initial form in the Forms group of QGIS but then it was somehow converted to UTC in the Project Specific/Field Work group.

There is no issue when reading in the original form with

form_pscis2 <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/form_pscis.gpkg'))

For this years data we could perhaps do one of the following. ordered based on first impression of preference

  1. convert the times in our paste0('~/Projects/gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg') from UTC to values that are equivalent to PDT. Seems like this should be a common thing that needs to happen time to time (punny).
  2. read in the original form and left_join our times from paste0('~/Projects/gis/', dir_project, '/form_pscis.gpkg') to the paste0('~/Projects/gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg') object. Would need a QA to be sure we have all the values and that they match the original form.
  3. Convert the times in Q using the Field Calculator.

Once this is dealt with we will need an issue somewhere to address the actual cause of the problem which is the mismatch between the two tables perhaps occurring with below from 01_pscis_tidy.R:

# burn cleaned copy to QGIS project gpkg, the pscis clean section can be repeated again when changes are made in Q
form_pscis_cleaned %>%
  sf::st_write(paste0('../../gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg'), append=F, delete_dsn=T)

This seems to work with only one time deviation of less than a second.

library(tidyverse)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.3, PROJ 9.3.1; sf_use_s2() is TRUE

dir_project <- 'sern_skeena_2023'

form_pscis <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg')) %>% 
  dplyr::mutate(date_time_start_repaired = date_time_start + hours(7)) %>% 
  select(contains("date_time"), everything())
#> Reading layer `form_pscis_2023' from data source 
#>   `/Users/airvine/Projects/gis/sern_skeena_2023/data_field/2023/form_pscis_2023.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 59 features and 91 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 842859 ymin: 1008863 xmax: 958901.9 ymax: 1169499
#> Projected CRS: NAD83 / BC Albers

form_pscis_og <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/form_pscis.gpkg')) %>% 
  select(contains("date_time"), everything())
#> Reading layer `form_pscis' from data source 
#>   `/Users/airvine/Projects/gis/sern_skeena_2023/form_pscis.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 60 features and 88 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 842859 ymin: 447409.8 xmax: 1036232 ymax: 1169510
#> Projected CRS: NAD83 / BC Albers

qa <- left_join(
  form_pscis %>% 
    select(contains("date_time"), pscis_crossing_id, my_crossing_reference),
  form_pscis_og %>%
    st_drop_geometry() %>% 
    select(date_time_start_og = date_time_start, pscis_crossing_id, my_crossing_reference),
  by = c("pscis_crossing_id", "my_crossing_reference")
) %>% 
  mutate(time_diff = date_time_start_og - date_time_start_repaired)

max(qa$time_diff)
#> Time difference of 0.4919996 secs
NewGraphEnvironment commented 5 months ago

omg - interesting. Seems like better way to do it is to use force_tz and with_tz since it takes into account daylight savings time!!! Check out the weird thing that hapens when date_time_start is renamed to date_time_start_og though

library(tidyverse)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.3, PROJ 9.3.1; sf_use_s2() is TRUE

dir_project <- 'sern_skeena_2023'

form_pscis <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg')) %>% 
  dplyr::mutate(date_time_start_repaired = date_time_start + hours(7)) %>% 
  select(contains("date_time"), everything())
#> Reading layer `form_pscis_2023' from data source 
#>   `/Users/airvine/Projects/gis/sern_skeena_2023/data_field/2023/form_pscis_2023.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 59 features and 91 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 842859 ymin: 1008863 xmax: 958901.9 ymax: 1169499
#> Projected CRS: NAD83 / BC Albers

form_pscis_og <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/form_pscis.gpkg')) %>% 
  select(contains("date_time"), everything())
#> Reading layer `form_pscis' from data source 
#>   `/Users/airvine/Projects/gis/sern_skeena_2023/form_pscis.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 60 features and 88 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 842859 ymin: 447409.8 xmax: 1036232 ymax: 1169510
#> Projected CRS: NAD83 / BC Albers

qa <- left_join(
  form_pscis %>% 
    select(contains("date_time"), pscis_crossing_id, my_crossing_reference),
  form_pscis_og %>%
    st_drop_geometry() %>% 
    select(date_time_start_og = date_time_start, pscis_crossing_id, my_crossing_reference),
  by = c("pscis_crossing_id", "my_crossing_reference")
) %>% 
  mutate(time_diff = date_time_start_og - date_time_start_repaired)

max(qa$time_diff)
#> Time difference of 0.4919996 secs

# lets see if we can alter the timezone of the date_time_start column
# this is a better way to do it since it takes into account daylight savings time!!!
form_pscis_zone <- form_pscis %>% 
  mutate(date_time_start_zone_utc= force_tz(date_time_start, tzone = "America/Vancouver")) %>% 
  mutate(date_time_start_zone_pdt = with_tz(date_time_start_zone_utc, tzone = "UTC")) %>% 
  select(contains("date_time"), pscis_crossing_id, my_crossing_reference)

qa2 <- left_join(
  form_pscis_zone %>% 
    select(contains("date_time"), pscis_crossing_id, my_crossing_reference),
  form_pscis_og %>%
    st_drop_geometry() %>% 
    select(date_time_start_og = date_time_start, pscis_crossing_id, my_crossing_reference),
  by = c("pscis_crossing_id", "my_crossing_reference")
) %>% 
  mutate(time_diff = date_time_start - date_time_start_zone_pdt,
         time_diff_og = date_time_start_og - date_time_start_zone_pdt)

max(qa2$time_diff)
#> Time difference of 0 secs
max(qa2$time_diff_og)
#> Time difference of 7.000137 hours
lucy-schick commented 5 months ago

Confused about this part

# lets see if we can alter the timezone of the date_time_start column
# this is a better way to do it since it takes into account daylight savings time!!!
form_pscis_zone <- form_pscis %>% 
  mutate(date_time_start_zone_utc= force_tz(date_time_start, tzone = "America/Vancouver")) %>% 
  mutate(date_time_start_zone_pdt = with_tz(date_time_start_zone_utc, tzone = "UTC")) %>% 
  select(contains("date_time"), pscis_crossing_id, my_crossing_reference)

why does date_time_start_zone_utc contain date_time_start in the PDT timezone, and vice versa with date_time_start_zone_pdt. Maybe just a small mistake but I think it should be

library(tidyverse)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.3, PROJ 9.3.1; sf_use_s2() is TRUE

dir_project <- 'sern_skeena_2023'

form_pscis <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/data_field/2023/form_pscis_2023.gpkg')) %>% 
  dplyr::mutate(date_time_start_repaired = date_time_start + hours(7)) %>% 
  select(contains("date_time"), everything())
#> Reading layer `form_pscis_2023' from data source 
#>   `/Users/airvine/Projects/gis/sern_skeena_2023/data_field/2023/form_pscis_2023.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 59 features and 91 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 842859 ymin: 1008863 xmax: 958901.9 ymax: 1169499
#> Projected CRS: NAD83 / BC Albers

form_pscis_og <- sf::st_read(dsn= paste0('~/Projects/gis/', dir_project, '/form_pscis.gpkg')) %>% 
  select(contains("date_time"), everything())
#> Reading layer `form_pscis' from data source 
#>   `/Users/airvine/Projects/gis/sern_skeena_2023/form_pscis.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 60 features and 88 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 842859 ymin: 447409.8 xmax: 1036232 ymax: 1169510
#> Projected CRS: NAD83 / BC Albers

qa <- left_join(
  form_pscis %>% 
    select(contains("date_time"), pscis_crossing_id, my_crossing_reference),
  form_pscis_og %>%
    st_drop_geometry() %>% 
    select(date_time_start_og = date_time_start, pscis_crossing_id, my_crossing_reference),
  by = c("pscis_crossing_id", "my_crossing_reference")
) %>% 
  mutate(time_diff = date_time_start_og - date_time_start_repaired)

max(qa$time_diff)
#> Time difference of 0.4919996 secs

# lets see if we can alter the timezone of the date_time_start column
# this is a better way to do it since it takes into account daylight savings time!!!
form_pscis_zone <- form_pscis %>% 
  mutate(date_time_start_zone_utc= force_tz(date_time_start, tzone = "UTC")) %>% 
  mutate(date_time_start_zone_pdt = with_tz(date_time_start_zone_utc, tzone = "America/Vancouver")) %>% 
  select(contains("date_time"), pscis_crossing_id, my_crossing_reference)

qa2 <- left_join(
  form_pscis_zone %>% 
    select(contains("date_time"), pscis_crossing_id, my_crossing_reference),
  form_pscis_og %>%
    st_drop_geometry() %>% 
    select(date_time_start_og = date_time_start, pscis_crossing_id, my_crossing_reference),
  by = c("pscis_crossing_id", "my_crossing_reference")
) %>% 
  mutate(time_diff = date_time_start - date_time_start_zone_pdt,
         time_diff_og = date_time_start_og - date_time_start_zone_pdt)

max(qa2$time_diff)
#> Time difference of 7 hours
max(qa2$time_diff_og)
#> Time difference of 14.00014 hours

Or am I misunderstanding this?

UTC is 7 hrs ahead of PDT. So running the above code subtracts 7hrs from date_time_start resulting in the wrong time. We need to add 7hrs to date_time_start to obtain the date_time_start_og. Thinking about this again, is this why you set date_time_start to PDT and then converted it to UTC, which would result in plus 7hrs?? The param names are just confusing me.

lucy-schick commented 5 months ago

Regarding "Check out the weird thing that happens when date_time_start is renamed to date_time_start_og "

This also doesn't make sense to me.

Using site 8478 as an example. time_diff = 06:30:02 - 13:30:02 = 7hrs time_diff_og = 13:30:02 - 13:30:02 = 0

Screen Shot 2024-03-26 at 1 59 59 PM

but running your code shows the opposite. I must be missing something.

mutate(time_diff = date_time_start - date_time_start_zone_pdt,
         time_diff_og = date_time_start_og - date_time_start_zone_pdt)

max(qa2$time_diff)
#> Time difference of 0 secs
max(qa2$time_diff_og)
#> Time difference of 7.000137 hours
NewGraphEnvironment commented 3 months ago

mixed bag of UTC and PDT in 2023 version of form!!! So very strange.... Ha

image

lucy-schick commented 3 months ago

so odd!