Multi year data in `wt_summarise_cam()`

VLucet commented 11 months ago

Hi 👋🏼! Here is a bug @eric-jolin and I found in `wt_summarise_cam()`. I've also opened a PR (#40) that fixes it and attempts to improve the code in general (see the PR for those details).

When a user provides data spanning more than one year, the data is aggregated across years. This can produce more than 7 days of effort for a given week, or more than 31 for a given month, which artificially inflates the effort.

```r
times_start <- c("2021-08-01 17:04:40", "2021-09-25 18:37:46", "2021-10-02 16:12:38", "2021-11-02 14:41:04",
                 "2022-04-06 10:12:58", "2022-04-07 12:34:04", "2022-04-22 09:30:52", "2022-04-26 09:54:46",
                 "2022-04-26 15:06:42", "2022-04-27 08:36:27", "2022-04-30 09:30:29", "2022-05-13 10:07:33",
                 "2022-08-10 10:40:17")
times_end <- c("2021-08-01 17:04:41 UTC", "2021-09-25 18:37:46 UTC", "2021-10-02 16:12:38 UTC",
               "2021-11-02 14:41:04 UTC", "2022-04-06 10:13:00 UTC", "2022-04-07 12:57:56 UTC",
               "2022-04-22 09:30:53 UTC", "2022-04-26 09:54:47 UTC", "2022-04-26 15:06:43 UTC",
               "2022-04-27 08:36:28 UTC", "2022-04-30 09:30:31 UTC", "2022-05-13 10:07:34 UTC",
               "2022-08-10 10:40:18 UTC")

raw_dat <- data.frame(detection = 1:13,
                      project_id = "P1",
                      location = "Loc1",
                      species_common_name = "Sp1",
                      image_date_time = times_start,
                      max_animals = 1)

ind_dat <- data.frame(detection = 1:13,
                      project_id = "P1",
                      location = "Loc1",
                      species_common_name = "Sp1",
                      start_time = times_start,
                      end_time = times_end,
                      max_animals = 1)

wt_summarise_cam(detect_data = ind_dat, raw_data = raw_dat,
                 time_interval = "month",
                 variable = "detections",
                 output_format = "long")
```

Which gives:

```
Joining with `by = join_by(project_id, location, month, species_common_name)`
# A tibble: 12 × 7
   project_id location month     n_days_effort species_common_name variable   value
   <chr>      <chr>    <ord>             <int> <chr>               <chr>      <int>
 1 P1         Loc1     January              31 Sp1                 detections     0
 2 P1         Loc1     February             28 Sp1                 detections     0
 3 P1         Loc1     March                31 Sp1                 detections     0
 4 P1         Loc1     April                30 Sp1                 detections     7
 5 P1         Loc1     May                  31 Sp1                 detections     1
 6 P1         Loc1     June                 30 Sp1                 detections     0
 7 P1         Loc1     July                 31 Sp1                 detections     0
 8 P1         Loc1     August               41 Sp1                 detections     2
 9 P1         Loc1     September            30 Sp1                 detections     1
10 P1         Loc1     October              31 Sp1                 detections     1
11 P1         Loc1     November             30 Sp1                 detections     1
12 P1         Loc1     December             31 Sp1                 detections     0
```

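In the output above, August reports 41 days of effort: the 31 days of August 2021 plus the first 10 days of August 2022. The function should group and aggregate by year as well, regardless of the time interval requested. The PR adds that behaviour: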

```
Joining with `by = join_by(project_id, year, month, species_common_name)`
# A tibble: 13 × 8
   project_id location  year month     n_days_effort species_common_name variable   value
   <chr>      <chr>    <dbl> <ord>             <int> <chr>               <chr>      <int>
 1 P1         Loc1      2021 August               31 Sp1                 detections     1
 2 P1         Loc1      2021 September            30 Sp1                 detections     1
 3 P1         Loc1      2021 October              31 Sp1                 detections     1
 4 P1         Loc1      2021 November             30 Sp1                 detections     1
 5 P1         Loc1      2021 December             31 Sp1                 detections     0
 6 P1         Loc1      2022 January              31 Sp1                 detections     0
 7 P1         Loc1      2022 February             28 Sp1                 detections     0
 8 P1         Loc1      2022 March                31 Sp1                 detections     0
 9 P1         Loc1      2022 April                30 Sp1                 detections     7
10 P1         Loc1      2022 May                  31 Sp1                 detections     1
11 P1         Loc1      2022 June                 30 Sp1                 detections     0
12 P1         Loc1      2022 July                 31 Sp1                 detections     0
13 P1         Loc1      2022 August               10 Sp1                 detections     1
```
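To make the difference between the two groupings concrete, here is a minimal base R sketch. It is not the code from the package or the PR: it assumes the camera is operating on every day between the first and last image of the example, and the names `days`, `effort`, `by_month` and `by_year_month` are made up for illustration.

```r
# Assumption: one day of effort for every calendar day between the first
# and last image in the example above (2021-08-01 to 2022-08-10).
days <- seq(as.Date("2021-08-01"), as.Date("2022-08-10"), by = "day")

effort <- data.frame(
  year  = as.integer(format(days, "%Y")),
  month = as.integer(format(days, "%m")),
  n_days_effort = 1L
)

# Grouping by month only pools August 2021 with August 2022.
by_month <- aggregate(n_days_effort ~ month, data = effort, FUN = sum)

# Grouping by year and month keeps the two Augusts separate.
by_year_month <- aggregate(n_days_effort ~ year + month, data = effort, FUN = sum)

by_month[by_month$month == 8, ]            # 41 days of effort for "August"
by_year_month[by_year_month$month == 8, ]  # 31 days in 2021, 10 days in 2022
```

Under that assumption, the month-only grouping reproduces the impossible 41-day August, while adding the year to the grouping gives the 31 and 10 days shown in the corrected output.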

(Also, see the PR for general code cleanliness/coherence improvements that I thought could be useful. Feel free to ignore them if they seem superfluous.)

VLucet commented 11 months ago

~~Looking at this with fresh eyes, there are a few issues, namely that years are treated as independent deployments, which might not be accurate. Back to working on the PR.~~ => Fixed in the latest commit in the PR; I've updated the comment above to reflect that.

mabecker89 commented 10 months ago

Fixed with c5b8e3a49e285e7755bd2bc169c3aed95c2be00a

ABbiodiversity / wildrtrax

Multi year data in `wt_summarise_cam()` #39