NEFSC / READ-SSB-CHAJI-Effort-Displacement---Scallop


fishset loading data #210

Closed mle2718 closed 11 months ago

mle2718 commented 11 months ago

@BryceMcManus-NOAA,

@mchaji and I are both getting this error message when we try to load data into fishset. We're on fresh installs from gitlabs

https://github.com/NEFSC/READ-SSB-CHAJI-Effort-Displacement---Scallop/blob/f9a75bca6da88b630a0f9af6f6b579fbc9023870/analysis_code/scallop_analysis_0322.Rmd#L155-L170

> load_maindata(dat = final_product_lease, project = project, over_write = TRUE)
Table saved to database
Warning:  142839 failed to parse.Error in FUN(X[[i]], ...) : 
  Date format not recognized. Format date before proceeding

I tried a bit of debugging by doing this:

d_cols <- date_cols(final_product_lease)
lapply(final_product_lease[d_cols], date_parser)
d_cols
[1] "DATE_TRIP"             "Date"                  "Time"                 
[4] "NAME"                  "lease_FS"              "scallop_fishing_yearD"

I was able to run the scallop analysis code on our old server, using an older FishSET install, so I suspect this comes from a recent change in FishSET. In particular, it's a little odd that NAME and lease_FS are being picked up as date fields. These columns hold the names of the wind areas; a value looks something like OCS-A 0538 - Attentive Energy LLC | OCS-A 0538 - Attentive Energy LLC. Both columns have lots of NAs.

Any idea what's going on?

BryceMcManus-NOAA commented 11 months ago

The problem seems to be that the methods used by date_cols() to determine whether a column is a date variable don't anticipate whatever is going on in the data. In the FishSET version of the scallop data it leaves NAME alone (lease_FS doesn't exist). It's hard to know exactly what's going on unless I run your data.

BryceMcManus-NOAA commented 11 months ago

One thing that might help me understand what's going on without sending me the data is pasting the unique values from NAME and lease_FS into the comments. The problem may be that one particular name is throwing date_cols() off.
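One way to produce that summary is sketched below in base R. The data frame `toy` is a made-up stand-in for `final_product_lease`, and its values are illustrative only.

```r
# Toy stand-in for final_product_lease (illustrative values, not the real data)
toy <- data.frame(
  NAME = c("OCS-A 0482 - GSOE I LLC", NA,
           "OCS-A 0487 - Sunrise Wind LLC",
           "OCS-A 0487 - Sunrise Wind LLC"),
  stringsAsFactors = FALSE
)

# Distinct non-missing values, sorted for easy pasting into a comment
name_values <- sort(unique(toy$NAME[!is.na(toy$NAME)]))
print(name_values)

# Count of rows per value; useNA = "ifany" reports NA as its own category
name_counts <- table(toy$NAME, useNA = "ifany")
print(name_counts)
```

The same two calls can be repeated for lease_FS.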

mle2718 commented 11 months ago

This is NAME, but lease_FS is the same:

OCS-A 0482 - GSOE I LLC                                             18
OCS-A 0483 - Virginia Electric and Power Company                     1
OCS-A 0486 - Revolution Wind, LLC                                  358
OCS-A 0487 - Sunrise Wind LLC                                      735
OCS-A 0490 - US Wind Inc.                                           19
OCS-A 0498 - Ocean Wind LLC                                         20
OCS-A 0499 - Atlantic Shores Offshore Wind Projects 1 & 2, LLC's    63
OCS-A 0500 - Bay State Wind LLC                                    258
OCS-A 0501 - Vineyard Wind LLC                                      22
OCS-A 0508 - Avangrid Renewables LLC                                 1
OCS-A 0512 - Empire Offshore Wind, LLC                            1409
OCS-A 0517 - South Fork Wind, LLC                                   37
OCS-A 0519 - Skipjack Offshore Energy LLC                           16
OCS-A 0520 - Beacon Wind LLC                                         8
OCS-A 0521 - Mayflower Wind Energy LLC                              11
OCS-A 0522 - Vineyard Northeast LLC                                  8
OCS-A 0532 - Orsted North America Inc.                              21
OCS-A 0534 - Park City Wind LLC                                     34
OCS-A 0537 - OW Ocean Winds East, LLC                              683
OCS-A 0538 - Attentive Energy LLC                                 1150
OCS-A 0539 - Community Offshore Wind, LLC                         1058
OCS-A 0541 - Atlantic Shores Offshore Wind Bight, LLC             1709
OCS-A 0542 - Invenergy Wind Offshore LLC                           882
OCS-A 0549 - Atlantic Shores Offshore Wind, LLC                     71
Provisional - OCS-A 0544 - Mid-Atlantic Offshore Wind LLC          594

It's definitely this variable. I tried this:

dataset<-final_product_lease
d_cols <- date_cols(dataset)
dataset[d_cols] <- lapply(d_cols, function(d) as.character(dataset[[d]]))
d_cols

test<-c("Time")
#dataset[d_cols] <- lapply(dataset[d_cols], date_parser)

dataset[test] <- lapply(dataset[test], date_parser)

and stepped through the different entries of d_cols. The only columns that throw an error are NAME and lease_FS. Time throws a single, very mysterious "142839 failed to parse" warning. It's surprising because the value of Time in that row is "16:30:00", which is a pretty normal-looking time.

mle2718 commented 11 months ago

I also stepped through a bit of the date_cols function.

dataset<-final_product_lease
names(dataset)
dat<-dataset

# This is taken from date_cols()
date_lgl <- logical(ncol(dat))
names(date_lgl) <- names(dat)
date_funs <- list(lubridate::mdy, lubridate::dmy, lubridate::ymd, 
                  lubridate::ydm, lubridate::dym)
date_helper <- function(dates, fun) {
  dates <- trimws(dates)
  dates <- gsub("\\s\\d{2}:\\d{2}:\\d{2}$", "", dates)
  out <- rlang::expr(!all(is.na(suppressWarnings((!!fun)(!!dates)))))
  eval(out)
}
date_apply <- function(dates) {
  any(purrr::map_lgl(date_funs, function(fun) date_helper(dates, 
                                                          fun)))
}
nr <- nrow(dat)
# if (nr > 1000) 
  dat_slice <- 1000
# else dat_slice <- round(nr * 0.5)
date_cols <- purrr::map_lgl(dat[!numeric_cols(dat, "logical")][seq_len(dat_slice), 
], date_apply)
date_cols <- date_cols[date_cols]

output

> date_cols 
            DATE_TRIP                  Date                  Time                  NAME 
                 TRUE                  TRUE                  TRUE                  TRUE 
             lease_FS scallop_fishing_yearD 
                 TRUE                  TRUE 

I'm not sure what is going on inside those functions though.

BryceMcManus-NOAA commented 11 months ago

The problem is that date_cols() tries to detect date columns by passing them to 5 lubridate conversion functions, two of which (ymd() and ydm()) incorrectly identify "OCS-A 0499 - Atlantic Shores Offshore Wind Projects 1 & 2, LLC's" as a date. I'm not sure why, but it converts it to "0499-02-01".
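As a rough illustration of the kind of guard that could sit in front of the lubridate calls, here is a minimal base R sketch. `looks_like_date()` is a hypothetical helper, not part of FishSET or lubridate, and this is not the fix that was actually applied. It only accepts values made of three digit groups separated by "-", "/" or ".", so it would also reject legitimate formats such as "March 1, 2022".

```r
# Hypothetical helper (not a FishSET function): TRUE only when every
# non-missing value has a numeric date shape such as 2022-03-01 or 3/1/2022,
# optionally followed by an HH:MM:SS time.
looks_like_date <- function(x) {
  x <- trimws(as.character(x))
  x <- x[!is.na(x) & nzchar(x)]          # ignore NA and empty strings
  if (length(x) == 0) return(FALSE)      # nothing left to judge
  pattern <- "^\\d{1,4}[-/.]\\d{1,2}[-/.]\\d{1,4}(\\s\\d{2}:\\d{2}:\\d{2})?$"
  all(grepl(pattern, x, perl = TRUE))
}

date_like <- looks_like_date(c("2022-03-01", "2022-03-02 16:30:00", NA))
lease_like <- looks_like_date(
  "OCS-A 0499 - Atlantic Shores Offshore Wind Projects 1 & 2, LLC's"
)
print(c(date_like = date_like, lease_like = lease_like))
```

Requiring all non-missing values to match, rather than any, is a deliberate choice here: a single odd lease name can no longer flag an entire column as a date.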

The good news is that this is a relatively easy fix. Bad news is that it will require a new install once the changes are made. The only work around I can think of is to change the value of that name so that it doesn't trigger the conversion.

BryceMcManus-NOAA commented 11 months ago

The reason Time is raising a warning is that date_parser() doesn't work on time-only columns (i.e. no calendar date, just time). This just means that additional checks need to be added to load_maindata().
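One possible shape for such a check is sketched below in base R. `is_time_only()` and the `toy` data frame are hypothetical and only illustrate the idea of dropping time-only columns from the candidate list before anything is handed to a date parser.

```r
# Hypothetical helper (not a FishSET function): TRUE when every non-missing
# value looks like HH:MM or HH:MM:SS with no calendar date attached.
is_time_only <- function(x) {
  x <- trimws(as.character(x))
  x <- x[!is.na(x) & nzchar(x)]
  if (length(x) == 0) return(FALSE)
  all(grepl("^\\d{1,2}:\\d{2}(:\\d{2})?$", x, perl = TRUE))
}

# Toy data with one date column and one time-only column (illustrative only)
toy <- data.frame(
  Date = c("2022-03-01", "2022-03-02"),
  Time = c("16:30:00", NA),
  stringsAsFactors = FALSE
)

candidate_cols <- c("Date", "Time")
time_only <- vapply(toy[candidate_cols], is_time_only, logical(1))

# Only columns that are not time-only would go on to date parsing
cols_to_parse <- candidate_cols[!time_only]
print(cols_to_parse)
```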

mle2718 commented 11 months ago

> The problem is that date_cols() tries to detect date columns by passing them to 5 lubridate conversion functions, two of which (ymd() and ydm()) incorrectly identify "OCS-A 0499 - Atlantic Shores Offshore Wind Projects 1 & 2, LLC's" as a date. I'm not sure why, but it converts it to "0499-02-01".
>
> The good news is that this is a relatively easy fix. Bad news is that it will require a new install once the changes are made. The only work around I can think of is to change the value of that name so that it doesn't trigger the conversion.

Very odd. Looks like lubridate is very aggressive about finding dates and times.

I did this:

final_product_lease <- final_product_lease %>%
  mutate(KILOGRAMS = POUNDS/pounds_to_kg,
         LANDED_KG=LANDED/pounds_to_kg) %>%
  mutate(NAME= stringr::str_replace(NAME,"OCS-A 0499", "OCS-A0499"),
         lease_FS=stringr::str_replace(lease_FS,"OCS-A 0499", "OCS-A0499") )

as a workaround.

mle2718 commented 11 months ago

@mchaji -- I've made this change in main here: 4c62d8fa905bda4b68d7d83ed8aa54253493b856 I cherry-picked it over to the scallop_tiny_report branch here: 48ebc54e5c52ce6559ec1a381429c49f4ecfcfe8.

As long as you re-pull, you should pick up this change.