AquaAuma / FishGlob_data

Database and methods related to the manuscript "An integrated database of fish biodiversity sampled with scientific bottom trawl surveys"
Creative Commons Attribution 4.0 International

Possible issues in `haul_dur`, `num` and `cpue` related columns #24

Closed · edwardlavender closed this issue 9 months ago

edwardlavender commented 1 year ago

I was having a quick look at the CPUE-related variables & I noticed likely issues in multiple columns, unless I am doing something silly:

I haven't checked the _cpua columns, but it seems likely that these issues will propagate & affect those too.

It might be worth adding tests for negative numbers, whole numbers vs. decimals, numerical ranges/unique values, and additional mathematical checks (e.g., num/haul_dur, num/haul_dur/area_swept, etc.) to your quality-checking routines for the relevant columns (or double-checking your existing tests).
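As an illustration, a minimal range check of the kind suggested above might look like this (the function name and thresholds are illustrative, not part of the FishGlob code):

```r
## Minimal sketch of a column range check on dummy data
check_range <- function(x, min_ok = 0, max_ok = Inf) {
  x <- x[!is.na(x)]
  all(x >= min_ok & x <= max_ok)
}
check_range(c(10, 25, 3))               # TRUE: all within [0, Inf)
check_range(c(10, -9, 3))               # FALSE: -9 missing-value code present
check_range(c(0.5, 30), max_ok = 24)    # FALSE: a haul_dur over 24 hours
```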

## Load data
load(url("https://github.com/AquaAuma/FishGlob_data/blob/main/outputs/Compiled_data/FishGlob_public_clean.RData?raw=true"))
d <- as.data.frame(data)

## Select relative abundance metric (e.g., num{_cpue} or wgt{_cpue})
metric <- c("num", "wgt")
metric <- metric[1] # metric[2]

## Check example columns
# Check haul duration, which varies from 0.01 to > 24 hours
# ... is there an issue with units here? 
range(d$haul_dur, na.rm = TRUE)
# Check selected metric column:
# ... For num, we have many negative numbers (but not weights)
range(d[, metric], na.rm = TRUE)  
# Check the first few unique values
# ... For num, we have decimals - is this problematic?
head(sort(unique(d[, metric]))) 

## Calculate CPUE by number or weight and compare to dataset column
# NB: These calculations may be affected by issues in both num and haul_dur
# ... columns identified above.
d$ra   <- d[, paste0(metric, "_cpue")]
d$ra_2 <- d[, metric]/d$haul_dur

## Compare calculated and provided values: they should be identical. 
isTRUE(all.equal(d$ra, d$ra_2))

## Isolate problematic data
tol <- 1e-6
cols <- c("survey", "haul_id", "accepted_name", metric, "haul_dur", "ra", "ra_2")
issues <- d[which(abs(d$ra - d$ra_2) > tol), cols]
issues$delta <- issues$ra_2 - issues$ra

## Examine problematic data
head(issues)
tail(issues)

## Visualise problematic data & get ranges
pp <- par(mfrow = c(2, 2))
hist(issues[, metric]); range(issues[, metric], na.rm = TRUE)
hist(issues$ra); range(issues$ra)
hist(issues$ra_2); range(issues$ra_2)
hist(issues$delta); range(issues$delta)
par(pp)

## Pull out select examples with very high CPUE
# We have num_cpue in the hundreds of millions... 
head(issues[order(issues$ra, decreasing = TRUE), ])

Hope this helps!

edwardlavender commented 1 year ago

Relatedly, there are sometimes substantial discrepancies between num and wgt.

This is easiest to see by pulling out the largest num values for each species:

library(dplyr)
load(url("https://github.com/AquaAuma/FishGlob_data/blob/main/outputs/Compiled_data/FishGlob_public_clean.RData?raw=true"))
d <- as.data.frame(data)
d |> 
  group_by(accepted_name) |> 
  # Pull out largest numbers 
  filter(num == max(num)) |>
  ungroup() |>
  select(accepted_name, max_num = num, wgt = wgt) |>
  # Calculate implied individual weight in grams
  mutate(wgt_per_id_in_grams = wgt/max_num * 1000) |>
  arrange(desc(max_num))

E.g., 223,298 Prionotus paralatus individuals apparently weigh only 4.87 kg in total (0.0218 g each).

tomjwebb commented 1 year ago

First I just want to say how much I appreciate this data product that is going to prove extremely useful for all kinds of applications!

But just picking up on @edwardlavender's point about negative abundance values: we have also spotted these. I think they ultimately result from ICES using -9 as a missing-value code (I'm trying to get confirmation of this, but it is something I have come across for other variables in ICES data before). This will then propagate into a range of negative values when converted to CPUE or CPUA, and most likely when aggregating to species level too (i.e. combining size classes). To give an example, after loading the compiled data FishGlob_public_std_clean.RData, the object `data` contains the full database; then:

data %>% filter(num < 0) %>% nrow()

This shows 7561 negative abundances. These occur in the following surveys (here counted via num_cpua):

> data %>% filter(num_cpua < 0) %>% count(survey)
# A tibble: 6 × 2
  survey       n
  <chr>    <int>
1 BITS      1841
2 EVHOE       14
3 FR-CGFS   2041
4 NS-IBTS   3433
5 PT-IBTS    229
6 SWC-IBTS     3
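The propagation into a range of negative values (rather than a single recognisable sentinel) can be seen with dummy numbers: a constant -9 code divided by varying haul durations yields a spread of negative CPUEs.

```r
## Dummy illustration: a constant -9 missing-value code divided by
## varying haul durations produces varied negative CPUE values
num      <- c(-9, -9, -9)
haul_dur <- c(0.5, 1, 2)   # hours
num / haul_dur             # -18 -9 -4.5
```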

We can cross-check some of these using the ICES data directly - here using the icesDatras package, and extracting data for one survey where I can see there are negative values:

# check using ICES data directly
library(dplyr)
library(icesDatras)
library(icesVocab)

bits_2001_1 <- getHLdata(survey = "BITS", year = 2001, quarter = 1) %>%
  as_tibble()

bits_2001_1 %>% 
  filter(TotalNo < 0) %>% 
  select(Survey, Quarter, Ship, HaulNo, SpecCode, TotalNo)

Gives

   Survey Quarter Ship  HaulNo SpecCode TotalNo
   <chr>    <int> <chr>  <int>    <int>   <dbl>
 1 BITS         1 AA36      20   127123      -9
 2 BITS         1 AA36      20   127203      -9
 3 BITS         1 AA36      20   126450      -9
 4 BITS         1 AA36      20   126736      -9
 5 BITS         1 AA36       2   127141      -9
 6 BITS         1 AA36       2   127123      -9
 7 BITS         1 AA36       2   126417      -9
 8 BITS         1 AA36      20   127214      -9
 9 BITS         1 AA36      20   127141      -9
10 BITS         1 AA36      20   127149      -9
11 BITS         1 AA36      20   126436      -9
12 BITS         1 AA36      20   126417      -9
13 BITS         1 AA36       2   126436      -9
14 BITS         1 AA36       2   127214      -9

Summarising to species level (by haul) in this case does not cause any problems (total species-level abundances are either -9 or positive), but I guess in some cases it may lead to combining -9 with positive abundances, which would make the simple solution of just filtering out negative abundances problematic.
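The risk of combining -9 with positive abundances can be shown on dummy data: summing number-at-length when one length class carries the -9 code silently corrupts the species total, whereas recoding -9 to NA first preserves it.

```r
## Dummy illustration: one length class carries the -9 missing-value code
n_at_length <- c(12, 30, -9, 5)   # counts per length class, one species
sum(n_at_length)                  # 38: silently wrong total

## Recoding -9 to NA before summing avoids the corruption
n_clean <- replace(n_at_length, n_at_length == -9, NA)
sum(n_clean, na.rm = TRUE)        # 47: correct total of the measured classes
```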

tomjwebb commented 1 year ago

Confirmed from ICES:

What does “-9” stand for? Many fields in DATRAS records can have -9 instead of a code or value. -9 is not a real value, but an agreed code to represent an empty field, it is used by data submitters when no data available for the field or when the field is irrelevant for the given survey.

Page 9 (sadly not -9…) here https://www.ices.dk/data/Documents/DATRAS/DATRAS_FAQs.pdf - and thanks to the @ICES_ASC twitter account for being responsive!

AquaAuma commented 1 year ago

Thank you both for looking into this! I'll get back to this very soon and will fix and update the dataset/code.

AquaAuma commented 1 year ago

@tomjwebb do we know whether it might happen that, for one species in one haul, we find both positive abundances and "-9" values? I'm wondering whether we might have aggregated positive and negative abundances, or whether "-9" is only used for empty values of species that have no other record in that specific haul.

tomjwebb commented 1 year ago

@AquaAuma I'm not sure; there is minimal documentation of what causes a missing value to be entered. As you say, it becomes a more complicated issue if positive abundances risk being aggregated with -9. I will have a look and see if I can find any examples, but I'm unsure how that could be done systematically. It might be worth looking at previous data products built on the ICES data (e.g. https://data.marine.gov.scot/dataset/manual-version-3-groundfish-survey-monitoring-and-assessment-data-product and updates such as https://data.cefas.co.uk/view/21421) to see if this issue has previously been addressed.

AquaAuma commented 1 year ago

I tried yesterday, and I did not encounter hauls in which taxa are assigned both positive and -9 values. It seems to happen when individual lengths are not measured, but total number of individuals/weights are still reported. I had kept this in the data originally, but never assigned the abundance/weight reported at the taxa level because I only focused on recalculated weights from abundance at length information. This seems to concern specific taxa in the surveys which might be important (or not), and in some cases it's very minor. I am not sure what the best way to handle this in the dataset is. In any case, this will need to be flagged as something for users to be aware of.

The number of hauls for which this happens:

- EVHOE: 13 out of 3283
- PT-IBTS: 168 out of 1179
- BITS: 649 out of 10423
- FR-CGFS: 1129 out of 2071 (consistently the same few taxa)
- NS-IBTS: 1229 out of 25217
- SWC-IBTS: 3 out of 3850

tomjwebb commented 1 year ago

Yes, I think you're correct: if you use TotalNo for haul-level species abundances, it should be either a positive value or -9, so there will be no issues from aggregating negative and positive values. The issue would potentially arise if you were deriving total numbers in a haul by summing the haul number-at-length, as a missing value for a single length class would result in summing -9 with positive counts. Maybe the best thing is to add a QC flag identifying records with negative abundances? They could then still be used to record the presence of a species, but could easily be filtered out by users wanting to analyse abundance.

AquaAuma commented 1 year ago

To keep all fish taxa presence-absence records, I will be:

- re-including taxa with SpecCode equal to 5, 6, or 8, to include all presence-absence records in the DATRAS dataset
- transforming negative values into NA
- adding a specific flag in the dataset for this
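On dummy data, that cleaning step might look like the sketch below (the flag column name `flag_neg` is hypothetical, not the name used in FishGlob):

```r
## Sketch of the proposed cleaning step on dummy data
d <- data.frame(num = c(10, -9, 25), wgt = c(1.2, -9, 3.4))

## Flag records carrying a -9 code before recoding
d$flag_neg <- (!is.na(d$num) & d$num < 0) | (!is.na(d$wgt) & d$wgt < 0)

## Transform negative values into NA, keeping the rows as presence records
d$num[d$num < 0] <- NA
d$wgt[d$wgt < 0] <- NA
```

Users analysing abundance can then drop flagged rows, while presence-absence analyses keep them.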

AquaAuma commented 1 year ago

I've fixed the -9 codes throughout the entire DATRAS R code, so no negative values should appear (only positive values or NAs). Keeping the issue open because I haven't yet double-checked the ranges/correspondence between columns.

AquaAuma commented 9 months ago

@mpinsky @zoekitchel do you know why there might be mismatches between num and wgt in the North American surveys? Ed showed an example of Prionotus paralatus (a GMEX species, it seems). It seems like an ongoing issue we might keep, but I don't have much time to look into this.

mpinsky commented 9 months ago

These aren't necessarily mismatches, and it would require asking the data providers to sort out what happened and why. The trawls sometimes catch large aggregations of small individuals, so such values are not necessarily impossible. I would suggest making sure the numbers are in the raw data files we got from the data providers and, if so, keeping them in the dataset.

zoekitchel commented 9 months ago

I can look into the raw data for GMEX!


zoekitchel commented 9 months ago

Just traced Ed's observation for Prionotus paralatus in haul ID 004174480632793 back to the original files we have from the Summer SEAMAP Groundfish Survey.

mpinsky commented 9 months ago

I think that means we keep it, barring other information from the data providers?

AquaAuma commented 9 months ago

I agree, no need to change the data. We could always ask Ed for more information and which surveys seem problematic overall, but I don't have more time to spend on this issue.
