AquaAuma / FishGlob_data

Database and methods related to the manuscript "An integrated database of fish biodiversity sampled with scientific bottom trawl surveys"
Creative Commons Attribution 4.0 International

Possible issues in `haul_dur`, `num` and `cpue` related columns #24

Closed · edwardlavender closed this issue 9 months ago

edwardlavender commented 1 year ago

I was having a quick look at the CPUE-related variables & I noticed likely issues in multiple columns, unless I am doing something silly:

I haven't checked the _cpua columns, but it seems likely that these issues will propagate & affect those too.

It might be worth adding tests for negative numbers, whole numbers vs. decimals, numerical ranges/unique values, and additional mathematical checks (e.g., num/haul_dur, num/haul_dur/area_swept, etc.) to your quality-checking routines for the relevant columns (or double-checking your existing tests).
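As an illustration, a minimal range check of the kind suggested above might look like this (the function name and thresholds are illustrative, not part of the FishGlob code):

```r
## Minimal sketch of a column range check on dummy data
check_range <- function(x, min_ok = 0, max_ok = Inf) {
  x <- x[!is.na(x)]
  all(x >= min_ok & x <= max_ok)
}
check_range(c(10, 25, 3))               # TRUE: all within [0, Inf)
check_range(c(10, -9, 3))               # FALSE: -9 missing-value code present
check_range(c(0.5, 30), max_ok = 24)    # FALSE: a haul_dur over 24 hours
```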

## Load data
load(url("https://github.com/AquaAuma/FishGlob_data/blob/main/outputs/Compiled_data/FishGlob_public_clean.RData?raw=true"))
d <- as.data.frame(data)

## Select relative abundance metric (e.g., num{_cpue} or wgt{_cpue})
metric <- c("num", "wgt")
metric <- metric[1] # metric[2]

## Check example columns
# Check haul duration, which varies from 0.01 to > 24 hours
# ... is there an issue with units here? 
range(d$haul_dur, na.rm = TRUE)
# Check selected metric column:
# ... For num, we have many negative numbers (but not weights)
range(d[, metric], na.rm = TRUE)  
# Check the first few unique values
# ... For num, we have decimals - is this problematic?
head(sort(unique(d[, metric]))) 

## Calculate CPUE by number or weight and compare to dataset column
# NB: These calculations may be affected by issues in both num and haul_dur
# ... columns identified above.
d$ra   <- d[, paste0(metric, "_cpue")]
d$ra_2 <- d[, metric]/d$haul_dur

## Compare calculated and provided values: they should be identical. 
isTRUE(all.equal(d$ra, d$ra_2))

## Isolate problematic data
tol <- 1e-6
cols <- c("survey", "haul_id", "accepted_name", metric, "haul_dur", "ra", "ra_2")
issues <- d[which(abs(d$ra - d$ra_2) > tol), cols]
issues$delta <- issues$ra_2 - issues$ra

## Examine problematic data
head(issues)
tail(issues)

## Visualise problematic data & get ranges
pp <- par(mfrow = c(2, 2))
hist(issues[, metric]); range(issues[, metric], na.rm = TRUE)
hist(issues$ra); range(issues$ra)
hist(issues$ra_2); range(issues$ra_2)
hist(issues$delta); range(issues$delta)
par(pp)

## Pull out select examples with very high CPUE
# We have num_cpue in the hundreds of millions... 
head(issues[order(issues$ra, decreasing = TRUE), ])

Hope this helps!

edwardlavender commented 1 year ago

Relatedly, there are sometimes substantial discrepancies between num and wgt.

This is easiest to see by pulling out the largest num values for each species:

library(dplyr)
load(url("https://github.com/AquaAuma/FishGlob_data/blob/main/outputs/Compiled_data/FishGlob_public_clean.RData?raw=true"))
d <- as.data.frame(data)
d |> 
  group_by(accepted_name) |> 
  # Pull out largest numbers 
  filter(num == max(num)) |>
  ungroup() |>
  select(accepted_name, max_num = num, wgt = wgt) |>
  # Calculate implied individual weight in grams
  mutate(wgt_per_id_in_grams = wgt/max_num * 1000) |>
  arrange(desc(max_num))

E.g., 223,298 Prionotus paralatus individuals apparently weigh only 4.87 kg in total (0.0218 g each).

tomjwebb commented 1 year ago

First I just want to say how much I appreciate this data product that is going to prove extremely useful for all kinds of applications!

But just picking up on @edwardlavender's point about negative abundance values: we have also spotted these. I think they ultimately result from ICES using -9 as a missing-value code (I'm trying to get confirmation of this, but it is something I have come across for other variables in ICES data before). This will then propagate into a range of negative values when converted to CPUE or CPUA, and most likely when aggregating to species level too (i.e. combining size classes). To give an example, after loading the compiled data FishGlob_public_std_clean.RData, the object `data` contains the full database; then:

data %>% filter(num < 0) %>% nrow()

This shows 7561 negative abundances. These occur in the following surveys (here counted via num_cpua):

> data %>% filter(num_cpua < 0) %>% count(survey)
# A tibble: 6 × 2
  survey       n
  <chr>    <int>
1 BITS      1841
2 EVHOE       14
3 FR-CGFS   2041
4 NS-IBTS   3433
5 PT-IBTS    229
6 SWC-IBTS     3
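The propagation into a range of negative values (rather than a single recognisable sentinel) can be seen with dummy numbers: a constant -9 code divided by varying haul durations yields a spread of negative CPUEs.

```r
## Dummy illustration: a constant -9 missing-value code divided by
## varying haul durations produces varied negative CPUE values
num      <- c(-9, -9, -9)
haul_dur <- c(0.5, 1, 2)   # hours
num / haul_dur             # -18 -9 -4.5
```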

We can cross-check some of these using the ICES data directly - here using the icesDatras package, and extracting data for one survey where I can see there are negative values:

# check using ICES data directly
library(dplyr)
library(icesDatras)
library(icesVocab)

bits_2001_1 <- getHLdata(survey = "BITS", year = 2001, quarter = 1) %>%
  as_tibble()

bits_2001_1 %>% 
  filter(TotalNo < 0) %>% 
  select(Survey, Quarter, Ship, HaulNo, SpecCode, TotalNo)

Gives

   Survey Quarter Ship  HaulNo SpecCode TotalNo
   <chr>    <int> <chr>  <int>    <int>   <dbl>
 1 BITS         1 AA36      20   127123      -9
 2 BITS         1 AA36      20   127203      -9
 3 BITS         1 AA36      20   126450      -9
 4 BITS         1 AA36      20   126736      -9
 5 BITS         1 AA36       2   127141      -9
 6 BITS         1 AA36       2   127123      -9
 7 BITS         1 AA36       2   126417      -9
 8 BITS         1 AA36      20   127214      -9
 9 BITS         1 AA36      20   127141      -9
10 BITS         1 AA36      20   127149      -9
11 BITS         1 AA36      20   126436      -9
12 BITS         1 AA36      20   126417      -9
13 BITS         1 AA36       2   126436      -9
14 BITS         1 AA36       2   127214      -9

Summarising to species level (by haul) in this case does not cause any problems (total species-level abundances are either -9 or positive), but I guess in some cases it may lead to combining -9 with positive abundances, which would make the simple solution of just filtering out negative abundances problematic.
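The risk of combining -9 with positive abundances can be shown on dummy data: summing number-at-length when one length class carries the -9 code silently corrupts the species total, whereas recoding -9 to NA first preserves it.

```r
## Dummy illustration: one length class carries the -9 missing-value code
n_at_length <- c(12, 30, -9, 5)   # counts per length class, one species
sum(n_at_length)                  # 38: silently wrong total

## Recoding -9 to NA before summing avoids the corruption
n_clean <- replace(n_at_length, n_at_length == -9, NA)
sum(n_clean, na.rm = TRUE)        # 47: correct total of the measured classes
```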

tomjwebb commented 1 year ago

Confirmed from ICES:

What does “-9” stand for? Many fields in DATRAS records can have -9 instead of a code or value. -9 is not a real value, but an agreed code to represent an empty field, it is used by data submitters when no data available for the field or when the field is irrelevant for the given survey.

Page 9 (sadly not -9…) here https://www.ices.dk/data/Documents/DATRAS/DATRAS_FAQs.pdf - and thanks to the @ICES_ASC twitter account for being responsive!

AquaAuma commented 1 year ago

Thank you both for looking into this! I'll get back to this very soon and will fix and update the dataset/code.

AquaAuma commented 1 year ago

@tomjwebb do we know whether it might happen that, for one species in one haul, we find both positive abundances and "-9" values? I'm wondering whether we might have aggregated positive and negative abundances, or whether "-9" is only used for empty values of species that have no other record in that specific haul.

tomjwebb commented 1 year ago

@AquaAuma I'm not sure; there is minimal documentation of what causes a missing value to be entered. As you say, it becomes a more complicated issue if positive abundances risk being aggregated with -9. I will have a look and see if I can find any examples, but I'm unsure how that could be done systematically. It might be worth looking at previous data products built on the ICES data (e.g. https://data.marine.gov.scot/dataset/manual-version-3-groundfish-survey-monitoring-and-assessment-data-product and updates such as https://data.cefas.co.uk/view/21421) to see if this issue has previously been addressed.

AquaAuma commented 1 year ago

I tried yesterday, and I did not encounter hauls in which taxa are assigned both positive and -9 values. It seems to happen when individual lengths are not measured, but total number of individuals/weights are still reported. I had kept this in the data originally, but never assigned the abundance/weight reported at the taxa level because I only focused on recalculated weights from abundance at length information. This seems to concern specific taxa in the surveys which might be important (or not), and in some cases it's very minor. I am not sure what the best way to handle this in the dataset is. In any case, this will need to be flagged as something for users to be aware of.

The number of hauls for which this happens:

- EVHOE: 13 out of 3283
- PT-IBTS: 168 out of 1179
- BITS: 649 out of 10423
- FR-CGFS: 1129 out of 2071 (consistently the same few taxa)
- NS-IBTS: 1229 out of 25217
- SWC-IBTS: 3 out of 3850

tomjwebb commented 1 year ago

Yes, I think you're correct: if you use TotalNo for haul-level species abundances, it should be either a positive value or -9, so there will be no issues from aggregating negative and positive values. The issue would potentially arise if you were deriving total numbers in a haul by summing the haul number-at-length, as a missing value for a single length class would result in summing -9 with positive counts. Maybe the best thing is to add a QC flag identifying records with negative abundances? They could then still be used to record the presence of a species, but could easily be filtered out by users wanting to analyse abundance.

AquaAuma commented 1 year ago

To keep all fish taxa presence-absence records, I will be:

- re-including taxa with SpecCode equal to 5, 6, or 8, to include all presence-absence records in the DATRAS dataset
- transforming negative values into NA
- adding a specific flag in the dataset for this
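On dummy data, that cleaning step might look like the sketch below (the flag column name `flag_neg` is hypothetical, not the name used in FishGlob):

```r
## Sketch of the proposed cleaning step on dummy data
d <- data.frame(num = c(10, -9, 25), wgt = c(1.2, -9, 3.4))

## Flag records carrying a -9 code before recoding
d$flag_neg <- (!is.na(d$num) & d$num < 0) | (!is.na(d$wgt) & d$wgt < 0)

## Transform negative values into NA, keeping the rows as presence records
d$num[d$num < 0] <- NA
d$wgt[d$wgt < 0] <- NA
```

Users analysing abundance can then drop flagged rows, while presence-absence analyses keep them.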

AquaAuma commented 1 year ago

I've fixed the -9 codes throughout the entire DATRAS R code, so no negative values should appear (only positive values or NAs). Keeping the issue open because I haven't yet double-checked the ranges/correspondence between columns.

AquaAuma commented 9 months ago

@mpinsky @zoekitchel do you know why there might be mismatches between num and wgt in the North American surveys? Ed showed an example of Prionotus paralatus (a GMEX species, it seems). It seems like an ongoing issue we might keep, but I don't have much time to look into this.

mpinsky commented 9 months ago

These aren't necessarily mismatches, and it would require asking the data providers to sort out what happened and why. The trawls sometimes catch large aggregations of small individuals, so such values are not necessarily impossible. I would suggest making sure the numbers are in the raw data files we got from the data providers and, if so, keeping them in the dataset.

zoekitchel commented 9 months ago

I can look into the raw data for GMEX!


zoekitchel commented 9 months ago

Just traced Ed's observation for Prionotus paralatus in haul ID 004174480632793 back to the original files we have from the Summer SEAMAP Groundfish Survey.

mpinsky commented 9 months ago

I think that means we keep it, barring other information from the data providers?

AquaAuma commented 9 months ago

I agree, no need to change the data. We could always ask Ed for more information and which surveys seem problematic overall, but I don't have more time to spend on this issue.
