Add haul and specimen samplesize summary calculation example

EmilyMarkowitz-NOAA commented 7 months ago

Issue

This year BS and GOA stock assessors requested summary tables of: 1) how many hauls a type of sample (e.g., lengths, otoliths) was collected at and 2) how many total samples were collected across a survey

In the past, we prepared summary "samplesize" tables with these values (samplesize_nbs.csv and samplesize_ebs.csv in the Bering Sea and possibly by request in GOA/AI(?)) from SQL scripts that relied on RACEBASE and RACE_DATA oracle schemata. Below is an excerpt from samplesize_nbs.csv:

SPECIES_CODE	SPECIES_NAME	COMMON_NAME	YEAR	TOTAL_HAULS	HAULS_W_LENGTH	NUM_LENGTHS	HAULS_W_OTOLITHS	HAULS_W_AGES	NUM_OTOLITHS	NUM_AGES
10115	Reinhardtius hippoglossoides	Greenland turbot	2019	144	6	7	2		3
10115	Reinhardtius hippoglossoides	Greenland turbot	2023	116	2	3	2		3

Solution notes

Rewrite old samplesize scripts to use the`GAP_PRODUCTS.AKFIN_*` tables

We need to modernize these scripts to use the GAP_PRODUCTS.AKFIN_* tables to ensure that only data from standard hauls that have been approved for stock assessments are incorporated into these summary values (aka make sure all data products match). Note that even after updating these summaries, not all summary calculations will be able to be summarized from the GAP_PRODUCTS.AKFIN_* tables and still require a few reference and data tables in RACE_DATA (especially for calculating the number of lengths that were collected).

Decide on sharing platform** for these summary values

Options:

Add an example recreating old samplesize csv's in the gap_products example documentation
- pros: Give users the tools to calculate these summary values themselves
- cons: Users would need access to RACE_DATA (not recommended)
- notes: I was previously hopeful we could do this and previously advocated for this, but don't think we can/should because it requires access to RACE_DATA tables.
Suggest that users submit an issue to the data-request repo for this
- pros: Only a few users request this product and we could then tailor this table to user's needs.
- cons: This workflow may lead to redundancy
Create a new AKFIN_SAMPLESIZE (name negotiable) table
- pros: All values for all possible summaries are prepared and in one place. Users will not need to calculate values or require access to RACE_DATA tables. This table could additionally be useful for preparing data reports and end-of-year ADFG report summaries.
- cons: This becomes another table that has to be created annually.
- notes: I think this is the solution I am currently leaning towards (but could be talked away from).

Decide on the table layout (wide or long) and column names

Here I used the prefix N_ for summary values to show that these are count
I included suffixes of _H and _S to say if the value was the number of hauls samples were collected at or total samples that were collected, respectively. Open to suggestions.
N_WEIGHT(_H), N_COUNT(_H), N_LENGTH(_H), and N_HAUL are redundant to GAP_PRODUCTS.AKFIN_BIOMASS but are included here, at least temporarily, to match with old tables. We could remove these from here or move them from AKFIN_BIOMASS, depending on opinion.

Long (my vote)
- Only includes relevant content and no empty/NULL values. Flexible to different summary types.

SPECIES_CODE	AREA_ID	YEAR	N_SAMPLE_TYPE	N
…	…	…	HAULS	#
…	…	…	COUNT_H	#
…	…	…	WEIGHT_H	#
…	…	…	LENGTHS_H	#
…	…	…	LENGTHS_S	#
…	…	…	GENETICS_H	#
…	…	…	GENETICS_S	#
…	…	…	STOMACHS_H	#
…	…	…	STOMACHS_S	#

Wide (similar to old tables)
- Clunky and includes empty/NULL values. Each new summary type requires a column.

SPECIES_CODE	AREA_ID	YEAR	N_HAULS	N_COUNT_H	N_WEIGHT_H	N_LENGTH_H	N_LENGTH_S	N_OTOLITHS_H	N_OTOLITHS_S	N_AGES_H	N_AGES_S
…	…	…	#	#	#	#	#	#	#	#	#

Tasklist

### Tasks
- [ ] Choose how these data will be shared (e.g., [documentation example](https://github.com/afsc-gap-products/gap_products/blob/main/content/akfin-api-r.qmd)/[data-request repo](https://github.com/afsc-gap-products/data-requests)/new `GAP_PRODUCTS.AKFIN_SAMPLESIZE` table)
- [ ] Decide on table name (`AKFIN_SAMPLESIZE` is used here, but negotiable)
- [ ] Decide on table organization (wide or long)
- [ ] Decide on column names nomenclature (e.g., prefix with `N_`; differentiate between haul counts (something like `_H` and `_S`, respectively?))
- [ ] Review if the summary values are correct and table coverage is meeting need (e.g., types of samples) is meeting need.

First steps

I've started the summary-tbl-ex branch to work on this issue. I've added some initial code for developing long tables with sharing options 1 (documentation example) and 3 (new GAP_PRODUCTS.AKFIN_SAMPLESIZE table).

zoyafuso-NOAA commented 7 months ago

Questions: 1) What is the spatial scale at which these data were provided? Is it all regional estimates or is the idea to provide this by all the spatial scales (stratum, subarea, management area, region)? I think it's regional but just checking to make sure. 2) How many sample types are there? Shouldn't this table only be restricted to core collections? Special projects outside of the core collection are trickier because I don't think those data are in our database (although I would love to know where they live if so). 3) What is GENETICS? Is that like for special projects (e.g., eDNA, fin clips, eyeballs, etc.)? 4) In the wide-example, what is the difference between the NOTOLITH and NAGES fields? 5) Stomach Data: Is the stomach collection data in RACE_DATA somewhere? Would this table distinguish stomachs scanned at sea by stomach lab folks vs stomachs collected at sea by the stomach collector? Would this table tabulate how many stomachs have been processed either in lab or at sea? And would this include the target number of stomachs? Same questions for otoliths (tabulating target number and/or collected otoliths and/or read otoliths). Wouldn't the food habits lab already have/provide this information somewhere (idk shrug emoji). 6) I am also leaning towards providing this as a standalone materialized view that is queried along with the other AKFIN tables. It’s a non-standard request but it doesn’t take a whole lot of finagling to query unlike for example GOA_SPLIT_FRACTIONS. Do you think AKFIN would be able and/or willing to accommodate a table like this? 7) What is the timeline for development on this new table? Do we have a list of folks who have made this request so we can direct their feedback on the new table?

To build on Emily's table example, I would modify it to this:

SURVEY_DEFINITION_ID | AREA_ID | SPECIES_CODE | YEAR | SPECIMEN_TYPE | N_SAMPLE | COUNT -- | -- | -- | -- | -- | -- | -- 98 | 99900 | 10110 | 2022 | LENGTH | | 98 | 99900 | 10110 | 2022 | STOMACH | | 98 | 99900 | 10110 | 2022 | OTOLITH | |

The “N_” prefix has been our convention when denoting haul-level sample sizes (e.g., N_LENGTH is the number of hauls with length data for a given species_code/year/area_id). N_SAMPLE could be the number of hauls where we collected a particular SPECIMEN_TYPE (one of LENGTH, STOMACH, OTOLITH if we restrict ourselves to core collections). We could then use something like COUNT to denote the number of sampled individuals similar to how we use the COUNT field in GAP_PRODUCTS.CPUE to denote the number of individuals (although currently the description for the COUNT field is "Total whole number of individuals caught in haul." so maybe a synonym of COUNT or we need to generalize the field description).

EmilyMarkowitz-NOAA commented 7 months ago

Responding and summing up from yesterday's discussion:

Regional, but I am open to the idea of other area_ids.
For now, we'll focus on core collections (lengths, otoliths, ages), but in the future it would be great to include other special collections. I've added an issue (https://github.com/afsc-gap-products/gap_products/issues/41) about this so we can try and separately tackle quantifying non-core collections in the data processing working group (or something like it). We should take a look at the new special_collections* tables in the future_orale google spreadsheet together I just (very quickly) drafted and make sure we would be able to use them for this samplesize table in the future.
In this case, genetics is just an example. For example, we currently collect genetics fin clips (like those noted here but you are right, there are lots of different genetics collections. That column should really be an ID key for any type of collection or special project that collected it.
N_OTOLITHS is how many otoliths were collected on deck during the survey, and N_AGES is how many otoliths were aged by age and growth.
I don't believe (but could be wrong) that the stomach/age data are available on oracle... possibly not even through the stomach/age folk's oracles. We'll need to follow up with them. The best we currently have (to my knowledge) are the .csv's from the stomach deck tablets.
I think AKFIN is ready and happy to take on any data products that we want to give them. Though I think a materialized view for just us could be fine, too, if we go that way.
For now, we have lots of time. Timeline, if we go with this idea, could be to have this done a year from now? I think this table will become more useful and important the more types of collections are summarized in it.
I like your new table design! Seconded.

zoyafuso-NOAA commented 6 months ago

I am going to backtrack on my response here after thinking about this issue after the discussion here. I now propose that if we restrict this table to just lengths, otoliths, and read otoliths, the table could be kept in a wide format.

This is a draft query to create something that kind of looks like samplesize_nbs.csv and Table 5 in the 2022 Bering Sea Report

This is a preview of the 2022 EBS records of that table:

Yes, I am also jarred by the number of nulls and yes it's not the nicest looking from a database perspective. However, this table would only be treated as a VIEW and works pretty well for summary purposes only so I'm not too miffed on wide vs long here. If the number of quantities of interest was longer (addition of stomachs, genetics, etc.) then that would be a different issue. Additionally but off topic, for AI/GOA purposes the format of this version of the table would be helpful during our otolith collection rule negotiations.

The counts of the lengths are a little bit off if you compare them with the 2022 Bering Sea Report but I commented in the code that the way I queried RACE_DATA.LENGTHS may include hauls with ABUNDANCE_HAUL = 'N', so it's approximate.

EmilyMarkowitz-NOAA commented 5 months ago

Summarizing discussion today between @Ned-Laman-NOAA, @zoyafuso-NOAA, and myself:

1. Use long form of table

AREA_ID only for major survey areas (99900, 99901, 99902, ... 999nn)

AREA_ID	SPECIES_CODE	YEAR	SPECIMEN_SAMPLE_TYPES	N_SAMPLE	COUNT
99900	10110	2022	1 [otolith]	1	2
99900	10110	2022	[length; no code yet]	3	4
99900	10110	2022	[age; no code yet]	5	[NULL]

2. Sample scope

For now, just summarize otolith, length, and age data.

Hopefully, in the near future, we can add

Stomach data with coordination with the stomach group
all SPECIMEN_SAMPLE_TYPES listed in the RACE_DATA.SPECIMEN_SAMPLE_TYPES table/are currently collected by the specimen tablet on deck.

In the far future, it would be great to add

special projects data https://github.com/afsc-gap-products/gap_products/issues/41 (e.g., genetic fin clips, etc.).

3. New column definitions

I've added these definitions to the future_oracle.METADATA_COLUMN tab

N_SAMPLE is similar to N_COUNT, N_HAUL, N_LENGTH, N_SPECIMENS, NWEIGHT where this will identify the total number of hauls where this sample was collected (the sum of positive collection). Namely: "Total number of hauls with positive sample collection. "_
COUNT is the number of samples that were collected. Namely: "Total whole number of individuals caught in haul or samples collected.".
SPECIMEN_SAMPLE_TYPES is derived from the RACE_DATA.SPECIMEN_SAMPLE_TYPES table. We would need to add some specimen sample types to this table. Namely: _"The specimen sample type ID code as defined in the RACE_DATA.SPECIMEN_SAMPLE_TYPES table. For a complete list of Specimen sample type ID codes, review the code books."_

4. Add SPECIMEN_SAMPLE_TYPES to the RACE_DATA.SPECIMEN_SAMPLE_TYPES

At minimum to make this work, we would need to add an AGE and LENGTH specimen type. Is this something @ChrisAnderson-NOAA should do? EDIT: Are there other implications to modifying this table? For example, this table may be used to populate the specimen tablet (just a theory) and we wouldn't want to mess that up. Who would be able to check? Mahesh?

zoyafuso-NOAA commented 5 months ago

Screenshot of query from this commit: 86a511a

EmilyMarkowitz-NOAA commented 5 months ago

Looks great! Thanks for putting this together! But did we decide to use survey_definition_id or area_id? And what's the name of the table? I'm not seeing in oracle, but I could just be missing it.

zoyafuso-NOAA commented 5 months ago

We did talk about it, I don't know whether we came to a resolution. But SURVEY_DEFINITION_ID should suffice given these summaries are at the region level. For Bering Sea purposes, SURVEY_DEFINITION_ID = 98 will only refer to the EBS+NW area.

It's not in Oracle because it is still a draft and we haven't confirmed it yet. For now, just copy and run the query

zoyafuso-NOAA commented 4 months ago

This is now a VIEW in GAP_PRODUCTS. Take a look at it. SELECT * FROM GAP_PRODUCTS.SAMPLESIZE. We should close this issue and create a new one for any future improvements or requests.

Duane-Stevenson-NOAA commented 4 months ago

This looks great, Zack!

Ned-Laman-NOAA commented 4 months ago

Looks good

EmilyMarkowitz-NOAA commented 4 months ago

It looks good. My only comment is that should really use (also use?) codes from SPECIMEN_SAMPLE_TYPES is derived from the RACE_DATA.SPECIMEN_SAMPLE_TYPES table (see comment from Jan 9).

zoyafuso-NOAA commented 4 months ago

I remember this discussion, the issue was that we don't have a sample type id that codes for a length measurement nor do we have a sample type id that distinguishes between a collected otolith and a read otolith. Of course we could add them but I am unsure of what the implications would be w.r.t the infrastructure of the RACE_DATA schema. I get scared when we start thinking about messing around with the RACE_DATA tables.

SELECT * FROM RACE_DATA.SPECIMEN_SAMPLE_TYPES

SPECIMEN_SAMPLE_TYPE_ID	DESCRIPTION
1	Otoliths
2	Scales
3	Otoliths and Scales
4	Scales and Dorsal Fin Spine
5	Dorsal Fin Spine or Ray
6	Opercular Bone
7	Oxytetracycline (OTC) Marked Otolith
8	Length Weight
0	Undefined
9	Vertebra(e)
10	Whole Organism
11	Genetic Tissue and Otoliths
12	Stomach

Regardless, the VIEW as it is currently resolves the initial issue posted. So I think we can close it and then make another issue for maybe improvements to the SPECIMEN_SAMPLE_TYPES table (and that discussion can live there), which could then set up another issue to improve this VIEW.

Ned-Laman-NOAA commented 4 months ago

Aged otoliths are indicated by a record with specimen_sample_type = 1 where age is not null, right? I can't think if a reason we'd want to know about failed ageing attempts; that sounds like age&growth lab data. If we need a code for "length" as different from "length weight" then we should simply be able to add another line to the list of specimen_sample_types without breaking anything.

zoyafuso-NOAA commented 4 months ago

I guess stock assessors are interested in the number of aged otoliths whereas we're concerned with that number as well as the total number of otolith collected (regardless if they were aged or not). That latter value is what is presented to the Plan Team Meetings and reported in at least the Bering Sea survey Data Report Tech Memo (not sure if the AIGOA report that as well).

Glad to know that the length cade can be added fairly painlessly.

Ned-Laman-NOAA commented 4 months ago

Then I stand by my original statement. We don't need a code that indicates an otolith was aged in our specimen table lookup. Aged will be indicated by a not null age for an otolith specimen. Look at it this way. We'd need to get the code for "aged" from Age&Growth because there are some otoliths they try to age and fail. We'd have to use their code to distinguish between otoliths they didn't age (null age) and otoliths the failed to age (null age). Maybe this was obvious and didn't need repeating, but in the interest of thoroughness...

afsc-gap-products / gap_products