DFO-NOAA-Pacific / gfsynopsis-noaa

Code for extending the gfsynopsis report idea from DFO to Bering Sea / Gulf of Alaska / West Coast of USA
https://dfo-noaa-pacific.github.io/gfsynopsis-noaa/
GNU Affero General Public License v3.0

How should region-specific biological data be processed for inclusion in the package? #2

Open chantelwetzel-noaa opened 4 months ago

chantelwetzel-noaa commented 4 months ago

I have been reviewing the code for pulling, processing, and plotting AFSC survey data created and shared by @MattCallahan-NOAA. Based on the README in the akfingapdata repository, it appears that the pulling and processing of catch and biological data are done within a single function (or the data stored in the database have already been expanded) by get_gap_biomass(), get_gap_sizecomp(), and get_gap_agecomp(). Is this correct?
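
Something like the following is what I am picturing; the exact argument names, species code, and area code here are guesses from the README, not tested calls:

```r
# Sketch of the pull described above; argument names, species code, and area
# code are guesses from the akfingapdata README, not verified calls.
library(akfingapdata)

biomass  <- get_gap_biomass(species_code = 21740, area_id = 99903)
sizecomp <- get_gap_sizecomp(species_code = 21740, area_id = 99903)
agecomp  <- get_gap_agecomp(species_code = 21740, area_id = 99903)
```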

Here is a description of how the NWFSC survey data are stored and processed (a rough code sketch follows the list):

  1. Raw catch and biological sample data are stored in the data warehouse and can be pulled using the pull_catch() and pull_bio() functions in the nwfscSurvey package.
  2. A species-specific stratification by latitude and depth is specified.
  3. The raw data and the selected stratification are then used to calculate a species-specific design-based index of biomass, with uncertainty, by year using Biomass.fn().
  4. The raw biological size or age composition data are then expanded up to the tow level and the stratification area using SurveyAFs.fn(). The output of this function is a formatted matrix of proportions by year, sex, and size/age bin.
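
A rough sketch of that workflow in code is below; the species, strata breaks, and some argument names are placeholders for illustration only, not a vetted recipe:

```r
# Rough sketch of the NWFSC workflow described above using nwfscSurvey.
# Species, strata breaks, and some argument names are illustrative assumptions.
library(nwfscSurvey)

# 1. Pull raw catch and biological sample data from the data warehouse
catch <- pull_catch(common_name = "lingcod", survey = "NWFSC.Combo")
bio   <- pull_bio(common_name = "lingcod", survey = "NWFSC.Combo")

# 2. Species-specific stratification by latitude and depth (example breaks only)
strata <- CreateStrataDF.fn(
  names          = c("shallow", "deep"),
  depths.shallow = c(55, 183),
  depths.deep    = c(183, 549),
  lats.south     = c(32.0, 32.0),
  lats.north     = c(49.0, 49.0)
)

# 3. Design-based index of biomass with uncertainty by year
index <- Biomass.fn(dat = catch, strat.df = strata)

# 4. Expand size/age composition data up to the tow and stratum level
ages <- SurveyAFs.fn(datA = bio, datTows = catch, strat.df = strata)
```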

In my mind, there are a few different potential pathways here:

Once we have decided on an approach, we can dig into creating unified data frames.

MattCallahan-NOAA commented 4 months ago

Hi Chantel. The AFSC Groundfish Assessment Program (GAP) calculates agecomps, sizecomps, and biomass indices from raw data in their database using the gapindex package. These indices and other ready-for-stock-assessment data are then available in the gap_products schema on their Oracle database. GAP also transfers the gap_products schema to the Alaska Fisheries Information Network (AKFIN, my employer) for distribution. I created an API for each gap_products table, and the akfingapdata package is a wrapper for those APIs, with each function pulling the data from one table by species and area (or the whole table for smaller tables).
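
In case it helps, a minimal sketch of that access pattern is below; the authentication helper and argument names are from memory of the README and may differ, so treat them as assumptions and check the README for the real signatures:

```r
# Minimal sketch of the access pattern described above: authenticate once with
# AKFIN, then pull one gap_products table per function call.
# Helper and argument names are assumptions, not verified against the package.
library(akfingapdata)

# token <- create_token("akfin_api_token.txt")  # assumed authentication step

# one function == one gap_products table, filtered by species and area
biomass <- get_gap_biomass(
  survey_definition_id = 47,     # assumed code for the Gulf of Alaska survey
  area_id              = 99903,  # assumed example area code
  species_code         = 21740,  # assumed example species code
  start_year           = 1990,
  end_year             = 2024
)
```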

Since GAP has already done the heavy lifting of calculating indices and vetting specimens/lengths/catch/etc. in the gap_products framework, I think it makes sense to use that for these visualizations in Alaska.

kellijohnson-NOAA commented 4 months ago

I think that all regions have put in a lot of work to create code and data, and that part of the process here is working towards shared code and similar data structures. It would be really great to slowly work towards a shared set of functions to process the data. This would (1) reduce the amount of code needed to make this effort happen, (2) lead to fewer errors in code because more eyes would be using and reviewing it, and (3) create a process of working together on more than just plots. I understand if that cannot happen at this stage, but at a minimum I think we should have a conversation about what the "input" data here look like.
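
To make that conversation concrete, here is one purely hypothetical shape a shared "input" index data frame could take; none of these column names are decided, they are just a strawman to react to:

```r
# Purely hypothetical example of a shared "input" data frame for the
# design-based index, regardless of which regional package produced it.
# Column names and the example row are placeholders for discussion only.
index_input <- data.frame(
  region      = "NWFSC",        # e.g., "AFSC", "NWFSC"
  survey      = "NWFSC.Combo",
  species     = "lingcod",
  year        = 2019L,
  estimate_mt = 12345.6,        # biomass estimate
  cv          = 0.21            # uncertainty on the estimate
)
str(index_input)
```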