afsc-gap-products / gap_products

This repository supports code used to create tables in the GAP_PRODUCTS Oracle schema. These tables include the master production tables, tables shared with AKFIN, and tables publicly shared on FOSS.
https://afsc-gap-products.github.io/gap_products/
Creative Commons Zero v1.0 Universal
5 stars 5 forks source link

Discrepancies between gap_products and legacy tables #30

Closed MattCallahan-NOAA closed 4 months ago

MattCallahan-NOAA commented 4 months ago

Issue

GAP_PRODUCTS users may find differences in index value for a given, stratum, species, and year combo between the old legacy tables and the new tables. Lee's issue raised at office hours yesterday may be an example of this.

In order to help identify discrepancies, @jeanlee-akfin wrote an example query finding records with different AI population numbers between the old and new tables. I modified it to run on the AFSC oracle database and limited to records in which both old and new tables had values (There are also a lot of instances in which records are present in one set but not the other). This could be applied to other regions, and tables like sizecomp or agecomp. Stock assessors likely won't care about a lot of these instances involving obscure species but sometimes discrepancies involve commercially important groundfish (e.g. AI strata 313 in 2022). Hopefully this is helpful.

WITH LEGACY AS ( SELECT YEAR, SPECIES_CODE, STRATUM, SURVEY, STRATUM_POP
--FROM AFSC.RACE_BIOMASSSTRATUMAIGOA --Table on AKFIN FROM AI.BIOMASS_STRATUM WHERE SURVEY = 'AI' AND YEAR BETWEEN 2004 AND 2022 ORDER BY YEAR), NEWTAB AS (SELECT YEAR, SPECIES_CODE, a.AREA_ID AS STRATUM, a.SURVEY_DEFINITION_ID, POPULATION_COUNT AS STRATUM_POP FROM gap_products.akfin_biomass a LEFT JOIN GAP_PRODUCTS.AKFIN_AREA b ON a.SURVEY_DEFINITION_ID = b.SURVEY_DEFINITION_ID AND a.AREA_ID = b.AREA_ID WHERE a.survey_definition_id = 52 AND a.year BETWEEN 2004 AND 2022 AND b.AREA_TYPE = 'STRATUM' AND a.AREA_ID NOT IN (793, 794)) -- soon to be resolved duplicates, issue 28 SELECT * FROM (SELECT COALESCE (A.YEAR, B.YEAR) AS YEAR, COALESCE (A.SPECIES_CODE, B.SPECIES_CODE) AS SPECIES_CODE, COALESCE (A.STRATUM, B.STRATUM) AS STRATUM, A.STRATUM_POP AS OLD_VAL, B.STRATUM_POP AS NEW_VAL, CASE WHEN NVL (A.STRATUM_POP, 0) <> NVL (B.STRATUM_POP, 0) THEN 1 ELSE 0 END AS MISMATCH FROM LEGACY A FULL OUTER JOIN NEWTAB B ON A.YEAR = B.YEAR AND A.SPECIES_CODE = B.SPECIES_CODE AND A.STRATUM = B.STRATUM) WHERE MISMATCH = 1 and old_val is not null and new_val is not null and new_val >0;

jeanlee-akfin commented 4 months ago

Thanks @MattCallahan-NOAA...just wanted to note I'm aware there is an error in the NEWTAB subquery because the join between akfin_biomass and akfin_area tables should go through the akfin_survey_design table. I'll updated with a corrected query.

zoyafuso-NOAA commented 4 months ago

This is fantastic, thanks for this. I'm gonna put this piece of code in my shirt pocket for later. I am not sure how to resolve this issue except to give a long-winded response as you'll see below.

We have internally discussed in the past about how to communicate all of these discrepancies and at what level of detail to do so. The way that we have dealt with it in the development of GAP_PRODUCTS has been on two fronts:

  1. Appendix B on the Data Changes Brief gives very high-level reasons as to why users may see differences in the historical and GAP_PRODUCTS tables for their stock of choice. This was the product of early bridging efforts in the development of the gapindex package. However, it does not give users specific clarity as to why they are seeing discrepancies for their own case uses.
  2. EBS/NBS draft data report: initially there was the intent to provide more detail on these discrepancies only for NBS and EBS records. But that seemed to be more detail than appropriate for a Tech Memo. The most recent draft of that contains a compromise of level of detail.
  3. Ad hoc consultation via these office hours and github issues. Perhaps an appropriate avenue for "in the weeds" discussions about specific use cases and discrepancies.

Most of these discrepancies in the GOA and AI historical tables will be due to changes in the input data.

(skip this paragraph to avoid details..) For example the species vouchering process changes records in RACEBASE.CATCH but these changes aren't usually updated in the historical AIGOA CPUE table. The AI 2022 stratum 313 example discrepancy is because one haul in stratum 313 in 2022 was "removed" from standard calculations in RACEBASE.HAUL (ABUNDANCE_HAUL Y -> N) after the fact. That change was propagated to the AI.CPUE table but not to the AI.BIOMASS_STRATUM table (and subsequent AI sizecomp/agecomp tables). Processed otoliths data are throughout the year being added to RACEBASE.SPECIMEN and depending on the timing of when that occurs and when the AI.AGECOMP table are updated, there may be mismatches.

All of that to say, the practice of using the most updated input data coming from RACEBASE and RACE_DATA and recalculating cpue/indices/comps for all years and all regions in the GAP_PRODUCTS workflow is perceived as an improvement to the way we (in the AIGOA group) do things (that is my opinion). The challenge of that is tracking changes, which highlights the discussion we had last office hour about changelogs, leveraging audit tables, triggers, etc.

MattCallahan-NOAA commented 4 months ago

Thanks for the detailed response! I am glad you found the query useful, hopefully it saves you some time in the future when diagnosing data differences. This is such a general issue that I think you can just close it since as you say the level of detail necessary to fully address may be more appropriate for a tech memo than a github issue.