Closed MattCallahan-NOAA closed 4 months ago
Thanks @MattCallahan-NOAA...just wanted to note I'm aware there is an error in the NEWTAB subquery because the join between akfin_biomass and akfin_area tables should go through the akfin_survey_design table. I'll updated with a corrected query.
This is fantastic, thanks for this. I'm gonna put this piece of code in my shirt pocket for later. I am not sure how to resolve this issue except to give a long-winded response as you'll see below.
We have internally discussed in the past about how to communicate all of these discrepancies and at what level of detail to do so. The way that we have dealt with it in the development of GAP_PRODUCTS has been on two fronts:
Most of these discrepancies in the GOA and AI historical tables will be due to changes in the input data.
(skip this paragraph to avoid details..) For example the species vouchering process changes records in RACEBASE.CATCH but these changes aren't usually updated in the historical AIGOA CPUE table. The AI 2022 stratum 313 example discrepancy is because one haul in stratum 313 in 2022 was "removed" from standard calculations in RACEBASE.HAUL (ABUNDANCE_HAUL Y -> N) after the fact. That change was propagated to the AI.CPUE table but not to the AI.BIOMASS_STRATUM table (and subsequent AI sizecomp/agecomp tables). Processed otoliths data are throughout the year being added to RACEBASE.SPECIMEN and depending on the timing of when that occurs and when the AI.AGECOMP table are updated, there may be mismatches.
All of that to say, the practice of using the most updated input data coming from RACEBASE and RACE_DATA and recalculating cpue/indices/comps for all years and all regions in the GAP_PRODUCTS workflow is perceived as an improvement to the way we (in the AIGOA group) do things (that is my opinion). The challenge of that is tracking changes, which highlights the discussion we had last office hour about changelogs, leveraging audit tables, triggers, etc.
Thanks for the detailed response! I am glad you found the query useful, hopefully it saves you some time in the future when diagnosing data differences. This is such a general issue that I think you can just close it since as you say the level of detail necessary to fully address may be more appropriate for a tech memo than a github issue.
Issue
GAP_PRODUCTS users may find differences in index value for a given, stratum, species, and year combo between the old legacy tables and the new tables. Lee's issue raised at office hours yesterday may be an example of this.
In order to help identify discrepancies, @jeanlee-akfin wrote an example query finding records with different AI population numbers between the old and new tables. I modified it to run on the AFSC oracle database and limited to records in which both old and new tables had values (There are also a lot of instances in which records are present in one set but not the other). This could be applied to other regions, and tables like sizecomp or agecomp. Stock assessors likely won't care about a lot of these instances involving obscure species but sometimes discrepancies involve commercially important groundfish (e.g. AI strata 313 in 2022). Hopefully this is helpful.
WITH LEGACY AS ( SELECT YEAR, SPECIES_CODE, STRATUM, SURVEY, STRATUM_POP
--FROM AFSC.RACE_BIOMASSSTRATUMAIGOA --Table on AKFIN FROM AI.BIOMASS_STRATUM WHERE SURVEY = 'AI' AND YEAR BETWEEN 2004 AND 2022 ORDER BY YEAR), NEWTAB AS (SELECT YEAR, SPECIES_CODE, a.AREA_ID AS STRATUM, a.SURVEY_DEFINITION_ID, POPULATION_COUNT AS STRATUM_POP FROM gap_products.akfin_biomass a LEFT JOIN GAP_PRODUCTS.AKFIN_AREA b ON a.SURVEY_DEFINITION_ID = b.SURVEY_DEFINITION_ID AND a.AREA_ID = b.AREA_ID WHERE a.survey_definition_id = 52 AND a.year BETWEEN 2004 AND 2022 AND b.AREA_TYPE = 'STRATUM' AND a.AREA_ID NOT IN (793, 794)) -- soon to be resolved duplicates, issue 28 SELECT * FROM (SELECT COALESCE (A.YEAR, B.YEAR) AS YEAR, COALESCE (A.SPECIES_CODE, B.SPECIES_CODE) AS SPECIES_CODE, COALESCE (A.STRATUM, B.STRATUM) AS STRATUM, A.STRATUM_POP AS OLD_VAL, B.STRATUM_POP AS NEW_VAL, CASE WHEN NVL (A.STRATUM_POP, 0) <> NVL (B.STRATUM_POP, 0) THEN 1 ELSE 0 END AS MISMATCH FROM LEGACY A FULL OUTER JOIN NEWTAB B ON A.YEAR = B.YEAR AND A.SPECIES_CODE = B.SPECIES_CODE AND A.STRATUM = B.STRATUM) WHERE MISMATCH = 1 and old_val is not null and new_val is not null and new_val >0;