afsc-gap-products / gap_products

This repository supports code used to create tables in the GAP_PRODUCTS Oracle schema. These tables include the master production tables, tables shared with AKFIN, and tables publicly shared on FOSS.
https://afsc-gap-products.github.io/gap_products/
Creative Commons Zero v1.0 Universal
5 stars 5 forks source link

Missing Pacific Cod in AKFIN_SPECIMEN #20

Closed EmilyMarkowitz-NOAA closed 4 months ago

EmilyMarkowitz-NOAA commented 5 months ago

Reposting this issue emailed to me by @zandjyo. Thanks for exploring this, Steve!

Issue Description

There seems to be some differences between AKFIN_SPECIMEN and RACEBASE.SPECIMEN. @zandjyo: "I had used specimen going back to 1983, the AKFIN_SPECIMEN only has data to 1993 for BS cod and AKFIN_LENGTH has some issues with missing data as well."

This is likely an issue with cruisejoin and cruise_id join on https://github.com/afsc-gap-products/gap_products/blob/main/code/sql_akfin/AKFIN_SPECIMEN.sql#L37

### Tasks
- [x] Find reason for mismatch and correct data
- [x] Provide reasoning for why tables are not matching
- [ ] Add reason for mismatch to data changes doc in data report
- [ ] Make into AKFIN data usage example

Steps to Reproduce

From @zandjyo: "Checking on this table shows some missing data as well. Pulling cod data we are missing all the 1980s except 1985, and all the 1990s except 1993, then the 2000s start in 2005 when pulling cod lengths for survey_definition_id in (98,143). There should be length data going back to at least 1982 for every year except 2020.

"Here is my SQL script for this:" image

Comparison of last year's BS pcod size comps with this year from the GAP_PRODUCTS. Very minor changes, mostly completely overlapping. image

zoyafuso-NOAA commented 5 months ago

I will be without network access for the next two weeks so I am heavily leaning on @Ned-Laman-NOAA to figure this one out. The same issue is in play for the AKFIN_LENGTH query so whatever you revise for AKFIN_SPECIMEN will need to be done for AKFIN_LENGTH. A clue to how to resolve this is to look at the query for AKFIN_CRUISE where you join vessel ID, cruise, survey ID, survey_definition_id, and organization ID. Once the problem is found and settled, @EmilyMarkowitz-NOAA, can you work with Ned to replace the revised SQL code via GitHub? Thanks all.

Ned-Laman-NOAA commented 5 months ago

I have revised the two queries for LENGTH and SPECIMEN data and they no reproduce the master cruise list query for AKFIN (see SQLPlus below). The issue was that the constraint statement abs(CRUISE_ID) = CRUISEJOIN was not correct. It seems that abs(CRUISE_ID) does match CRUISEJOIN for reconciled data in RACE_DATA (2005 to present), but does not match CRUISEJOIN for unreconciled cruises preceding 2005.

Here is the Master Cruise List for AKFIN that I was trying to match:

-- master cruise list for AKFIN
select distinct 
a.cruisejoin,
a.region,
a.vessel,
b.cruise, 
d.survey_definition_id, 
d.survey_name, 
d.description,
e.name vessel_name,
f.acronym sponsor_acronym, 
c.start_date date_start, 
c.END_DATE date_end
from racebase.haul a, 
race_data.cruises b, 
race_data.surveys c, 
race_data.survey_definitions d, 
race_data.vessels e, 
race_data.organizations f
where a.vessel = b.vessel_id
and b.vessel_id = e.vessel_id
and a.cruise = b.cruise 
and c.survey_id =  b.survey_id 
and c.survey_definition_id = d.survey_definition_id 
and d.survey_definition_id in (143,98,47,52,78)
and a.abundance_haul = 'Y'
and f.organization_id = c.sponsor_organization_id
order by a.region, b.cruise
/
-- 187 cruises and 187 vessel, cruises

Here is the corrected code for LENGTH and SPECIMEN. I've cleaned it up a bit from the original by eliminating some commented out lines and such so that the only commented out lines of code are those that I replaced with the ALL CAPS lines:

select a.hauljoin,
species_code, 
sex, 
frequency, 
length as length_mm, 
length_type, 
sample_type 
from racebase.length a, race_data.cruises b, race_data.surveys c, 
race_data.survey_definitions d, racebase.haul f
--where abs(a.cruisejoin) = b.cruise_id -- change
WHERE A.CRUISE = B.CRUISE -- change
AND A.VESSEL = B.VESSEL_ID -- change
and a.hauljoin = f.hauljoin
and c.survey_id =  b.survey_id 
and c.survey_definition_id = d.SURVEY_DEFINITION_ID 
and d.survey_definition_id in (143, 98, 47, 52, 78)
and f.abundance_haul = 'Y'
/

and for SPECIMEN

select  a.hauljoin, 
a.specimenid as specimen_id, 
a.species_code, 
a.length as length_mm, 
a.sex, 
a.weight as weight_g, 
a.age, 
a.maturity, 
a.gonad_wt as gonad_g, 
a.specimen_subsample_method, 
a.specimen_sample_type, 
a.age_determination_method 
from racebase.specimen a, 
race_data.cruises b, 
race_data.surveys c, 
race_data.survey_definitions d, 
racebase.haul f
-- where abs(a.cruisejoin) = b.cruise_id -- change
WHERE A.CRUISE = B.CRUISE -- change
AND A.VESSEL = B.VESSEL_ID -- change
and a.hauljoin = f.hauljoin
and c.survey_id =  b.survey_id 
and c.survey_definition_id = d.SURVEY_DEFINITION_ID 
and d.survey_definition_id in (143, 98, 47, 52, 78)
and f.abundance_haul = 'Y';
Ned-Laman-NOAA commented 5 months ago

Note that for the LENGTH query above, it reproduces that same 187 cruises as the Master Cruise list. For SPECIMEN, it reproduces 186 cruises of the 187 from the Master List because GOA 198402, Vessel 57 did not have any specimen data recorded in RACEBase.

EmilyMarkowitz-NOAA commented 5 months ago

Thanks for this, Ned! I just added the edits you've noted above to the AKFIN_LENGTH and AKFIN_SPECIMEN.

Next step is to review offending abundance_haul designations/rules to change which cruises get included in the AKFIN_CRUISES table, and therefore what data are included in these tables. This is discussed in https://github.com/afsc-gap-products/gap_public_data/issues/10.

I'll let @zoyafuso-NOAA close this issue once he reruns the gap_products tables

zoyafuso-NOAA commented 5 months ago

Thanks @Ned-Laman-NOAA and @EmilyMarkowitz-NOAA, I appreciate your help with this issue. When I get back to the states on the 10th, I'll rerun and follow up with @zandjyo. Sorry @zandjyo for the slight delay.

zoyafuso-NOAA commented 5 months ago

Hi @zandjyo

I updated all of the the AKFIN_* tables on the GAP_PRODUCTS schema with @Ned-Laman-NOAA revised SQL code from 8521fcf7986616a692fb556f94bd992abd8d893b. I ran Steve's query from above (in text below) and was able to now get length records from 1982-2023 (sans 2020). Steve, can you compare your records again and get back to us? Thanks all!

SELECT GAP_PRODUCTS.AKFIN_LENGTH.SPECIES_CODE, GAP_PRODUCTS.AKFIN_CRUISE.YEAR, SUM(GAP_PRODUCTS.AKFIN_LENGTH.FREQUENCY) AS "SURVEY_LENGTH"

FROM GAP_PRODUCTS.AKFIN_HAUL

INNER JOIN GAP_PRODUCTS.AKFIN_CRUISE ON GAP_PRODUCTS.AKFIN_CRUISE.CRUISEJOIN = GAP_PRODUCTS.AKFIN_HAUL.CRUISEJOIN

INNER JOIN GAP_PRODUCTS.AKFIN_LENGTH ON GAP_PRODUCTS.AKFIN_HAUL.HAULJOIN = GAP_PRODUCTS.AKFIN_LENGTH.HAULJOIN

WHERE GAP_PRODUCTS.AKFIN_CRUISE.SURVEY_DEFINITION_ID IN (98,143) AND GAP_PRODUCTS.AKFIN_LENGTH.SPECIES_CODE IN (21720)

GROUP BY GAP_PRODUCTS.AKFIN_LENGTH.SPECIES_CODE, GAP_PRODUCTS.AKFIN_CRUISE.YEAR

ORDER BY GAP_PRODUCTS.AKFIN_LENGTH.SPECIES_CODE, GAP_PRODUCTS.AKFIN_CRUISE.YEAR

(Edit: adding photo of output from AFSC Oracle -- @EmilyMarkowitz-NOAA) image

zoyafuso-NOAA commented 4 months ago

I'm going to close this issue, I think we've resolved this. If Steve or anyone else comes back to us and it's not fully resolved, we can reopen this issue.