afsc-gap-products / gap_products

This repository supports code used to create tables in the GAP_PRODUCTS Oracle schema. These tables include the master production tables, tables shared with AKFIN, and tables publicly shared on FOSS.
https://afsc-gap-products.github.io/gap_products/
Creative Commons Zero v1.0 Universal
5 stars 5 forks source link

Duplicated length records #22

Closed MattCallahan-NOAA closed 4 months ago

MattCallahan-NOAA commented 5 months ago

Issue

Hey I got a question from @rclairer about duplicate rows in the gap_products.akfin_length table, is this expected? Here are a couple of examples but the duplicates seem widespread.

--4 duplicated rows select * from gap_products.akfin_length where hauljoin = -6619 and species_code = 21740 and sex = 1 and frequency = 1 and length = 620 and length_type = 1;

--7 duplicated rows select * from gap_products.akfin_length where hauljoin = -5812 and species_code = 30576 and length = 570 and sex = 1 and frequency = 1 and length_type = 1 and sample_type = 1;

zoyafuso-NOAA commented 5 months ago

Hi @MattCallahan-NOAA and @rclairer. Thanks for this flag. This may be related to the #20 issue. We have a potential fix to the query that creates the AKFIN length table. When I get back later this week, I will rerun those corrected queries and then check back in with this issue.

MattCallahan-NOAA commented 5 months ago

Awesome, thanks Zack, hopefully that fix solves this too.

zoyafuso-NOAA commented 5 months ago

Hi @MattCallahan-NOAA

I updated all of the the AKFIN_* tables on the GAP_PRODUCTS schema with @Ned-Laman-NOAA revised SQL code from https://github.com/afsc-gap-products/gap_products/commit/8521fcf7986616a692fb556f94bd992abd8d893b. I ran these queries and don't seem to get duplicates anymore. Can you confirm and if you're good, close the comment.

EmilyMarkowitz-NOAA commented 4 months ago

Super! Thanks for finding this issue @MattCallahan-NOAA and for sorting this, @zoyafuso-NOAA ! Just to confirm, I see that:

There are no and sex = 1 and length_mm = 620 entries from the first query now - is this right?

image

And these look like the second query has been successfully been un-duplicated:

image

Do these look right? Note @MattCallahan-NOAA, you'll need to look at the AFSC oracle to see these as we haven't asked OFIS to create the oracle dump file for AKFIN yet.

MattCallahan-NOAA commented 4 months ago

I suspect that there is some sort of new issue with missing data.

I get 62 distinct pollock length bins in haul -6619 on the AKFIN side

Also the row count for the akfin_length table went from 148 million to 4 million, which is a much bigger drop than I would expect.

The second example seems to be resolved though.

zoyafuso-NOAA commented 4 months ago

I didn't even think to check of the sizes of these tables, good catch. 4.6 million records is within the range of reason, as RACEBASE.LENGTH only has 7 million records.

A good check on HAULJOIN -6619 is to do that same query on RACEBASE.LENGTH and I can confirm that there should be four unique length records.

select * from GAP_PRODUCTS.AKFIN_LENGTH where hauljoin = -6619 and species_code = 21740

select * from racebase.length where hauljoin = -6619 and species_code = 21740

MattCallahan-NOAA commented 4 months ago

Great, this will make querying the length table way faster!

Thanks for the tip of going back to RACEBASE for QC questions too.

If you're happy with the table than so am I.