Time cut-offs for including species in CPUE tables

sean-rohan-NOAA commented 7 months ago

Issue

Has there been any effort to check that all species in the CPUE tables would have been identified to the taxonomic level by the date of first possible identification?

Currently, it appears there are some species that have haul-level CPUEs in gap_products.cpue from before they were added to RACEBASE and I'm wondering if they would have been identified during our surveys by the time they were included. It looks like that isn't the case for all species (e.g. NRS in the EBS), but I'm wondering how those decisions were made and whether that was consistent across species.

For example, species_code 95041 (noodle bryozoans, Alcyonidium enteromorpha) was added to RACEBASE in 2013.

select * from racebase.species where species_code = 95041

There are zero-filled values for 2010 (all zeros).

select f.year, f.haul, f.station stationid, f.longitude_dd_start longitude,
    f.latitude_dd_start latitude, c.cpue_kgkm2/100 cpue_kgha
    from gap_products.cpue c, gap_products.foss_haul f
    where c.hauljoin = f.hauljoin 
    and f.survey_definition_id = 143
    and f.year = 2010
    and c.species_code = 95041 
    order by cpue_kgha desc

@Duane-Stevenson-NOAA Does that mean there were actually no noodle bryozoans in the NBS in 2010, or were we just not identifying them to species? Although noodle bryozoans were added in 2013, it seems like the cutoff for inclusion as a whole should not be the date they were added because they were identified during the 2012 Chukchi Sea survey:

select * from racebase.catch where species_code = 95041 and cruise < 201300

zoyafuso-NOAA commented 7 months ago

Hi Sean,

Thanks for noticing this and great points. Within the GAP_PRODUCTS production code, the call to gapindex::calc_cpue zero-fills records for all species_codes for all of the survey years for a given survey region. So for that noodle bryozoan example, my guess is that they weren't identified to species in that 2010 survey and are probably false zeros.

There is a later step in the process that applies a year cutoff for only a handful of fish taxa. That lookup table is GAP_PRODUCTS.SPECIES_YEAR, and it reflects the temporal stanzas included in Table 1 of these EFH documents: GOA, EBS, AI

For example, for ATF/Kams, the YEAR_ADDED in RACEBASE.SPECIES is 1984 but the cutoff year in these tables is 1992 to (I presume) reflect our taxonomic confidence/consistency of identification in our surveys. So the discrepancy between the year in the YEAR_ADDED field in RACEBASE.SPECIES and the starting year in these EFH documents prevented us from using the YEAR_ADDED field in RACEBASE.SPECIES as a filter for the GAP_PRODUCTS tables. We went back and forth with @Ned-Laman-NOAA about the temporal stanzas and came to the conclusion to only provide the data for the years when we are confident about the taxonomic identification in order to dissuade users from misinterpreting the data.

However, now that you mentioned that bryozoan example, we're probably introducing false zeros for other species_codes that were created later in the time series by not using the YEAR_ADDED field in RACEBASE.SPECIES. Maybe we could still use the YEAR_ADDED field in RACEBASE.SPECIES as an initial time cutoff to exclude years before a given species_code existed (thereby removing false zeros), and then apply this second GAP_PRODUCTS.SPECIES_YEAR cutoff to reflect changes in taxonomic confidence. It's kind of messy but definitely doable and I'm open to other thoughts.
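A minimal sketch of that two-step filter, written in Python against invented rows rather than the real Oracle tables. The dictionaries, the `apply_cutoffs` helper, and the placeholder code 99999 are hypothetical. Only the idea of applying YEAR_ADDED first and the SPECIES_YEAR cutoff second comes from the paragraph above.

```python
# Toy stand-ins for RACEBASE.SPECIES.YEAR_ADDED and GAP_PRODUCTS.SPECIES_YEAR.
# 95041/2013 and the 1984/1992 pair come from this thread; 99999 is a
# placeholder code, not a real SPECIES_CODE.
year_added = {95041: 2013, 99999: 1984}
species_year = {99999: 1992}

# Zero-filled haul-level rows: (species_code, year, cpue_kgkm2). Values invented.
cpue_rows = [
    (95041, 2010, 0.0),
    (95041, 2017, 12.5),
    (99999, 1988, 3.1),
    (99999, 1993, 4.2),
]


def apply_cutoffs(rows, year_added, species_year):
    """Keep rows from the later of the two start years onward, per code."""
    kept = []
    for code, year, value in rows:
        # A code missing from a table gets no cutoff from that table.
        start = max(year_added.get(code, 0), species_year.get(code, 0))
        if year >= start:
            kept.append((code, year, value))
    return kept


filtered = apply_cutoffs(cpue_rows, year_added, species_year)
print(filtered)
```

A real version would still need a rule for codes that show up in RACEBASE.CATCH before their YEAR_ADDED, like the 2012 Chukchi records mentioned earlier.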

sean-rohan-NOAA commented 7 months ago

Thanks for the detailed explanation, Zack! I wonder if one approach could be to evaluate which species have CPUEs before the first year they were identified to get a sense for how many species this could apply to.
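As a rough illustration of that first check, here is a Python sketch on invented rows. It uses the first year with a positive CPUE as a stand-in for the first year a code was identified, which is an assumption; the real check would compare against RACEBASE.CATCH. The helper names and the code 11111 are hypothetical.

```python
# Invented rows: (species_code, year, cpue_kgkm2). 11111 is a placeholder code.
cpue_rows = [
    (95041, 2010, 0.0),
    (95041, 2017, 8.0),
    (11111, 2010, 5.0),
    (11111, 2017, 0.0),
]


def first_positive_year(rows):
    """Earliest year with a positive CPUE for each species code."""
    first = {}
    for code, year, value in rows:
        if value > 0 and year < first.get(code, float("inf")):
            first[code] = year
    return first


def codes_with_earlier_rows(rows):
    """Codes that have CPUE rows dated before their first positive record."""
    first = first_positive_year(rows)
    flagged = {
        code for code, year, _ in rows
        if code in first and year < first[code]
    }
    return sorted(flagged)


flagged = codes_with_earlier_rows(cpue_rows)
print(flagged)
```

Counting the flagged codes would give a first sense of how many species this could apply to.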

Maybe another approach would be to figure out if there were sudden shifts in the rate of detection in our surveys (from zero to regular detection), since that might provide a useful reference point for determining whether we were either not identifying or misidentifying something prior to a certain year.
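One way to look for that kind of shift, sketched in Python with invented haul records. The function names are hypothetical, and "regular detection" is taken here to mean every surveyed year from some point onward has at least one positive haul, which is only one possible definition.

```python
# Invented haul records for a single species code: (year, cpue_kgkm2).
hauls = [
    (2010, 0.0), (2010, 0.0),
    (2012, 0.0), (2012, 4.0),
    (2013, 2.0), (2013, 0.0),
    (2014, 1.0), (2014, 3.0),
]


def detection_rate_by_year(rows):
    """Share of hauls with a positive CPUE in each surveyed year."""
    totals, positives = {}, {}
    for year, value in rows:
        totals[year] = totals.get(year, 0) + 1
        positives[year] = positives.get(year, 0) + (1 if value > 0 else 0)
    return {year: positives[year] / totals[year] for year in sorted(totals)}


def first_regular_detection(rates, threshold=0.0):
    """Earliest year from which every later surveyed year exceeds threshold."""
    candidate = None
    for year in sorted(rates, reverse=True):
        if rates[year] > threshold:
            candidate = year
        else:
            break
    return candidate


rates = detection_rate_by_year(hauls)
shift_year = first_regular_detection(rates)
print(rates, shift_year)
```

Raising `threshold` above zero would make the definition stricter for taxa that were occasionally recorded by one or two people before becoming a routine ID.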

It does seem like if species don't show up in racebase prior to the year they were added and we had low ID confidence (based on Duane and Jerry's work), zero CPUEs prior to the year (or two years?) added may be a bit problematic.

Ned-Laman-NOAA commented 7 months ago

Is there any chance that part of the solution to these taxonomic changes lies in a connection to the Voucher and Taxonomic systems @SarahFriedman-NOAA is developing?

SarahFriedman-NOAA commented 7 months ago

Yes, Zack and I have discussed combining the species_year and the taxonomic_changes tables, which could then be used to cross-reference prior to CPUE calculations. This doesn't necessarily solve the discrepancy between when the species is described and when we are confident in the ID, but it would be a step. I'm very much open to discussing a good pipeline here and tweaking the taxonomic tables to reflect a better/more efficient workflow.

Duane-Stevenson-NOAA commented 7 months ago

What if we only provide zero-filled CPUE info beginning with the year the species code was first used? That wouldn't solve all the problems, but would avoid problems like zero-filling noodle bryozoans for all the years before we recognized them.

Duane


-- Duane Stevenson, Ph.D. Supervisory Fish Biologist Groundfish Assessment Program NMFS, Alaska Fisheries Science Center

Ned-Laman-NOAA commented 7 months ago

I'm not sure I understand why it's potentially problematic to have CPUE = 0 for noodle bryozoan in 1982? Presumably it existed then even though we didn't identify it beyond bryozoan unid.? Perhaps this becomes a bigger issue after the 2 people who can identify noodle bryozoan are happily recording it while everyone else on other boats and cruise legs are rolling it up into the unid. bryozoan complex!

sean-rohan-NOAA commented 7 months ago

Because we know they're false zeros, which seems to be why NRS and SRS CPUEs aren't provided before 1996 but Lepidopsetta sp. CPUEs are. NRS and SRS also existed before 1996.


-- Sean K. Rohan, PhD (he/him) Research Fish Biologist Bering Sea Bottom Trawl Survey Group Resource Assessment and Conservation Engineering Division NOAA Alaska Fisheries Science Center 7600 Sand Point Way NE Seattle, WA 98115 @.***

Ned-Laman-NOAA commented 7 months ago

If that's the case, then the timing of definitive field identification will be the cutoff for producing nominal CPUEs, like Zack showed for the ATF/Kams. Would we base these cutoffs on the outcomes of the two Orr et al. and the Stevenson et al. ID confidence Tech Memos?

zoyafuso-NOAA commented 7 months ago

Lot of great thoughts here, thanks for the discussion. I think this is a new problem because we are not just providing data products on a very small subset of species/complexes but for all of the SPECIES_CODES. I have a few responses to the thread here and another tangent:

Ned: we could base the starting years off the taxonomic confidence tables provided in the Orr and Stevenson tech memos. Luckily, as part of the GAP_PRODUCTS production script, the CPUE tables used for FOSS are created and the taxonomic confidence grades that are used in those tech memos are appended to the table, i.e., 1 for high, 2 for medium, 3 for low taxonomic confidence. We have those as tables in a google folder here. Offline, I would need more information on some of the details of the tables and how to properly use them.
Is the YEAR_ADDED field in RACEBASE.SPECIES a reliable field to use as the starting year and if not, why does it exist? Is it really the year the species code was created? I don’t think it is, at least for all the records in RACEBASE.SPECIES (and Sean provided an example of that above).
I can see Duane’s idea of using the starting year for a SPECIES_CODE as the first year we observe it in RACEBASE.SPECIES but this may remove true zeros for years before the first year of a positive observation.
Sarah: I continue to be stymied by how taxonomic changes can occur over time but not be retroactively applied to the database but I am supportive of the creation of a better way to version our taxonomic information and integrate it with our products.

Now the tangent (but going back to the original issue) question is: why should we provide public-facing data for SPECIES_CODE values at a higher taxonomic resolution than what is expected when we are identifying organisms on deck? For bryozoans, the minimum ID on deck is unid. Bryozoan (so, Phylum?) so that noodle Bryozoan could have been specifically identified by some particular person and maybe not have been consistently and correctly identified across vessels, survey years, and regions. We can think of other examples of this, e.g., Porifera, brachiopods, anemones, etc. Shouldn’t there be taxonomic consistency between what we identify on deck and the products we serve? The gapindex R package can handle complexes and so we could just provide data on Bryozoa, a taxon complex that contains all of the SPECIES_CODES with PHYLUM == "Bryozoa". Then, if someone outside of GAP wants data for particular bryozoan inverts at higher taxonomic resolutions, they go through the data-requests repo and then someone can be the GAP knowledge bearer and flesh out how well their request matches our level of taxonomic confidence.
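To make the complex idea concrete, here is a small Python sketch of rolling species-level CPUE up to a phylum-level code within each haul. This is not how gapindex does it internally; the mapping, the made-up code 95999, and the `aggregate_to_group` helper are assumptions for illustration. Summing works here because all records in a haul share the same effort.

```python
# Hypothetical mapping from species-level codes to a phylum-level code.
# 95041 -> 95000 follows the bryozoan example in this thread; 95999 is made up.
group_of = {95041: 95000, 95999: 95000}

# Invented catch rows: (hauljoin, species_code, cpue_kgkm2).
catch = [
    (1, 95041, 2.0),
    (1, 95999, 3.0),
    (2, 95041, 0.0),
    (2, 95999, 1.5),
]


def aggregate_to_group(rows, group_of):
    """Sum CPUE per haul after relabelling each code with its group code."""
    totals = {}
    for haul, code, value in rows:
        # Codes without a group entry are passed through unchanged.
        key = (haul, group_of.get(code, code))
        totals[key] = totals.get(key, 0.0) + value
    return totals


grouped = aggregate_to_group(catch, group_of)
print(grouped)
```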

sean-rohan-NOAA commented 7 months ago

+1 to @zoyafuso-NOAA's idea. My interpretation is that CPUE tables are intended to be analysis-ready products and this seems like a reasonable option to provide analysis-ready products to meet user needs/goals. My sense is that would help reduce the incidence of conditional (and presence-only) records that happen to include CPUE.

Duane-Stevenson-NOAA commented 7 months ago

In theory, I don't have a problem with Zack's suggested approach of aggregating CPUE data to the minimum ID level. However, I suspect that it will be more labor-intensive. First, the code used to generate the CPUE table will have to do all the grouping somehow, and of course if one of our minimum ID standards changes, the code will have to be changed to reflect that. Second, I suspect that we'll get a lot more data requests for species-specific data, although it could certainly help us head off some problems with inappropriate data interpretations. Third, we have to decide if this approach is appropriate for our FOSS data as well.


sean-rohan-NOAA commented 7 months ago

On @Duane-Stevenson-NOAA's second point, although there may be more data requests, would having that friction in the process help to fulfill user needs/goals?

In case it's helpful, REEM uses a look-up table of standard taxonomic levels for prey taxa, which I think they still use to generate taxonomically aggregated data products. The table helps handle time-varying ID standards and maintain aggregated data that can be updated when taxa are added or revised:

https://apps-afsc.fisheries.noaa.gov/REFM/REEM/DietData/dietmap.html

I'm wondering if Geoff Lang would have some ideas about the pros/cons.

Ned-Laman-NOAA commented 7 months ago

Great points all around.

I also don't have a problem with aggregating species to the minimum ID level for CPUE calculations. There is some consistency in that and we ought to be able to leverage the taxonomic system for the aggregation. Duane is right, though, that if minimum ID changes are implemented, there would be some maintenance on a minimum ID lookup table, but don't we do something like this in the GOA-AI for the Ops Manual annually anyway? We could operationalize this in Oracle. However, Sarah should weigh in about maintenance of the minimum ID levels and lists before we get too far down that road.

Another item of note is that the ID confidence assignments vary regionally (e.g., Mycale loveni 91040: [region(confidence)year] AI(3)1991; EBS(1)1982; GOA(3)1993) so that unifying all of this info into a normalized table will take a little work.
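For what a normalized layout might look like, here is a Python sketch that splits the shorthand above into one row per region. The column names and the `parse_confidence` helper are hypothetical, and the sketch does not interpret what the year means beyond carrying it through.

```python
import re

# Matches entries written as REGION(confidence)year, e.g. "AI(3)1991".
ENTRY = re.compile(r"([A-Z]+)\((\d)\)(\d{4})")


def parse_confidence(species_code, text):
    """Turn 'AI(3)1991; EBS(1)1982' style shorthand into one row per region."""
    return [
        {
            "SPECIES_CODE": species_code,
            "REGION": region,
            "CONFIDENCE": int(confidence),
            "YEAR": int(year),
        }
        for region, confidence, year in ENTRY.findall(text)
    ]


rows = parse_confidence(91040, "AI(3)1991; EBS(1)1982; GOA(3)1993")
for row in rows:
    print(row)
```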

I'm not totally sure how to use the YEAR_ADDED field. It looks to me like the minimum YEAR_ADDED was 1974, while the minimum field ID year from RACEBASE.CATCH is 1953 for ATF. I'm not really sure how long the RACEBASE.SPECIES table's history is (1974?). Does anyone know if it was migrated from the Burroughs to Oracle, or was it created for the first time in Oracle, which would've been in the late '80s or early '90s? I'm wondering if there was a default value used in the YEAR_ADDED field corresponding to the instantiation of the SPECIES table.

SarahFriedman-NOAA commented 7 months ago

The minimum ID table is theoretically updated annually, though changes are rarely made, so I imagine it wouldn't heavily impact CPUE calculations in the future. It seems like this would be a straightforward way to resolve the issue.

zoyafuso-NOAA commented 6 months ago

So unfortunately I think this probably requires a meeting in the new year to sort out the priorities of these potential changes because I agree that there will be some level of work involved here. I've laid out a general plan of attack below for how to execute this within the GAP_PRODUCTS Oracle schema and gap_products repo.

I don’t think we need to put forth all of these changes stated below immediately because the inclusion of many of these invertebrate data are already brand new additions to what users are expecting with our data product tables (e.g., none of the current tables include any shrimp taxa). And through all of this, I am not sure what the implications to the FOSS tables are, but I imagine these changes will be mirrored in the FOSS CPUE tables.

Inverts and Fish: only SPECIES_CODES at the species level are included in the tables, e.g., excludes records for fish genera like Careproctus sp. or Lepidopsetta sp. Refer to EFH documents, the Taxonomic Confidence tables, and our taxonomists for fish/invert species SPECIES_CODE values that were created post 1982 already tracked in GAP_PRODUCTS.SPECIES_YEAR. Would excluding SPECIES_CODE values for these complexes affect how stock assessors get their data for stock complexes (REBS, rock soles, etc.)?
Taxonomic aggregations at the Phylum level are included in the tables using the SPECIES_CODE values that code for phyla (e.g., Bryozoans 95000). I would imagine that the Phylum umbrella precludes the need for taxonomic confidence
A lookup table would need to be created that lists SPECIES_CODE and GROUP (aggregate taxonomic SPECIES_CODE) and potentially START_YEAR. This would be the format used by gapindex::get_data()
Invertebrates with minimum ID levels not at Phylum or species are included. Make sure that there are SPECIES_CODE values for these taxonomic aggregations and if not, create new SPECIES_CODE values (is this an easy task?), e.g., there are species codes for shrimp species Pseudoliomesus ooides and Pseudoliomesus canaliculatus but not Pseudoliomesus sp. as a genus. Refer to our taxonomists for SPECIES_CODE values that were created post 1982. Taxonomic confidence grades are only for species-level identifications (right?) so we don’t need to refer to these tables?
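A sketch of how the SPECIES_CODE / GROUP / START_YEAR lookup described above could be assembled, in Python with invented inputs. The column names follow the list above, but whether gapindex::get_data() expects exactly this layout is not confirmed here. The `build_lookup` helper, the codes 95999 and 99999, and the 1982 default start year are assumptions.

```python
# Invented taxonomy rows. 95041 is from this thread; 95999 and 99999 are
# placeholder codes used only for illustration.
taxonomy = [
    {"SPECIES_CODE": 95041, "PHYLUM": "Bryozoa"},
    {"SPECIES_CODE": 95999, "PHYLUM": "Bryozoa"},
    {"SPECIES_CODE": 99999, "PHYLUM": "Chordata"},
]

# Phyla served as a single complex, keyed to the code for the phylum.
phylum_group_code = {"Bryozoa": 95000}

# Codes with a later start year, as in GAP_PRODUCTS.SPECIES_YEAR.
start_year = {99999: 1992}


def build_lookup(taxonomy, phylum_group_code, start_year, default_start=1982):
    """One row per SPECIES_CODE with its reporting GROUP and START_YEAR."""
    lookup = []
    for row in taxonomy:
        code = row["SPECIES_CODE"]
        lookup.append({
            "SPECIES_CODE": code,
            # Codes outside an aggregated phylum report as themselves.
            "GROUP": phylum_group_code.get(row["PHYLUM"], code),
            "START_YEAR": start_year.get(code, default_start),
        })
    return lookup


lookup = build_lookup(taxonomy, phylum_group_code, start_year)
for row in lookup:
    print(row)
```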

SarahFriedman-NOAA commented 5 months ago

There is a related complication to this topic. There are some species that are re-identified based on geographic range, such that the original name is still valid but the species we encounter is under a different name. For example, we have many records of Elassodiscus caudatus from Alaska, but due to a 2020 publication, it has come to my attention that the AK species is now referred to as E. nyctereutes, and E. caudatus is the species located further south that we do not encounter on our surveys. What should we do with the historical E. caudatus records in our database? Especially now that we will need a new species code. We cannot be sure that they are in fact E. nyctereutes. How do we calculate CPUE?

I know this is a snailfish example, but this situation does crop up from time to time and I don't think we have a standardized approach for documenting/dealing with these records.

sean-rohan-NOAA commented 5 months ago

Interested to see how this is handled. When this happened with sand lance in 2015, the species_name for code 20202 was changed from Ammodytes hexapterus to Ammodytes sp. and the common_name changed from Pacific sand lance to sand lance unid. New codes were added for Ammodytes hexapterus (20203) and Ammodytes personatus (20204).


SarahFriedman-NOAA commented 5 months ago

@sean-rohan-NOAA is there documentation of that decision somewhere or is this word-of-mouth knowledge that has been passed around? I think we probably need some better documentation if it is the latter. Perhaps adding more detail to the SPECIES_YEAR or taxonomic_changes table? This is a slightly different situation from the snailfish example above, but similar issues to the Ammodytes situation are known with blackspotted/rougheye, NRS/SRS, etc. I do not know what happened to the historical records once the modern taxonomy was realized in those instances.

@Duane-Stevenson-NOAA What about the Lycodes beringi/L. diapterus(?) example we discussed, which is probably most analogous to the current snailfish dilemma?

sean-rohan-NOAA commented 5 months ago

No clue. I heard about the change through word-of-mouth when I was in the food habits lab. In the lab, we discussed whether GOA and Chukchi records in the foodlab tables should remain A. personatus in the GOA and A. hexapterus in the Chukchi based on the ranges from Orr et al. (2015). Can't remember what decision was made.

Duane-Stevenson-NOAA commented 5 months ago

We do have an exact precedent for this. In a paper published in 2009, Stevenson and Sheiko determined that the species previously known as Lycodes diapterus from the west coast through Alaska actually represents two species. The true L. diapterus is a west coast species, ranging north only as far as Vancouver Island. The other form, L. beringi, is the form found throughout Alaska's waters. Our data response to that was to change all L. diapterus records in RACEbase from Alaska (i.e., not from Region = WC) to L. beringi. Thus, currently there are no records of L. diapterus in RACEbase, other than those from historical west coast surveys.

It appears that the Malacocottus issue, which is very similar to the Lycodes issue (Stevenson, 2015), was handled in the same way, though there are a couple of "leaked" M. kincaidi records that have appeared recently.
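For reference, the region-based recode described above amounts to something like the following Python sketch on invented records. The `recode_by_region` helper is hypothetical, and the `original_species` field is an optional addition for traceability, not something that was done in RACEbase.

```python
# Invented records; real RACEbase rows carry species codes, not names.
records = [
    {"region": "GOA", "species": "Lycodes diapterus"},
    {"region": "WC", "species": "Lycodes diapterus"},
    {"region": "AI", "species": "Lycodes beringi"},
]


def recode_by_region(rows, old_name, new_name, keep_regions):
    """Rename old_name to new_name except in the regions listed in keep_regions."""
    updated = []
    for row in rows:
        new_row = dict(row)  # copy so the input records are left untouched
        if row["species"] == old_name and row["region"] not in keep_regions:
            new_row["species"] = new_name
            new_row["original_species"] = old_name  # hypothetical audit field
        updated.append(new_row)
    return updated


recoded = recode_by_region(
    records, "Lycodes diapterus", "Lycodes beringi", keep_regions={"WC"}
)
for row in recoded:
    print(row)
```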

Duane

Stevenson, DE, and Sheiko, BA. 2009. Clarification of the Lycodes diapterus species complex (Perciformes: Zoarcidae), with comments on the subgenus Furcimanus. Copeia 2009: 125-137.

Stevenson, DE. 2015. The validity of nominal species of Malacocottus (Teleostei: Cottiformes: Psychrolutidae) known from the eastern North Pacific with a key to the species. Copeia 103: 22-33.


Ned-Laman-NOAA commented 5 months ago

I'm enjoying spectating the pisces-taxo-nerd-out since I didn't have much expertise to contribute to the specifics of the question.

I do agree with Sarah that we need to consider how to formally and reproducibly document how these kinds of taxonomic alterations are decided and implemented. These will become business rules that we apply to a taxonomic system to preserve original naming and subsequent taxonomy action. Perhaps we need to create a document that we populate with all of these cases and how they were resolved (to the best of our ability) and have it on hand when OFIS designs our new taxonomic system.

EmilyMarkowitz-NOAA commented 5 months ago

Great discussion all. I want to add here (I hope in parallel to this discussion) that I propose 2 new columns for the SPECIES_YEAR table (names negotiable):

SPECIES_CODE_PREV: The species code that these species should be grouped into before the year in the YEAR_STARTED column
YEAR_IMPLIMENTED: A column documenting when we implemented this rule.

This will allow 1) us to be able to calculate total biomass for species in past years (which we can't technically do if we simply remove questionable ID catch observations from the data) and 2) may be infrastructure we can use for species grouping also described here. If we want species to be grouped, there is little stopping us from adding those groups to this table.

- [x] I have already checked with @Duane-Stevenson-NOAA to make sure the species_codes in SPECIES_CODE_PREV column are correct.
  - [ ] The only outstanding question in this SPECIES_CODE_PREV table is what we do about Alaska skate (Arctoraja parmifera; species_code 471), which is not a Bathyraja uid. (405), but would have been considered one at the time.
- [x] Changes have been added in future oracle SPECIES_YEAR to this effect and it is ready for the next GAP_PRODUCTS run.
- [ ] Make sure these new columns, if everyone agrees on them, are documented in the METADATA_COLUMN table
- [ ] Zack, you'll have to change how gapindex uses this table to "cut off" species timelines, right? Currently, the data for these species is removed from the public data, but now it would be complexed into a higher taxonomic grouping.
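A rough Python sketch of the roll-up that SPECIES_CODE_PREV would enable, using invented catch weights. The code 10260 (rock sole unid.) and the 1996 cutoff appear elsewhere in this thread; 10261 as the child code, the weights, and the `roll_up` helper are assumptions for illustration, not the gapindex implementation.

```python
# Hypothetical SPECIES_YEAR rows keyed by SPECIES_CODE.
species_year = {
    10261: {"YEAR_STARTED": 1996, "SPECIES_CODE_PREV": 10260},
}

# Invented catch rows: (year, species_code, weight_kg).
catch = [
    (1990, 10261, 4.0),
    (1990, 10260, 6.0),
    (2000, 10261, 5.0),
]


def roll_up(rows, species_year):
    """Move pre-cutoff records into SPECIES_CODE_PREV, then total by year and code."""
    totals = {}
    for year, code, weight in rows:
        rule = species_year.get(code)
        if rule is not None and year < rule["YEAR_STARTED"]:
            code = rule["SPECIES_CODE_PREV"]
        totals[(year, code)] = totals.get((year, code), 0.0) + weight
    return totals


totals = roll_up(catch, species_year)
print(totals)
```

Because the early records are relabelled rather than dropped, the 1990 total is preserved under the broader code.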

SarahFriedman-NOAA commented 5 months ago

The additional columns make the SPECIES_YEAR table even more redundant with the TAXONOMIC_CHANGES table, which has columns for both OLD_SPECIES_CODE and NEW_SPECIES_CODE as well as year. I am becoming more convinced that these efforts can be combined.

Ned-Laman-NOAA commented 5 months ago

Is the plan to capture intermediate changes in identification (i.e., a species code gets changed more than once OR a species code remains the same but the name gets synonymized once or twice)? Recommend unambiguous field naming conventions. I figured it out in the above, but not before asking myself "The YEAR that which was STARTED or IMPLEMENTED?"

EmilyMarkowitz-NOAA commented 5 months ago

Great point, Sarah. I agree - I also think these can be combined. To Ned's first question, Sarah's TAXONOMIC_CHANGES table (as I recall) allows us to capture those intermediate changes in identification. Do we need to add YEAR/_CHANGED/_STARTED/_IMPLEMENTED (or whatever we'd like to call them; good points, Ned!) to the TAXONOMIC_CHANGES?

SarahFriedman-NOAA commented 5 months ago

I imagine year would just mean year implemented, no? Is there any reason to have a record of a name change that is not implemented in our data naming procedures?

@Ned-Laman-NOAA the taxonomic_changes table is designed to capture both cases. Very much open to changing the structure of the table to make it more useful for our purposes.

zoyafuso-NOAA commented 4 months ago

Reviving this issue in the hopes of closing it:

In light of the very useful discussion in this thread, it seems there are issues with simply providing data on all of the unique SPECIES_CODE values in RACEBASE.CATCH in the GAP_PRODUCTS tables. There are also issues with some of the groupings that are consistent with the minimum on-deck ID standards. Both issues stem from how taxonomic changes are (or are not) handled in our database, confounded by changes in taxonomic confidence over time, among other issues discussed above.

Because many of these taxa (mostly invertebrate taxa) are “new” data that we haven't historically shared in our standard data products, I think we have reason to stage the release of taxa outside of those historically provided over a couple of years, to give us more time to discuss how to deal with some of the nuances of taxonomic changes and confidence over time with some of these taxonomic groupings. I am working off of this spreadsheet, which was created from the minimum ID table on the Survey App for the groupings.

Stage 1 (2024, perhaps as early as the April 2024 version of GAP_PRODUCTS that gets shared with AKFIN)

Invert taxa where the minimum taxonomic ID on deck also wholly describes the taxonomic grouping. For example, Bryozoa is minimally ID’d to phylum on deck and is also defined by PHYLUM_TAXON == 'Bryozoa', so even if there are taxonomic changes to a bryozoan name or species code, it still lives under the Bryozoa (95000) umbrella. This masking should either address or make moot the original issue here. We should have an accompanying table in GAP_PRODUCTS that tells you which SPECIES_CODE values make up each grouping. This will also make it easy for us to check that SPECIES_CODE values aren’t being duplicated across groupings for whatever reason.
All SPECIES_CODE values related to fish taxa. Lycodes spp. and myctophids are aggregated to genus. This will include SPECIES_CODE values that are not to species level (e.g., 10260 rock sole unid.). The same filtering we do with the GAP_PRODUCTS.SPECIES_YEAR table will occur to account for new fish taxa and our taxonomists can let us know of other SPECIES_CODE values that have a different starting year.
Invert taxa where the minimum ID on deck is to species, e.g. squids, sea stars.
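The duplication check mentioned in the first item could look something like this Python sketch. The membership rows are invented (91000 is a made-up group code, and the double assignment of 91040 is deliberate), and `duplicated_codes` is a hypothetical helper rather than anything in the pushed script.

```python
from collections import defaultdict

# Invented (group_code, species_code) rows from a grouping table.
membership = [
    (95000, 95041),
    (95000, 95999),
    (91000, 91040),
    (95000, 91040),  # deliberate error: 91040 assigned to a second group
]


def duplicated_codes(rows):
    """Return species codes assigned to more than one group, with their groups."""
    groups = defaultdict(set)
    for group, code in rows:
        groups[code].add(group)
    return {
        code: sorted(assigned)
        for code, assigned in groups.items()
        if len(assigned) > 1
    }


problems = duplicated_codes(membership)
print(problems)
```

An empty result would mean every SPECIES_CODE belongs to exactly one grouping.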

I pushed a script (e2b2f5d55e0c256ceb5c5ee320fa28662aabfec1) that integrates that spreadsheet and makes a dataframe in a form that gapindex could use to pull data for a mixture of individual species codes and complexes.

Stage 2 (Some time in the future)

The cases where inverts aren't minimally ID'd to species level or to the level that defines the group (e.g., shrimps are contained within Infraorder Caridea but have a minimum on-deck ID standard of genus) are where all those issues of taxonomic changes, confidence over time, new species, and other logistics come into play. For these cases, we have a case to delay the sharing of those taxa in GAP_PRODUCTS until we flesh out those details. And I think that discussion should continue, but outside of this GitHub issue, perhaps in something more formalized like a working group (it makes me sad to see a long text chain like this without resolution).

Duane-Stevenson-NOAA commented 4 months ago

This seems like a reasonable approach to me.

sean-rohan-NOAA commented 4 months ago

I wouldn't have any concerns about that approach as a near-term solution for GAP_PRODUCTS.

SarahFriedman-NOAA commented 4 months ago

That sounds like a fair near-term solution to me. And, yes please on the working group... I've been grappling with a lot of this stuff on my own and it would help to have some other folks involved to bounce ideas off of.

afsc-gap-products / gap_products

Time cut-offs for including species in CPUE tables #16
