Template species list missing superseded species names but also missing their new species name

LizziOh commented 10 months ago

One of the reasons divers ask for updated templates before going on a trip is to get the latest list of species names changes integrated with their data entry. It seems that the new species names are missing (perhaps because they lack observations under that name in the database) but also the superseded species name is missing form the template.

For example, when Cheilodactylus ephippium has now recently been superseded by Morwong ephippium and this superseding has been updated in the database. Divers headed to Lord Howe Island want the latest template reflecting name changes, and this fish species has commonly been sighted there. There are 716 rows for species_name Morwong ephippium in the M1 endpoint from Lord Howe Island, however the recorded species name is Cheilodactylus ephippium for all of them - so the db observation table has only records of the old name. When creating a template for the location Lord Howe Island neither species shows up in the m1 species list.

This needs to be rectified as we cannot keep track of the particular fish we have updated and need to be manually added to a location's template species lists. I suggest adding the new names, perhaps by applying the supersedings to the list of species before removing the superseded names. Or even using the m1 endpoints to guide the template data creation.

bpasquer commented 10 months ago

I confirm the issue with the example species both missing from the template.

But running the script in the prod DB to generate the list of observable id and codes does give a record for Cheilodactylus ephippium(because it's been observed in the area) but with the code originating fromMorwong ephippium name(because it's been superseded).

The script to generate data for the template is in api/src/main/java/au/org/aodn/nrmn/restapi/data/repository/LetterCodeRepository.java

WITH stage0 AS (SELECT COALESCE(obsitem.superseded_by, obsitem.observable_item_name) AS species_name, obsitem.observable_item_id, SUM(obs.measure_value) AS abundance 
            FROM nrmn.location_ref loc 
            INNER JOIN nrmn.site_ref site_raw ON site_raw.location_id = loc.location_id 
            INNER JOIN nrmn.ep_site_list site ON site.site_code = site_raw.site_code
            INNER JOIN nrmn.survey sur ON sur.site_id = site_raw.site_id 
            INNER JOIN nrmn.survey_method surmet ON surmet.survey_id = sur.survey_id 
            INNER JOIN nrmn.observation obs ON obs.survey_method_id = surmet.survey_method_id 
            INNER JOIN nrmn.observable_item_ref obsitem ON obsitem.observable_item_id = obs.observable_item_id 
            INNER JOIN nrmn.methods_species ms ON obsitem.observable_item_id = ms.observable_item_id 
            WHERE site_raw.site_code like 'LH%' AND ms.method_id = 1
            GROUP BY obsitem.observable_item_id, species_name, common_name),
stage1 AS (SELECT *,nrmn.abbreviated_species_code(species_name, 3) AS code_len3 FROM stage0),
stage2 AS (SELECT *, ROW_NUMBER() OVER (partition by code_len3 order by abundance desc) AS code_len3_rank FROM stage1),
stage3 AS (SELECT *, CASE WHEN code_len3_rank = 1 THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 4) END code_len4 FROM stage2),
stage4 AS (SELECT *, ROW_NUMBER() OVER (partition by code_len4 order by abundance desc) AS code_len4_rank FROM stage3),
stage5 AS (SELECT *, CASE WHEN (code_len4 is NULL or code_len4_rank = 1) THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 5) END code_len5 FROM stage4),
stage6 AS (SELECT *, CAST(ROW_NUMBER() over (partition by code_len5 order by abundance desc) AS INT) AS code_len5_rank FROM stage5),
stage7 AS (SELECT *, CASE WHEN (code_len5 is NULL or code_len5_rank = 1) THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 5 + code_len5_rank - 1) END AS code_len_max FROM stage6)
SELECT oir.observable_item_id as observableItemId ,LOWER(COALESCE(s7.code_len_max, s7.code_len5, s7.code_len4, s7.code_len3)) AS letterCode FROM nrmn.observable_item_ref oir JOIN stage7 s7 on s7.observable_item_id = oir.observable_item_id 
JOIN nrmn.obs_item_type_ref oitr ON oitr.obs_item_type_id = oir.obs_item_type_id 
WHERE oitr.obs_item_type_id = ANY (ARRAY[1, 2]) ORDER BY oir.observable_item_id

The SQL script takes superseding into account at the first step of the whole 7steps sequence (stage0) However, because the output from the all sequence is the observable_item_id and code, the superseding information is lost. @utas-raymondng could the observable_item_name be used in the output in place of observable_item_id?

But that doesn't explain why neither of the species show up in the template generated from the API....

utas-raymondng commented 10 months ago

I confirm the issue with the example species both missing from the template.

But running the script in the prod DB to generate the list of observable id and codes does give a record for Cheilodactylus ephippium(because it's been observed in the area) but with the code originating fromMorwong ephippium name(because it's been superseded).

The script to generate data for the template is in api/src/main/java/au/org/aodn/nrmn/restapi/data/repository/LetterCodeRepository.java

WITH stage0 AS (SELECT COALESCE(obsitem.superseded_by, obsitem.observable_item_name) AS species_name, obsitem.observable_item_id, SUM(obs.measure_value) AS abundance 
            FROM nrmn.location_ref loc 
            INNER JOIN nrmn.site_ref site_raw ON site_raw.location_id = loc.location_id 
            INNER JOIN nrmn.ep_site_list site ON site.site_code = site_raw.site_code
            INNER JOIN nrmn.survey sur ON sur.site_id = site_raw.site_id 
            INNER JOIN nrmn.survey_method surmet ON surmet.survey_id = sur.survey_id 
            INNER JOIN nrmn.observation obs ON obs.survey_method_id = surmet.survey_method_id 
            INNER JOIN nrmn.observable_item_ref obsitem ON obsitem.observable_item_id = obs.observable_item_id 
            INNER JOIN nrmn.methods_species ms ON obsitem.observable_item_id = ms.observable_item_id 
            WHERE site_raw.site_code like 'LH%' AND ms.method_id = 1
            GROUP BY obsitem.observable_item_id, species_name, common_name),
stage1 AS (SELECT *,nrmn.abbreviated_species_code(species_name, 3) AS code_len3 FROM stage0),
stage2 AS (SELECT *, ROW_NUMBER() OVER (partition by code_len3 order by abundance desc) AS code_len3_rank FROM stage1),
stage3 AS (SELECT *, CASE WHEN code_len3_rank = 1 THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 4) END code_len4 FROM stage2),
stage4 AS (SELECT *, ROW_NUMBER() OVER (partition by code_len4 order by abundance desc) AS code_len4_rank FROM stage3),
stage5 AS (SELECT *, CASE WHEN (code_len4 is NULL or code_len4_rank = 1) THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 5) END code_len5 FROM stage4),
stage6 AS (SELECT *, CAST(ROW_NUMBER() over (partition by code_len5 order by abundance desc) AS INT) AS code_len5_rank FROM stage5),
stage7 AS (SELECT *, CASE WHEN (code_len5 is NULL or code_len5_rank = 1) THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 5 + code_len5_rank - 1) END AS code_len_max FROM stage6)
SELECT oir.observable_item_id as observableItemId ,LOWER(COALESCE(s7.code_len_max, s7.code_len5, s7.code_len4, s7.code_len3)) AS letterCode FROM nrmn.observable_item_ref oir JOIN stage7 s7 on s7.observable_item_id = oir.observable_item_id 
JOIN nrmn.obs_item_type_ref oitr ON oitr.obs_item_type_id = oir.obs_item_type_id 
WHERE oitr.obs_item_type_id = ANY (ARRAY[1, 2]) ORDER BY oir.observable_item_id

The SQL script takes superseding into account at the first step of the whole 7steps sequence (stage0) However, because the output from the all sequence is the observable_item_id and code, the superseding information is lost. @utas-raymondng could the observable_item_name be used in the output in place of observable_item_id?

But that doesn't explain why neither of the species show up in the template generated from the API....

@bpasquer

Correct me if I am wrong, in the output template, there are few csv inside the zip. Each file (m1species, m2species, m3species) contains "Species Name" and "Common Name", so in this case the expected value for "Species Name" should be the supersededby name if exist?

If above is correct then the problem is with getSpeciesAttributesByIds where it always return species name


        HashMap<Long, String> letterCodeMap = new HashMap<Long, String>();
        List<LetterCodeMapping> letterCodeMappings = letterCodeRepository.getForMethodWithSiteIds(mode, siteIds);
        letterCodeMappings.forEach(m -> letterCodeMap.put(Long.valueOf(m.getObservableItemId()), m.getLetterCode()));

        List<ObservableItemRow> observableItemRows = observableItemRepository.getAllWithMethodForSites(mode, siteIds);

        var observableItemIds = observableItemRows.stream().mapToInt(ObservableItemRow::getObservableItemId).toArray();
        List<SpeciesWithAttributesCsvRow> species = observationRepository.getSpeciesAttributesByIds(observableItemIds)
``

bpasquer commented 10 months ago

Correct me if I am wrong, in the output template, there are few csv inside the zip. Each file (m1species, m2species, m3species) contains "Species Name" and "Common Name", so in this case the expected value for "Species Name" should be the supersededby name if exist?

That's correct. Then the superseding issue would be resolved if the SQL query returned the superseded ID. But that doesn't explains why neither of the species appear in the file. I'll continue looking at it.

utas-raymondng commented 10 months ago

Correct me if I am wrong, in the output template, there are few csv inside the zip. Each file (m1species, m2species, m3species) contains "Species Name" and "Common Name", so in this case the expected value for "Species Name" should be the supersededby name if exist?

That's correct. Then the superseding issue would be resolved if the SQL query returned the superseded ID. But that doesn't explains why neither of the species appear in the file. I'll continue looking at it.

What I mean is the code just return the observationItemId and the last line observationRepository.getSpeciesAttributesByIds(observableItemIds) lookup the Species Name by default Not the supersed by name event it exist, if this function lookup correctly then you should see value correct.

bpasquer commented 10 months ago

OK, but actually, the problem originates in the SQL query, because the final select relies on the observable_item_id instead of the superseded_by name.

SELECT oir.observable_item_id as observableItemId ,LOWER(COALESCE(s7.code_len_max, s7.code_len5, s7.code_len4, s7.code_len3)) AS letterCode FROM nrmn.observable_item_ref oir JOIN stage7 s7 on s7.observable_item_id = oir.observable_item_id

So I guess if we get that query right we should be good

bpasquer commented 9 months ago

I have reworked the query so that it only outputs current species letter codes. The solution is to group on the current species instead of the observable_item id on stage 0 and to join on species_name in stage1.

with stage0 as(SELECT COALESCE(obsitem.superseded_by, obsitem.observable_item_name) AS species_name, sum(obs.measure_value) as abundance
            FROM nrmn.location_ref loc
            INNER JOIN nrmn.site_ref site_raw ON site_raw.location_id = loc.location_id
            INNER JOIN nrmn.survey sur ON sur.site_id = site_raw.site_id
            INNER JOIN nrmn.survey_method surmet ON surmet.survey_id = sur.survey_id
            INNER JOIN nrmn.observation obs ON obs.survey_method_id = surmet.survey_method_id
            INNER JOIN nrmn.observable_item_ref obsitem ON obsitem.observable_item_id = obs.observable_item_id
            INNER JOIN nrmn.methods_species ms ON obsitem.observable_item_id = ms.observable_item_id
            WHERE site_raw.site_code like 'LH%' and ms.method_id=1 group by species_name),
stage1 AS (SELECT *,nrmn.abbreviated_species_code(species_name, 3) AS code_len3 FROM stage0 join nrmn.observable_item_ref oir on oir.observable_item_name=stage0.species_name),
stage2 AS (SELECT *, ROW_NUMBER() OVER (partition by code_len3 order by abundance desc) AS code_len3_rank FROM stage1),
stage3 AS (SELECT *, CASE WHEN code_len3_rank = 1 THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 4) END code_len4 FROM stage2),
stage4 AS (SELECT *, ROW_NUMBER() OVER (partition by code_len4 order by abundance desc) AS code_len4_rank FROM stage3),
stage5 AS (SELECT *, CASE WHEN (code_len4 is NULL or code_len4_rank = 1) THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 5) END code_len5 FROM stage4),
stage6 AS (SELECT *, CAST(ROW_NUMBER() over (partition by code_len5 order by abundance desc) AS INT) AS code_len5_rank FROM stage5),
stage7 AS (SELECT *, CASE WHEN (code_len5 is NULL or code_len5_rank = 1) THEN NULL ELSE nrmn.abbreviated_species_code(species_name, 5 + code_len5_rank - 1) END AS code_len_max FROM stage6)
SELECT oir.observable_item_id as observableItemId ,LOWER(COALESCE(s7.code_len_max, s7.code_len5, s7.code_len4, s7.code_len3)) AS letterCode
FROM nrmn.observable_item_ref oir
JOIN stage7 s7 on s7.observable_item_id = oir.observable_item_id
JOIN nrmn.obs_item_type_ref oitr ON oitr.obs_item_type_id = oir.obs_item_type_id
WHERE oitr.obs_item_type_id = ANY (ARRAY[1, 2]) ORDER BY letterCode

utas-raymondng commented 9 months ago

Ticket raised https://github.com/aodn/backlog/issues/5267

aodn / nrmn-application

Template species list missing superseded species names but also missing their new species name #1333