Correction of historical data -risk of metadata loss

bpasquer commented 1 year ago

Historical observation have sometimes attributes attached. These attributes are stored in the jsonb fields observation_attribute:

select jsonb_object_keys("observation_attribute"), count(*)
from observation
group by jsonb_object_keys("observation_attribute")

jsonb_object_keys	count
Notes	4217
Biomass	1137406
SizeRaw	15578
DescriptiveName	80791
SizeClassRaw	26842
LegalSize	150
SizeEstimated	15596
SpeciesSex	32058
SimulatedAbsence	300
SizeClassEstimated	26864

Similarly, some metadata are stored in the survey_method table:

select 'observation', jsonb_object_keys("survey_method_attribute"), count(*)
from survey_method
group by jsonb_object_keys("survey_method_attribute")

jsonb_object_keys	count
NonStandardData	255
LegacyMethod	1808

There is a risk of metadata loss when historical data are corrected. How can we minimize the risk?

LizziOh commented 1 year ago

@atcooper1 we can see the details of these observation traits with this code (change 'Notes' to the other attributes to see details): select distinct(observation_attribute->'Notes') from nrmn.observation where observation_attribute->'Notes' is not null

Most attributes relate to fine scale sizes, whether the animal (eg. lobster) was estimated for size, and its sex (also used for wrasse historically). However these seem to be duplicated, eg. 'SizeRaw' and 'SizeClassRaw' where I cannot tell the difference, perhaps the same thing from the 2 historical databases? 'Notes' mostly refer to RLS species ID changes - I'm unsure the value of these. 'Biomass' may be redundant - as this is now calculated on data-out based on the stored a's and b's??? 'DescriptiveName' is a mystery to me and I'll have to look into it further.

Moving forward the metrics related to lobster/abalone raw sizes, sex and estimates could be moved to a legacy endpoint, eg incorporated into the ep_lobster_haliotis and no longer attached to observation_ids (but still attached to survey_ids etc.).

LizziOh commented 1 year ago

There has been confirmation that loss of these historical data is acceptable in the database and legacy data of these can be generated and available externally. Record of this disscussion: RE_ Database corrections mechanism - potential loss of observation-level metadata.pdf

File generated to demonstrate and discuss different examples of the metadata at risk: Observation attributes json.xlsx

bpasquer commented 1 year ago

A version1 extract of

legacy_observation_attributes
legacy_survey_method_attributes

were shared, and Lizzie's and Toni (in italic) feedbacks were as follows:

Thank you for the extracts. For the basic format I think that we should include survey_date, depth, site_code, site_date and diver as columns.

For the legacy_observation_attributes, I think they require some cleaning.

Biomass should potentially be excluded as it is incomplete – what do you think Toni? This would allow the extract to be much shorter and focussed on notes, animal sex, and high res size categories. Agreed, get rid of biomass The various columns for sizes should be merged into a meaningful variable. Currently they are very convoluted and confusing and in some years will mean nothing to whomever is looking at the data without proper metadata descriptions.

Size estimated and size class estimated should be merged as a singular yes/no field

Sizeraw is a value in mm relating to invertebrate observations and should be kept as is.

Size class estimated is a weird mix of measurements in inches for fish and cephalopods, but measurements in mm for invertebrates. The fish ones should be deleted and disregarded. The invertebrate measurements should be kept and merged with “Sizeraw” “simulated absence” is a no species found record. These should be deleted as is this not preserved/re-ingested upon corrections??

Notes should stay as is, but may need some review / discussion with Toni as to whether to keep or separate the automated RLS ones referring to species and size corrections. Looking at the RLS notes, we think it’s probably best to remove these from the legacy data all together. They are pretty meaningless now that Species ID has changed in the NRMN DB, and we really don’t need to know that diver initials were updated with accents! Suggest removing all Notes %Added # from SpeciesID%, %SpeciesID Changed%, %Diver reference data%, and %Size data removed%.

For the survey_method attributes, I feel like it is better to convert these to survey_notes stored at the survey level as this is where they are relevant (and seem to be repeated for every element of the survey-method in the survey_id’s anyway. This way the extract is redundant, and we probably don’t need it at all. There are only 5 types of non-standard data attributes, all applicable at the survey level:

Site sampled due to oil spill

Method error

Poor visibility

Additional data

Carried out on seagrass bed

“Method error” are for surveys from one day and seem to be explained in the survey notes anyway - but those notes can still be appended with “method error”. NB. All of these should not override existing survey_notes but be appended to them. Poor visibility seems to be subjective and probably inconsistent so I would probably exclude these unless the visibility value is missing for that survey.

For the legacy method the two attribute values are 8 and 9. These stand for the different legacy methods described below. These data were converted to the standardised fish blocks when other ATRC data were merged and block abundances were simulated. They could be added as survey note descriptions as well (eg. “Legacy method 8: Parallel fish survey 50 x 10 m blocks over 4 surveys”) - this may be useful if ever checking the original survey sheets, but the downside is that the survey notes might confuse people. Do you have an opinion, Toni? I tend to think just delete them.

| 8 | Parallel fish survey 50 x 10 m blocks | e.g. Jervis Bay 2007 4 transects: 1,2,3,4 | Total: each fish species Size: all species; estimated in inch size categories Sex: when available | | 9 | Parallel fish survey 100 x 5 m blocks | e.g. Batemans Bay 2005, 2006 4 transects: 1,2,3,4 | Total: each fish species Size: all species; estimated in inch size categories Sex: when available |

Thanks team,

bpasquer commented 1 year ago

The following query was used to clean and extract the observation attributes( I will share the resulting attribute list in an email)

The query :

excludes "Biomass", "SimulatedAbscence" from the extract
merges "SizeEstimated" and "SizeClassEstimated" into one 'Yes/No' field "Estimated".
merges "SizeRaw" and "SizeClassRaw" into one "Size" field
excludes Fish(fishes are excluded based on their Phylum 'Chordata') and Cephalopods records of "SizeRaw"/"SizeClassRaw"

removes Notes like "%Added # from SpeciesID%","%SpeciesID Changed%", "%Diver reference %", and "%Size data removed%".


SELECT sm.survey_id,
   survey_date,
   depth,
   site_code,
   site_name,
   full_name AS diver,
   block_num,
   method_id,
   obs.observable_item_id,
   observable_item_name,
   obs.measure_id,
   measure_name,
   obs.measure_value,
   observation_attribute::JSONB->>'DescriptiveName' AS "DescriptiveName",
   observation_attribute::JSONB->>'LegalSize' AS "LegalSize",
   observation_attribute::JSONB->>'SpeciesSex' AS "SpeciesSex",
   concat(observation_attribute::JSONB->>'SizeRaw', observation_attribute::JSONB->>'SizeClassRaw') AS size,
   CASE
       WHEN concat(observation_attribute::JSONB->>'SizeEstimated', observation_attribute::JSONB->>'SizeClassEstimated')='NoNo' THEN 'No'
       WHEN concat(observation_attribute::JSONB->>'SizeEstimated', observation_attribute::JSONB->>'SizeClassEstimated')='YesYes' THEN 'Yes'
       ELSE concat(observation_attribute::JSONB->>'SizeEstimated', observation_attribute::JSONB->>'SizeClassEstimated')
   END AS estimated,
observation_attribute::JSONB->>'Notes' AS "Notes"
FROM observation obs
LEFT JOIN observable_item_ref oir ON oir.observable_item_id = obs.observable_item_id
LEFT JOIN measure_ref mr ON mr.measure_id = obs.measure_id
LEFT JOIN survey_method sm ON obs.survey_method_id = sm.survey_method_id
LEFT JOIN survey s ON sm.survey_id = s.survey_id
LEFT JOIN diver_ref dr ON obs.diver_id = dr.diver_id
LEFT JOIN site_ref sr ON s.site_id = sr.site_id
WHERE (observation_attribute::JSONB->>'SizeRaw' IS NOT NULL
   OR observation_attribute::JSONB->>'SizeClassRaw' IS NOT NULL
   OR observation_attribute::JSONB->>'SizeEstimated' IS NOT NULL
   OR observation_attribute::JSONB->>'SizeClassEstimated' IS NOT NULL
   OR observation_attribute::JSONB->>'DescriptiveName' IS NOT NULL
   OR observation_attribute::JSONB->>'LegalSize'IS NOT NULL
   OR observation_attribute::JSONB->>'SpeciesSex' IS NOT NULL
   OR observation_attribute::JSONB->>'Notes' IS NOT NULL
)
EXCEPT
SELECT sm.survey_id,
   survey_date,
   depth,
   site_code,
   site_name,
   full_name,
   block_num,
   method_id,
   obs.observable_item_id,
   observable_item_name,
   obs.measure_id,
   measure_name,
   obs.measure_value,
   observation_attribute::JSONB->>'DescriptiveName' AS "DescriptiveName",
   observation_attribute::JSONB->>'LegalSize' AS "LegalSize",
   observation_attribute::JSONB->>'SpeciesSex' AS "SpeciesSex",
   concat(observation_attribute::JSONB->>'SizeRaw', observation_attribute::JSONB->>'SizeClassRaw') AS "Size",
   CASE
       WHEN concat(observation_attribute::JSONB->>'SizeEstimated', observation_attribute::JSONB->>'SizeClassEstimated')='NoNo' THEN 'No'
       WHEN concat(observation_attribute::JSONB->>'SizeEstimated', observation_attribute::JSONB->>'SizeClassEstimated')='YesYes' THEN 'Yes'
       ELSE concat(observation_attribute::JSONB->>'SizeEstimated', observation_attribute::JSONB->>'SizeClassEstimated')
   END AS "Estimated",
observation_attribute::JSONB->>'Notes' AS "Notes"
FROM observation obs
LEFT JOIN observable_item_ref oir ON oir.observable_item_id = obs.observable_item_id
LEFT JOIN measure_ref mr ON mr.measure_id = obs.measure_id
LEFT JOIN survey_method sm ON obs.survey_method_id = sm.survey_method_id
LEFT JOIN survey s ON sm.survey_id = s.survey_id
LEFT JOIN diver_ref dr ON obs.diver_id = dr.diver_id
LEFT JOIN site_ref sr ON s.site_id = sr.site_id
WHERE (phylum='Chordata' and (concat(observation_attribute::JSONB->>'SizeRaw', observation_attribute::JSONB->>'SizeClassRaw') is not null))
OR  (CLASS='Cephalopoda'and (concat(observation_attribute::JSONB->>'SizeRaw', observation_attribute::JSONB->>'SizeClassRaw') is not null))
OR observation_attribute::JSONB->>'Notes' LIKE '%Added % from SpeciesID%'
OR observation_attribute::JSONB->>'Notes' LIKE '%SpeciesID Changed%'
OR observation_attribute::JSONB->>'Notes' LIKE '%Diver reference %'
OR observation_attribute::JSONB->>'Notes' LIKE '%Size data removed%'

bpasquer commented 1 year ago

Copy email from Lizzi 6 Apr 2023 for the record:

Thanks for sending this through. Initially I was a little confused at the 4 different columns of size outputs looking at the smallest “sizes” of “2” mapping to a measure name of “0.5cm” but now see that the size is in millimetres and measure name is the closest size bin in centimetres (and some of these smallest ones are clearly errors). I think it would be good to put that in the column headers and change the column names and order to be more user friendly – since we won’t be accessing them very often so there’s a high chance of forgetting how they are put together. Could the size columns be: size_raw (mm) and size_class (cm) (instead of measure name)? I also think for anyone using the file, then the measure_id and measure_value columns just add confusion. Unless you can see a reason to leave them in that I’m missing I think we should remove them. So, for the last columns in the file (columns 10 – 17), could they be: observable_item_name, description, size_class (cm), size_raw (mm), legal_size, size_estimated, species_sex, notes?

observable_item_name	measure_id	measure_name	measure_value	size	estimated
Petricia vernicina	53	0.5cm	1	2	No
Plagusia chabrus	53	0.5cm	1	2	No
Paguridae spp.	53	0.5cm	1	2	No
Carcinus maenas	53	0.5cm	2	2	Yes
Jasus edwardsii	53	0.5cm	2	2	Yes
Noumea closei	53	0.5cm	2	2	No
Noumea closei	53	0.5cm	2	2	No
Haliotis brazieri	53	0.5cm	95	2	Yes
Tosia australis	53	0.5cm	1	3	No
Dicathais orbita	53	0.5cm	1	3	No
Paguristes frontalis	53	0.5cm	1	3	No
Tosia australis	53	0.5cm	1	4	No
Phasianella australis	53	0.5cm	4	4	No
Aplysia gigantea	53	0.5cm	1	4	No
Haliotis roei	53	0.5cm	1	4	No
Pleuroploca australasia	53	0.5cm	4	4	No
Haliotis scalaris	53	0.5cm	1	5	No
Meridiastra calcar	53	0.5cm	1	5	No
Tosia australis	53	0.5cm	1	6	No

I also think it would be a good idea to have 2 files, one for the original size measurements as described above and then the rest of the “notes” and “descriptive names” (that do not have size info) in another file. Thanks again for all your work on this! Where will they be archived that we can access them?

bpasquer commented 1 year ago

For the record, this is the last feedback:

These files look great to me, thank you for getting that done 😊! My only suggestion would be to rename the ‘full_name’ column as ‘diver’ for clarity/consistency with our other endpoints.

bpasquer commented 1 year ago

The following files have been archived under archive/IMOS/NRMN:

Observation_attribute_size_information.zip which contains:
- Observation_Attribute_archive_Size_Information.csv
- Size_info_archive_query.sql
Observation_attribute_notes.zip which contains:
- Observation_Attribute_archive_Notes.csv
- Notes_archive_query.sql

aodn / nrmn-application

Correction of historical data -risk of metadata loss #1182