aodn / nrmn-application

A web application for collation, validation, and storage of all data obtained during surveys conducted by the NRMN
GNU General Public License v3.0

Correction of historical data - risk of metadata loss #1182

Closed bpasquer closed 1 year ago

bpasquer commented 1 year ago

Historical observations sometimes have attributes attached. These attributes are stored in the jsonb field observation_attribute:

select jsonb_object_keys("observation_attribute"), count(*)
from observation
group by jsonb_object_keys("observation_attribute")
jsonb_object_keys count
Notes 4217
Biomass 1137406
SizeRaw 15578
DescriptiveName 80791
SizeClassRaw 26842
LegalSize 150
SizeEstimated 15596
SpeciesSex 32058
SimulatedAbsence 300
SizeClassEstimated 26864

Similarly, some metadata are stored in the survey_method table:

select jsonb_object_keys("survey_method_attribute"), count(*)
from survey_method
group by jsonb_object_keys("survey_method_attribute")
jsonb_object_keys count
NonStandardData 255
LegacyMethod 1808

There is a risk of metadata loss when historical data are corrected. How can we minimize the risk?

LizziOh commented 1 year ago

@atcooper1 we can see the details of these observation traits with this query (change 'Notes' to the other attributes to see their details):

select distinct(observation_attribute->'Notes')
from nrmn.observation
where observation_attribute->'Notes' is not null
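To avoid editing the key by hand for each attribute, the same inspection can be done for all keys in one pass. This is only a minimal sketch: it assumes the nrmn.observation table and observation_attribute column named in the query above, and that every non-null value in that column is a jsonb object. The output column names are made up for illustration.

```sql
-- List every distinct (key, value) pair stored in observation_attribute,
-- with the number of observations carrying it.
-- Assumption: observation_attribute is always a jsonb object or NULL;
-- jsonb_each_text raises an error on arrays or scalars.
select attr.key   as attribute_name,
       attr.value as attribute_value,
       count(*)   as n_observations
from nrmn.observation o
cross join lateral jsonb_each_text(o.observation_attribute) as attr(key, value)
group by attr.key, attr.value
order by attr.key, n_observations desc;
```

Rows where observation_attribute is NULL contribute nothing, because jsonb_each_text returns no rows for a NULL input.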

Most attributes relate to fine-scale sizes, whether the size of the animal (e.g. lobster) was estimated, and its sex (also used for wrasse historically). However, some seem to be duplicated: I cannot tell the difference between 'SizeRaw' and 'SizeClassRaw', for example. Perhaps they are the same thing from the 2 historical databases? 'Notes' mostly refer to RLS species ID changes, and I'm unsure of their value. 'Biomass' may be redundant, as this is now calculated on data-out based on the stored a's and b's??? 'DescriptiveName' is a mystery to me and I'll have to look into it further.

Moving forward, the metrics related to lobster/abalone raw sizes, sex and estimates could be moved to a legacy endpoint, e.g. incorporated into ep_lobster_haliotis, and no longer attached to observation_ids (but still attached to survey_ids etc.).

LizziOh commented 1 year ago

It has been confirmed that losing these historical data from the database is acceptable, and that legacy extracts of them can be generated and made available externally. Record of this discussion: RE_ Database corrections mechanism - potential loss of observation-level metadata.pdf

File generated to demonstrate and discuss different examples of the metadata at risk: Observation attributes json.xlsx

bpasquer commented 1 year ago

A version 1 extract of

were shared, and Lizzie's and Toni's (in italics) feedback was as follows:

Thank you for the extracts. For the basic format I think that we should include survey_date, depth, site_code, site_date and diver as columns.

For the legacy_observation_attributes, I think they require some cleaning.

Biomass should potentially be excluded as it is incomplete – what do you think Toni? This would allow the extract to be much shorter and focussed on notes, animal sex, and high res size categories.

*Agreed, get rid of biomass.*

The various columns for sizes should be merged into a meaningful variable. Currently they are very convoluted and confusing, and in some years will mean nothing to whoever is looking at the data without proper metadata descriptions.

  • Size estimated and size class estimated should be merged as a single yes/no field.
  • Sizeraw is a value in mm relating to invertebrate observations and should be kept as is.
  • Size class estimated is a weird mix of measurements in inches for fish and cephalopods, but measurements in mm for invertebrates. The fish ones should be deleted and disregarded. The invertebrate measurements should be kept and merged with “Sizeraw”.
  • “Simulated absence” is a no-species-found record. These should be deleted, as this is not preserved/re-ingested upon corrections??
  • Notes should stay as is, but may need some review / discussion with Toni as to whether to keep or separate the automated RLS ones referring to species and size corrections. Looking at the RLS notes, we think it’s probably best to remove these from the legacy data altogether. They are pretty meaningless now that Species ID has changed in the NRMN DB, and we really don’t need to know that diver initials were updated with accents! Suggest removing all Notes %Added # from SpeciesID%, %SpeciesID Changed%, %Diver reference data%, and %Size data removed%.
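Two of the rules above (dropping simulated absences and removing the automated RLS notes) could be sketched as follows. This is not the query that was actually run. The sample rows are hypothetical, and the `#` in the first pattern is assumed to stand for "any number", so it is written as a `%` wildcard here. The size merging is left out because it needs to distinguish fish from invertebrates, which the thread does not show how to do.

```sql
-- Hypothetical sample rows standing in for nrmn.observation.
with observation(observation_id, observation_attribute) as (
  values
    (1, '{"Notes": "SpeciesID Changed from 12 to 34"}'::jsonb),
    (2, '{"Notes": "Female with eggs", "SizeRaw": 95}'::jsonb),
    (3, '{"SimulatedAbsence": true}'::jsonb),
    (4, '{"Notes": "Size data removed", "SpeciesSex": "M"}'::jsonb)
),
cleaned as (
  select observation_id,
         case
           -- Automated RLS notes: drop the Notes key, keep the other attributes.
           -- Assumption: '#' in the suggested pattern means any number.
           when observation_attribute->>'Notes' like any (array[
                  '%Added % from SpeciesID%',
                  '%SpeciesID Changed%',
                  '%Diver reference data%',
                  '%Size data removed%'])
           then observation_attribute - 'Notes'
           else observation_attribute
         end as attrs
  from observation
  -- Simulated absences are excluded entirely.
  where not (observation_attribute ? 'SimulatedAbsence')
)
select observation_id, attrs
from cleaned
where attrs <> '{}'::jsonb   -- nothing left to archive for this observation
order by observation_id;
```

With these sample rows only observations 2 and 4 remain: observation 2 is unchanged, and observation 4 keeps SpeciesSex but loses its automated note.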

For the survey_method attributes, I feel it is better to convert these to survey_notes stored at the survey level, as this is where they are relevant (and they seem to be repeated for every element of the survey-method in the survey_ids anyway). This way the extract is redundant, and we probably don’t need it at all. There are only 5 types of non-standard data attributes, all applicable at the survey level:

  • Site sampled due to oil spill
  • Method error
  • Poor visibility
  • Additional data
  • Carried out on seagrass bed

The “Method error” attributes are for surveys from one day and seem to be explained in the survey notes anyway, but those notes can still be appended with “method error”. NB. None of these should override existing survey_notes; they should be appended to them. Poor visibility seems to be subjective and probably inconsistent, so I would probably exclude these unless the visibility value is missing for that survey.
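A minimal sketch of the "append, don't override" rule, for discussion only. The two CTEs are hypothetical stand-ins for the survey and survey_method tables, their column names are assumptions, and NonStandardData is assumed to hold the description text itself. The special handling of "Poor visibility" is not covered.

```sql
-- Hypothetical sample rows; real table and column names may differ.
with survey(survey_id, survey_notes) as (
  values
    (101, 'Strong surge'),
    (102, null),
    (103, 'Method error explained on datasheet')
),
survey_method(survey_id, survey_method_attribute) as (
  values
    (101, '{"NonStandardData": "Carried out on seagrass bed"}'::jsonb),
    (101, '{"NonStandardData": "Carried out on seagrass bed"}'::jsonb),
    (102, '{"NonStandardData": "Additional data"}'::jsonb),
    (103, '{"LegacyMethod": 8}'::jsonb)
),
flags as (
  -- One row per survey and distinct value, since the attribute is
  -- repeated for every survey_method row of the same survey.
  select distinct survey_id,
         survey_method_attribute->>'NonStandardData' as flag
  from survey_method
  where survey_method_attribute ? 'NonStandardData'
)
select s.survey_id,
       -- concat_ws skips NULLs, so existing notes are kept and the
       -- attribute text is appended after them rather than replacing them.
       concat_ws('; ',
                 nullif(s.survey_notes, ''),
                 string_agg(f.flag, '; ' order by f.flag)) as survey_notes
from survey s
left join flags f using (survey_id)
group by s.survey_id, s.survey_notes
order by s.survey_id;
```

With the sample rows, survey 101 keeps 'Strong surge' and gains the seagrass text, survey 102 gets 'Additional data' as its only note, and survey 103 is unchanged. A survey with neither notes nor attributes would come out as an empty string rather than NULL, which may need a nullif around the result.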

For the legacy method, the two attribute values are 8 and 9. These stand for the different legacy methods described below. These data were converted to the standardised fish blocks when other ATRC data were merged and block abundances were simulated. They could be added as survey note descriptions as well (e.g. “Legacy method 8: Parallel fish survey 50 x 10 m blocks over 4 surveys”). This may be useful if ever checking the original survey sheets, but the downside is that the survey notes might confuse people. Do you have an opinion, Toni? I tend to think just delete them.

| 8 | Parallel fish survey 50 x 10 m blocks | e.g. Jervis Bay 2007; 4 transects: 1,2,3,4 | Total: each fish species; Size: all species, estimated in inch size categories; Sex: when available |
| 9 | Parallel fish survey 100 x 5 m blocks | e.g. Batemans Bay 2005, 2006; 4 transects: 1,2,3,4 | Total: each fish species; Size: all species, estimated in inch size categories; Sex: when available |
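If the legacy methods were kept as survey note text rather than deleted, the mapping from the table above could look something like this. It is only a sketch: the CTE rows and the survey_id column are hypothetical, and the note wording is just the table description prefixed with the method number.

```sql
-- Hypothetical sample rows standing in for survey_method.
with survey_method(survey_id, survey_method_attribute) as (
  values
    (201, '{"LegacyMethod": 8}'::jsonb),
    (202, '{"LegacyMethod": 9}'::jsonb),
    (203, '{"NonStandardData": "Additional data"}'::jsonb)
)
select distinct survey_id,
       -- ->> returns text, so this works whether the value is stored
       -- as the number 8 or the string "8".
       case survey_method_attribute->>'LegacyMethod'
         when '8' then 'Legacy method 8: Parallel fish survey 50 x 10 m blocks'
         when '9' then 'Legacy method 9: Parallel fish survey 100 x 5 m blocks'
       end as legacy_note
from survey_method
where survey_method_attribute ? 'LegacyMethod'
order by survey_id;
```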

Thanks team,

bpasquer commented 1 year ago

The following query was used to clean and extract the observation attributes (I will share the resulting attribute list in an email).

The query:

bpasquer commented 1 year ago

Copy of email from Lizzi, 6 Apr 2023, for the record:

Thanks for sending this through. Initially I was a little confused by the 4 different columns of size outputs, looking at the smallest “sizes” of “2” mapping to a measure name of “0.5cm”, but I now see that the size is in millimetres and the measure name is the closest size bin in centimetres (and some of these smallest ones are clearly errors). I think it would be good to put that in the column headers and change the column names and order to be more user friendly, since we won’t be accessing them very often so there’s a high chance of forgetting how they are put together. Could the size columns be: size_raw (mm) and size_class (cm) (instead of measure name)? I also think that, for anyone using the file, the measure_id and measure_value columns just add confusion. Unless you can see a reason to leave them in that I’m missing, I think we should remove them. So, for the last columns in the file (columns 10 – 17), could they be: observable_item_name, description, size_class (cm), size_raw (mm), legal_size, size_estimated, species_sex, notes?

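| observable_item_name | measure_id | measure_name | measure_value | DescriptiveName | LegalSize | SpeciesSex | size | estimated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Petricia vernicina | 53 | 0.5cm | 1 | | | | 2 | No |
| Plagusia chabrus | 53 | 0.5cm | 1 | | | | 2 | No |
| Paguridae spp. | 53 | 0.5cm | 1 | | | | 2 | No |
| Carcinus maenas | 53 | 0.5cm | 2 | | | | 2 | Yes |
| Jasus edwardsii | 53 | 0.5cm | 2 | | | | 2 | Yes |
| Noumea closei | 53 | 0.5cm | 2 | | | | 2 | No |
| Noumea closei | 53 | 0.5cm | 2 | | | | 2 | No |
| Haliotis brazieri | 53 | 0.5cm | 95 | | | | 2 | Yes |
| Tosia australis | 53 | 0.5cm | 1 | | | | 3 | No |
| Dicathais orbita | 53 | 0.5cm | 1 | | | | 3 | No |
| Paguristes frontalis | 53 | 0.5cm | 1 | | | | 3 | No |
| Tosia australis | 53 | 0.5cm | 1 | | | | 4 | No |
| Phasianella australis | 53 | 0.5cm | 4 | | | | 4 | No |
| Aplysia gigantea | 53 | 0.5cm | 1 | | | | 4 | No |
| Haliotis roei | 53 | 0.5cm | 1 | | | | 4 | No |
| Pleuroploca australasia | 53 | 0.5cm | 4 | | | | 4 | No |
| Haliotis scalaris | 53 | 0.5cm | 1 | | | | 5 | No |
| Meridiastra calcar | 53 | 0.5cm | 1 | | | | 5 | No |
| Tosia australis | 53 | 0.5cm | 1 | | | | 6 | No |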

I also think it would be a good idea to have 2 files, one for the original size measurements as described above and then the rest of the “notes” and “descriptive names” (that do not have size info) in another file. Thanks again for all your work on this! Where will they be archived that we can access them?
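The column renaming and the split into two files requested in this email could be sketched as below. The three sample rows are copied from the extract in the email. Everything else is an assumption for illustration: the CTE name, the snake_case input column names, the notes column (not shown in the extract header), and the mapping of DescriptiveName to description, measure_name to size_class (cm) and size to size_raw (mm).

```sql
-- Sample rows copied from the version 1 extract; the CTE name and the
-- old-to-new column mapping are assumptions.
with extract_v1(observable_item_name, measure_id, measure_name, measure_value,
                descriptive_name, legal_size, species_sex, size, estimated, notes) as (
  values
    ('Jasus edwardsii',   53, '0.5cm', 2,  null::text, null::text, null::text, 2, 'Yes', null::text),
    ('Haliotis brazieri', 53, '0.5cm', 95, null,       null,       null,       2, 'Yes', null),
    ('Tosia australis',   53, '0.5cm', 1,  null,       null,       null,       3, 'No',  null)
)
-- File 1: original size measurements, with units in the column headers.
-- measure_id and measure_value are dropped as suggested.
select observable_item_name,
       descriptive_name as description,
       measure_name     as "size_class (cm)",
       size             as "size_raw (mm)",
       legal_size,
       estimated        as size_estimated,
       species_sex,
       notes
from extract_v1
where size is not null
order by size, observable_item_name;
```

The second file (notes and descriptive names without size info) would be the complementary selection, i.e. the same source with `where size is null`.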

bpasquer commented 1 year ago

For the record, this is the last feedback:

These files look great to me, thank you for getting that done 😊! My only suggestion would be to rename the ‘full_name’ column as ‘diver’ for clarity/consistency with our other endpoints.

bpasquer commented 1 year ago

The following files have been archived under archive/IMOS/NRMN: