gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Add interpretation for Measurement Or Facts extension #473

Open muttcg opened 3 years ago

muttcg commented 3 years ago

As part of the VertNet feature we need to interpret fields:

And add them into index and hdfs schemas

Use VertNet feature branch

MattBlissett commented 3 years ago

Interpretation and querying of length and mass

The preferred way to share these values will be through the MeasurementOrFact extension, and this should be implemented, and then the VertNet way (extract from dynamicProperties) routed to it. (There are around 9 million occurrences with a MeasurementOrFact extension.)

We will need a vocabulary and parser for mof:measurementType (to begin with, with at least the values required for VertNet), mof:measurementUnit (grams and metres? kilograms and metres? g and mm? Whichever, we'll need to handle a wide range of decimal values), mof:measurementValue (interpreted according to the unit).

If the MeasurementOrFact extension is not present, we can look into dwc:dynamicProperties for data.
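A minimal sketch of that fallback, for illustration only. It assumes dynamicProperties holds either a JSON object or free-text pairs such as "total length=120 mm; weight=23 g"; the record layout, the `measurementOrFact` key and both function names are hypothetical and are not taken from the pipelines code base.

```python
import json
import re

# Assumed free-text shape: "<type>=<number> <unit>" (":" is also accepted as separator).
PAIR = re.compile(r"\s*([^=:;]+?)\s*[=:]\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def mof_from_dynamic_properties(raw):
    """Parse dwc:dynamicProperties into measurementType/Value/Unit rows."""
    if not raw:
        return []
    try:
        data = json.loads(raw)
        # Only a JSON object is treated as key/value data; other JSON is ignored.
        items = [f"{k}={v}" for k, v in data.items()] if isinstance(data, dict) else []
    except ValueError:
        # Not JSON: fall back to the assumed ";"-separated free-text form.
        items = raw.split(";")
    rows = []
    for item in items:
        match = PAIR.match(item)
        if match:
            rows.append({
                "measurementType": match.group(1),
                "measurementValue": match.group(2),  # kept as a raw string
                "measurementUnit": match.group(3) or None,
            })
    return rows


def measurements(record):
    """Prefer rows from the extension; otherwise derive them from dynamicProperties."""
    return record.get("measurementOrFact") or mof_from_dynamic_properties(
        record.get("dynamicProperties")
    )
```

Items that do not match the assumed pattern are skipped rather than guessed at, so the raw dynamicProperties string would still need to be kept alongside.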

This is a larger task. We'll need an additional extension in DWCA downloads, and a way to specify query parameters/predicates using the extension (e.g. "parameter":"MEASUREMENT_OR_FACT:MEASUREMENT_TYPE" or ...search?MeasurementOrFact:MeasurementType=LENGTH, TBD).

We need to decide what to do with values we can't parse (e.g. we can convert "5 inches" to cm/mm even if we don't know what it's measuring, and unknown measurement types probably still deserve to be shown on an occurrence page).
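One possible shape for this, as a sketch rather than a proposal for the actual parser: convert whatever has a recognisable number and unit, and keep the verbatim row either way. The unit table, the target unit (millimetres) and the `interpretedValueMm` field name are assumptions made for the example.

```python
from decimal import Decimal, InvalidOperation

# Illustrative conversion factors to millimetres; not a GBIF vocabulary.
TO_MM = {
    "mm": Decimal("1"),
    "cm": Decimal("10"),
    "m": Decimal("1000"),
    "metres": Decimal("1000"),
    "in": Decimal("25.4"),
    "inches": Decimal("25.4"),
}


def to_millimetres(value, unit):
    """Return the value in mm as a Decimal, or None if value or unit is not understood."""
    factor = TO_MM.get((unit or "").strip().lower())
    try:
        number = Decimal(str(value).strip())
    except InvalidOperation:
        return None
    if factor is None or not number.is_finite():
        return None
    return number * factor


def interpret(row):
    """Keep every verbatim field and add the converted value (None when unparseable)."""
    converted = to_millimetres(row.get("measurementValue"), row.get("measurementUnit"))
    return {**row, "interpretedValueMm": converted}
```

Note that the conversion never looks at measurementType, so "5 inches" is converted even when the type is unknown, and a row that cannot be converted is still returned with its verbatim fields for display.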

PR #477 already adds some support, including retrieving values from dynamicProperties and routing them to a MeasurementOrFact extension, but we need to decide on the API and general technical approach to interpreting additional extensions before implementing any more than this:

  1. length and hasLength (parse DwcTerm.dynamicProperties → convert values to MeasurementAndFacts extension → add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementValue, DwcTerm.measurementUnit) array into index/Avro):

A length query (non-API) can then be done with MeasurementAndFacts.measurementType IN ("total length", "head-body length", "fork length", "standard length", "snout-vent length") AND MeasurementAndFacts.measurementValue == QUERY VALUE

A hasLength query is the same, but without the value.

  2. mass and hasWeight (parse DwcTerm.dynamicProperties → convert values to MeasurementAndFacts extension → add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementValue, DwcTerm.measurementUnit) array into index/Avro):

The query is similar, but uses the type "total weight".
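The two queries above can be sketched over in-memory records as follows. This only illustrates the logic; the `measurementOrFacts` key and the function names are hypothetical, and the real query would run against the index rather than Python dicts.

```python
# Measurement types taken from the length and mass queries described above.
LENGTH_TYPES = {
    "total length",
    "head-body length",
    "fork length",
    "standard length",
    "snout-vent length",
}
MASS_TYPES = {"total weight"}


def has_measurement(record, types):
    """hasLength / hasWeight: at least one row has one of the given types."""
    return any(
        row.get("measurementType") in types
        for row in record.get("measurementOrFacts", [])
    )


def matches_value(record, types, value):
    """length / mass: the type and the value must match on the same row."""
    return any(
        row.get("measurementType") in types and row.get("measurementValue") == value
        for row in record.get("measurementOrFacts", [])
    )


record = {
    "measurementOrFacts": [
        {"measurementType": "total length", "measurementValue": "120", "measurementUnit": "mm"},
    ]
}
print(has_measurement(record, LENGTH_TYPES))       # hasLength
print(matches_value(record, LENGTH_TYPES, "120"))  # length == 120
print(has_measurement(record, MASS_TYPES))         # hasWeight
```

Because the values are still raw strings at this stage, the equality test is a string comparison; a range query would first need interpreted, unit-normalised values.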

MattBlissett commented 3 years ago

Querying using these parameters would get quite complicated:

| gbifid | measurementType | measurementValue | measurementUnit |
| --- | --- | --- | --- |
| 1 | TotalLength | 1.4 | metres |
| 1 | LegLength | 0.4 | metres |
| 2 | TotalLength | 1.0 | metres |
| 2 | LegLength | 0.4 | metres |

It's not obvious how to query for things with TotalLength>1.2m, LegLength<0.6m etc. MeasurementOrFact:MeasurementType=LENGTH doesn't work; MeasurementOrFact:TotalLength=1.2 might.
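To make the difficulty concrete, here is a small sketch using the four example rows. It contrasts a naive match on flattened type and value fields with a match that requires type and value to agree on the same row. The function names and the predicate format are invented for the illustration and say nothing about how the search API would expose this.

```python
ROWS = [
    (1, "TotalLength", 1.4, "metres"),
    (1, "LegLength", 0.4, "metres"),
    (2, "TotalLength", 1.0, "metres"),
    (2, "LegLength", 0.4, "metres"),
]


def group_by_id(rows):
    """Collect (measurementType, measurementValue) pairs per gbifid."""
    grouped = {}
    for gbifid, mtype, value, _unit in rows:
        grouped.setdefault(gbifid, []).append((mtype, value))
    return grouped


def naive_ids(rows, mtype, test):
    """Flattened match: the type and the value may come from different rows."""
    return sorted(
        gid
        for gid, ms in group_by_id(rows).items()
        if any(t == mtype for t, _ in ms) and any(test(v) for _, v in ms)
    )


def matching_ids(rows, predicates):
    """Row-wise match: for every requested type, one row must satisfy its test."""
    return sorted(
        gid
        for gid, ms in group_by_id(rows).items()
        if all(
            any(t == mtype and test(v) for t, v in ms)
            for mtype, test in predicates.items()
        )
    )


# TotalLength > 1.2 AND LegLength < 0.6: only occurrence 1 qualifies.
print(matching_ids(ROWS, {"TotalLength": lambda v: v > 1.2,
                          "LegLength": lambda v: v < 0.6}))
# TotalLength < 0.6: no occurrence qualifies, but the flattened match
# returns both because each also has a LegLength of 0.4.
print(naive_ids(ROWS, "TotalLength", lambda v: v < 0.6))
print(matching_ids(ROWS, {"TotalLength": lambda v: v < 0.6}))
```

Whatever parameter syntax is chosen, it has to carry the pairing of type and value through to the index, which a single MeasurementType parameter plus a separate value parameter does not do.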

tucotuco commented 3 years ago

> If the MeasurementOrFact extension is not present, we can look into dwc:dynamicProperties for data.

I think you need to look into dwc:dynamicProperties (and occurrenceRemarks, and fieldNotes) even if there is a MeasurementOrFact extension. The MeasurementOrFact extension might be included for reasons other than sharing the kind of measurements we parsed for VertNet.

It may be of interest that the parser implemented for VertNet has been expanded greatly to extract a much broader set of trait data under the FuTRES project. Though VertNet does not have those capabilities in its production code base, it may be of great interest to pursue these broader trait extraction capabilities.

muttcg commented 3 years ago

Thanks, @tucotuco @MattBlissett. Actually, the current version I made uses both the MeasurementOrFact extension and dwc:dynamicProperties: values parsed from dwc:dynamicProperties are rerouted to the MeasurementOrFact extension. This part is done, but I don't yet interpret the values in the MeasurementOrFact extension afterwards; as a first step I just store the raw values.

tucotuco commented 3 years ago

I'm looking through a broad set of issues around the extraction of traits. What is the current status of accessing records with extracted traits in data downloads, snapshots, search API or gbif.org?

timrobertson100 commented 3 years ago

They are shown verbatim on occurrence records only. Search can be done for records having measurementsOrFacts. They aren't included in downloads at the moment.

tucotuco commented 3 years ago

Thanks @timrobertson100 .