gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

New features requests from VertNet #272

Open tucotuco opened 4 years ago

tucotuco commented 4 years ago

@timrobertson100 asked me to post a list of data indexes we use in VertNet that are not possible yet in GBIF. The reasoning is that there are ever fewer reasons to maintain a portal based on technology other than that used by GBIF. One of the biggest remaining reasons is that we support the following search indexes, and our community values these capabilities:

isArchaeological - indicates if the record is based on zooarchaeological or archaeobotanical material. This is something that could be solved by the introduction of a new basisOfRecord value.

hasTissue - indicates that the content of the preparations field can be interpreted to infer the existence of material sample(s) that can be used for DNA sequencing.

hasSex - indicates that a value has been given for the sex field.

hasLifeStage - indicates that a value has been given for the lifeStage field.

hasLength - indicates that the Traiter code (which processes occurrenceRemarks, dynamicProperties, and fieldNotes, https://github.com/rafelafrance/traiter/network/members) was able to extract a length measurement with units.

hasWeight - indicates that the Traiter code was able to extract a weight measurement with units.

length - search for a combination of length type (with controlled vocabulary coming from Traiter), and a range for values in mm

mass - search for values in a range from values extracted by Traiter.

Other important considerations are:

1) VertNet as a community has a strong identity, and it would therefore be very useful to instantiate a portal for VertNet that is a skin and filter on all of GBIF.

2) VertNet has had the luxury to implement innovations rapidly and it would be great to be able to continue to innovate in an agile scenario where data are on GBIF infrastructure.

3) VertNet has been very active in fostering data publishing in the zooarchaeological community, and these activities are on the cusp of exploding.

4) Along with 3) VertNet has developed and maintained (and will soon propose as a TDWG standard - Task Group nearly finished with activities) the ChronometricAge Extension. It would be of great value to be able to support indexing on ChronometricDate attributes.

There are many other issues of concern, but the purpose of creating this issue is mostly to lay out the kinds of terms that the VertNet community has come to value. VertNet colleagues merit mention: @rafelafrance, @dbloom, @melecoq, @pzermoglio, @robgur, @rondlg

timrobertson100 commented 4 years ago

hasSex and hasLifeStage would be unusual (not impossible) for GBIF to add. Generally, GBIF flag things that can't be interpreted, but otherwise for fields that are interpreted, we either populate the field or nullify it (the verbatim values are always available). As you know we are working on open vocabularies for Sex and LifeStage.

Would it be acceptable to explore options where they are treated consistently with other fields, where sex and life stage are either populated or flagged as Not parsable? The benefit would be that the API remains intuitive to users.

rafelafrance commented 4 years ago

I don't know about Vertnet itself (it may have its own needs) but upstream we can update the output format pretty easily.

tucotuco commented 4 years ago

These flags are just two among many. The use of flags like these is very intuitive, whereas the proposed alternative isn't quite so. So, at this point I would say it is not the same, nor could the same results be retrieved, but will discuss further in our group.

timrobertson100 commented 4 years ago

From a different chat with @tucotuco:

The [more complete] list of "additional" index fields includes the following: 'keyname', 'haslicense', 'vntype', 'rank', 'mappable', 'hashid', 'hastypestatus', 'wascaptive', 'wasinvasive', 'hastissue', 'hasmedia', 'isfossil', 'haslength', 'haslifestage', 'hasmass', 'hassex', 'lengthinmm', 'lengthtype', 'massing', 'lengthunitsinferred', 'massunitsinferred', 'underivedlifestage', 'underivedsex', 'isarch'

tucotuco commented 4 years ago

And the core of the VertNet record-level harvest processing is here: https://github.com/VertNet/post-harvest-processor/blob/master/lib/harvest_record_processor.py#L119

muttcg commented 3 years ago

Hi all! I am mapping data to start writing an interpretation, so @tucotuco please correct me if I'm wrong:

1) isArchaeological

Original function

    def is_archaeological(rec):
        """ Check if a record represents an archaeological specimen.
        parameters:
            rec - dictionary to search (required)
        returns:
            True if the dictionary represents an archaeological specimen, otherwise False.
        """
        if rec.has_key('networks'):
            if 'arch' in rec['networks'].lower():
                return 1
        return 0
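
The function quoted above is Python 2 code: `dict.has_key` was removed in Python 3, and it returns 1/0 although the docstring promises True/False. For anyone experimenting with the logic today, a minimal Python 3 sketch of the same check might look like the following. It assumes `rec` is a plain dictionary and that a missing or empty `networks` value means "not archaeological"; the sample values in the usage note are illustrative only.

```python
def is_archaeological(rec):
    """Return True if the record's 'networks' string contains 'arch'.

    Python 3 sketch of the check quoted above; returns a real bool
    instead of 1/0.
    """
    # rec.get() replaces the Python 2 has_key() test; "or ''" also
    # covers an explicit None value.
    networks = rec.get("networks") or ""
    return "arch" in networks.lower()
```

For example, `is_archaeological({"networks": "VertNet, ZooarchNet"})` is true, while a record with no `networks` key yields false.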

I didn't find networks or arch in a record fragment. For comparison I used the original record (found with the isArchaeological filter on the VertNet portal) and the GBIF fragment.

What data from the DwC-A record do I need to use for the isArchaeological flag? The http://rs.tdwg.org/chrono/terms/ChronometricAge extension? "institutionCode": "FLARCH"? GBIF registry data?

2) hasTissue

    "extensions": {
        "http://data.ggbn.org/schemas/ggbn/terms/MaterialSample": [
            {
                "materialSampleType": "tissue"
            }
        ]
    }

3) hasSex

`http://rs.tdwg.org/dwc/terms/sex` 
or
Use [sex_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/sex_parser.py) to extract a value from the DwC-A occurrenceRemarks/dynamicProperties/fieldNotes fields

4) hasLifeStage

`http://rs.tdwg.org/dwc/terms/lifeStage`
or
Use [life_stage_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/life_stage_parser.py) to extract a value from the DwC-A occurrenceRemarks/dynamicProperties/fieldNotes fields

5) hasLength

has **length**

6) hasWeight

has **mass**

7) length

Use [total_length_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/total_length_parser.py) to extract a value from the DwC-A occurrenceRemarks/dynamicProperties/fieldNotes fields

8) mass

Use [body_mass_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/body_mass_parser.py) to extract a value from the DwC-A occurrenceRemarks/dynamicProperties/fieldNotes fields

tucotuco commented 3 years ago

@muttcg In response to the numbered list in the previous comment:

  1. "networks" is a string in the VertNet registry that contains our list of the "networks" the dataset belongs to. It contains multiple keys in the string from among these: iDigBio, VertNet, MaNIS, ORNIS, HerpNET, FishNet, ZooarchNet. GBIF isn't in there because they are all destined for GBIF. So, we look for "arch" anticipating other archaeo initiatives. Not sure how we can mimic this. The Chrono extension will not be indicative, because all of Paleo could potentially use that. We could conceivably pass you the networks value in the records we publish, but that won't be scalable as ZooarchNet is working on growing and becoming more independent. The basisOfRecord vocabulary could be expanded to account for this, and this might be the best solution, but that will take quite a long time. Open to GBIF registry suggestions.
  2. hasTissue is not at all well represented by having the GGBN extension. We determine this information by detecting indicative strings in the preparations field. The code is here.

3 - 8 all use the parsers, as you have noted.

muttcg commented 3 years ago

@tucotuco Thank you,

  1. If a dataset contains only archaeological records, it will be possible to use the GBIF Registry and add a machine tag. If a dataset's records are mixed, the indicator must be something inside each individual record, such as basisOfRecord, as you mentioned (or maybe a custom term, etc.).

  2. Thanks, now I see the preparations term. However, The Denver Botanic Gardens' Tissue and DNA Bank (https://www.gbif.org/dataset/5e4a4046-2946-465d-b15a-29ad9d70238d) records don't have the preparations term, but do have the Preparation extension (http://data.ggbn.org/schemas/ggbn/terms/Preparation). Should the extension be used instead of the term in this case?

tucotuco commented 3 years ago

  1. I'll try to initiate a call for the addition of a new BasisOfRecord term.
  2. The existence of a preparation extension does not indicate the existence of tissues, so let's leave that out.

muttcg commented 3 years ago

@MortenHofft I am planning to add this structure into ES schema:

"dynamicProperties": {
  "type": "object",
  "properties": {
    "hasTissue": {"type": "boolean"},
    "mass": {"type": ""},
    ....etc
  }
},

MortenHofft commented 3 years ago

Clarification

I'm not sure what the plan is just from reading this issue.

Perhaps it is worth capturing more details, like:

Initial thoughts

API: We already have hasCoordinates so I suppose hasSex isn't all that alien. But we could also just ignore APIv1 for now and focus on getting it into the index and having it exposed in downloads and the hosted portals (both use the predicate query format - so sex:isNotNull instead of hasSex).

Index/response structure: I'm not keen on putting it all into a dynamicProperties field.

tucotuco commented 3 years ago

@MortenHofft Would you like me to provide definitions for all of the terms in the list? Beyond that, is there anything else that I can provide at this point?

timrobertson100 commented 3 years ago

Thanks @tucotuco

@muttcg is currently porting the interpretations from Python into Java, so we're able to process the data to the current VertNet spec. After this, we'll start looking at how that should best be surfaced in the Elasticsearch index, in the GraphQL API (which powers the hosted portals and will be the future GBIF.org), and also in the V1 API and downloads.

When we get to that stage we may have some questions but will probably start an issue on each feature so there is a clear thread and discussion to reference in the commits.

tucotuco commented 3 years ago

Thanks. We really appreciate the effort.

MattBlissett commented 3 years ago

Morten, Nik and I discussed this. This is the proposed result, but we'll continue the discussion on separate issues, so difficulties with one term don't block progress with others.

dwc:dynamicProperties is difficult to parse, and very difficult for users to use. Indexing the length, tissue and weight will create demand from other publishers for a way to share their own measurement data, so we should support the preferred way — using DWC core terms and extensions — from the start, and parse dwc:dynamicProperties as a fallback option.

We should take the same approach as we do for all GBIF interpretation: interpret data values to a common result (i.e. normalizing coordinate systems, metric units) to give comparable values across all occurrences.

isArchaeological

This is on hold, pending a new Basis of Record.

Query parameters hasTissue, hasSex, hasLifeStage, hasLength, hasWeight

This type of query can already be satisfied using GBIF download predicates, e.g. { "type":"isNotNull", "parameter":"SEX" }.

The VertNet hosted portal (i.e. GBIF react components) can make the appropriate GraphQL query for "has sex" filter, and pass on a suitable download predicate for GBIF downloads.

I don't think there's anything to be done in Pipelines for this, but there are small tasks for the hosted portal / web components.

NB possible implications of the FAQ on hasTissue on VertNet: http://vertnet.org/resources/help.html#t-tab3

→ Morten will create an issue in gbif-web (if needed).
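
As a rough illustration of the predicate approach described above, the following Python sketch builds a download request body in which "has sex" and "has life stage" are expressed as `isNotNull` predicates. Only the `isNotNull`/`SEX` predicate comes from this thread. The `LIFE_STAGE` parameter name, the `and` wrapper, the envelope fields (`creator`, `notificationAddresses`, `sendNotification`, `format`) and the example user are assumptions based on the public GBIF download API and should be checked against its documentation. Nothing is sent over the network.

```python
import json


def has_value(parameter):
    """Predicate equivalent of a hasX flag: the parameter is not null."""
    return {"type": "isNotNull", "parameter": parameter}


def all_of(*predicates):
    """Combine predicates so that every one of them must hold."""
    return {"type": "and", "predicates": list(predicates)}


def download_request(creator, email, predicate, fmt="SIMPLE_CSV"):
    """Assemble a download request body (assumed envelope, see lead-in)."""
    return {
        "creator": creator,                # hypothetical account name
        "notificationAddresses": [email],  # hypothetical address
        "sendNotification": True,
        "format": fmt,
        "predicate": predicate,
    }


# "hasSex AND hasLifeStage" expressed with predicates only.
body = download_request(
    "example_user",
    "user@example.org",
    all_of(has_value("SEX"), has_value("LIFE_STAGE")),
)
print(json.dumps(body, indent=2))
```

A portal filter labelled "has sex" could then map to `has_value("SEX")` without any dedicated hasSex field in the index.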

Interpretation and querying of sex, lifeStage

We already interpret dwc:sex and dwc:lifeStage. If these fields are not present, then we will inspect dwc:dynamicProperties using the VertNet parser, and run the extracted value through interpretation. This will probably require some additional values to be added to our interpretation dictionaries.

→ Nik will create an issue for this in pipelines.
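
The fallback order described above can be sketched in Python as follows. This is not the VertNet parser or the GBIF interpreter: the regular expression and the small dictionary are hypothetical stand-ins, used only to show the control flow (use dwc:sex if present, otherwise try dwc:dynamicProperties, then run whichever value was found through the same interpretation step).

```python
import re

# Hypothetical stand-in for the VertNet sex parser: finds "sex=M" or "sex: female".
SEX_PATTERN = re.compile(r"\bsex\s*[:=]\s*(?P<value>[A-Za-z]+)", re.IGNORECASE)

# Hypothetical stand-in for the interpretation dictionary.
SEX_DICTIONARY = {"m": "MALE", "male": "MALE", "f": "FEMALE", "female": "FEMALE"}


def interpret_sex(verbatim):
    """Map a verbatim value to a controlled value, or None if not parsable."""
    if not verbatim:
        return None
    return SEX_DICTIONARY.get(verbatim.strip().lower())


def sex_for_record(record):
    """Prefer dwc:sex; fall back to dwc:dynamicProperties only if it is empty."""
    verbatim = record.get("sex")
    if not verbatim:
        match = SEX_PATTERN.search(record.get("dynamicProperties") or "")
        verbatim = match.group("value") if match else None
    # Both paths go through the same interpretation step.
    return interpret_sex(verbatim)
```

The same shape would apply to lifeStage with its own parser and dictionary.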

Interpretation and querying of tissue

We can interpret dwc:preparations, which is not currently in our index. It will require a new parser and a new vocabulary. Initially, we need only support the values used by VertNet.

If dwc:preparations is empty, then we can look in dwc:dynamicProperties.

Note again the hasTissue FAQ on VertNet: http://vertnet.org/resources/help.html#t-tab3

→ Nik will create an issue for this in pipelines.

Interpretation and querying of length and mass

The preferred way to share these values will be through the MeasurementOrFact extension.

We will need a vocabulary and parser for mof:measurementType (to begin with, with at least the values required for VertNet), mof:measurementUnit (grams and metres? kilograms and metres? g and mm? Whichever, we'll need to handle a wide range of decimal values), mof:measurementValue (interpreted according to the unit).

If the MeasurementOrFact extension is not present, we can look into dwc:dynamicProperties for data.

This is a larger task. We'll need an additional extension in DWCA downloads, and a way to specify query parameters/predicates using the extension (e.g. "parameter":"MEASUREMENT_OR_FACT:MEASUREMENT_TYPE" or ...search?MeasurementOrFact:MeasurementType=LENGTH, TBD).

→ Nik will create issues for this (pipelines + gbif-api).
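
To make the unit question concrete, here is a small Python sketch of normalizing a MeasurementOrFact-style triple to common units. The target units (mm and g) are an assumption borrowed from the VertNet index field lengthinmm, not a decision recorded in this thread, and the conversion tables and type matching are deliberately minimal and hypothetical.

```python
# Hypothetical conversion tables; a real vocabulary would be much larger.
LENGTH_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "in": 25.4}
MASS_TO_G = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


def normalize_measurement(measurement_type, value, unit):
    """Return the measurement in mm or g, or None if it cannot be interpreted."""
    kind = measurement_type.strip().lower()
    unit_key = unit.strip().lower()

    if "length" in kind and unit_key in LENGTH_TO_MM:
        factor, target_unit = LENGTH_TO_MM[unit_key], "mm"
    elif ("mass" in kind or "weight" in kind) and unit_key in MASS_TO_G:
        factor, target_unit = MASS_TO_G[unit_key], "g"
    else:
        # Unknown type or unit: leave uninterpreted rather than guess.
        return None

    return {
        "measurementType": measurement_type,
        "measurementValue": float(value) * factor,
        "measurementUnit": target_unit,
    }
```

With values normalized this way, a range query such as "total length between 100 and 150 mm" compares like with like across publishers.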

muttcg commented 3 years ago

@timrobertson100 @MattBlissett @MortenHofft I finished the initial, simple version:

1) hasTissue: es term preparation exists (parse field DwcTerm.preparation -> if hasTissue -> add original value into index/hdfs)

2) hasSex: es term sex exists (if DwcTerm.sex is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.sex parser)

3) hasLifeStage: es term lifeStage exists (if DwcTerm.lifeStage is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.lifeStage parser)

4) length and hasLength (parse DwcTerm.dynamicProperties -> convert values to MeasurementAndFacts extension -> add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementValue, DwcTerm.measurementUnit) array into index/hdfs):

5) mass and hasWeight (parse DwcTerm.dynamicProperties -> convert values to MeasurementAndFacts extension -> add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementValue, DwcTerm.measurementUnit) array into index/hdfs):

Questions:

1) length and hasLength -> should I use "length" instead of vertnet parser result "total length"?

2) mass and hasWeight -> should I use "mass" instead of vertnet parser result "total length", "head-body length", "fork length", "standard length", "snout-vent length"?

3) hasTissue -> does the logic look ok?

MattBlissett commented 3 years ago

  1. hasTissue: es term preparation exists (Parse field DwcTerm.preparation -> if hasTissue -> add original value into index/hdfs)

This is only setting interpreted dwc:preparations if verbatim dwc:preparations matches one of the VertNet tissue types ("tiss", "blood" etc).

We don't need that filter -- people might want to search for other preparations. We should split the value on | etc, and store them in an array.

Later, we can interpret the values ("tiss" → "tissue" etc), but that requires a vocabulary first.
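
A minimal Python sketch of that suggestion: split the verbatim dwc:preparations value into an array and keep every part, with the tissue check applied afterwards rather than used as a filter. The `|` delimiter and the "tiss" and "blood" tokens are from this thread; treating `;` as a second delimiter is an assumption, and the real VertNet token list is longer than the two shown here.

```python
import re

# Only the two tokens mentioned in this thread; the VertNet list is longer.
TISSUE_TOKENS = ("tiss", "blood")


def split_preparations(verbatim):
    """Split a verbatim preparations string into a list of trimmed parts."""
    if not verbatim:
        return []
    # "|" is from the thread; ";" is an assumed additional delimiter.
    parts = re.split(r"[|;]", verbatim)
    return [part.strip() for part in parts if part.strip()]


def has_tissue(preparations):
    """True if any preparation contains one of the indicative tokens."""
    return any(
        token in preparation.lower()
        for preparation in preparations
        for token in TISSUE_TOKENS
    )
```

Stored this way, all preparations stay searchable, and a hasTissue-style filter becomes a query over the array rather than a condition for indexing the field.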

  2. hasSex: es term sex exists (If DwcTerm.sex is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.sex parser)

This looks fine.

  3. hasLifeStage: es term lifeStage exists (If DwcTerm.lifeStage is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.lifeStage parser)

This also looks fine.

  4. length and hasLength (parse DwcTerm.dynamicProperties -> convert values to MeasurementAndFacts extension -> add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementValue, DwcTerm.measurementUnit) array into index/hdfs):

    1. length and hasLength -> should I use "length" instead of vertnet parser result "total length"?

    2. mass and hasWeight -> should I use "mass" instead of vertnet parser result "total length", "head-body length", "fork length", "standard length", "snout-vent length"?

For the moment, with the data going no further than ES and Avro on HDFS, this is not a problem.

I don't know what our final (API, search, downloads) behaviour should be, and we need to decide that.

tucotuco commented 3 years ago

There is a lot going on in this thread. Some of it is of concern from the perspective of replicating VertNet capabilities. Is it best to wait for the issues to be separated out and then comment on them, or now before things proceed much further?

MattBlissett commented 3 years ago

Thanks John; there are several overlapping topics, and I've tried to split them up.

Extracting data from dynamicProperties:


There's also the question about hasSex and so on as query filters. Our proposal is to handle these like almost all existing terms in the GBIF API, i.e. to support queries for null, not null, and a list of values. I think that discussion can continue here if necessary, or else on https://github.com/gbif/hp-vertnet-plus/
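
The three query modes named above (null, not null, and a list of values) can be illustrated with a small Python sketch over in-memory records. This shows the semantics only and is not the GBIF API: the mode names `isNull` and `in` are labels chosen for the example, alongside the `isNotNull` used earlier in this thread, and treating an empty string as null is an assumption.

```python
def matches(record, field, mode, values=()):
    """Evaluate one generic filter against a record dictionary."""
    value = record.get(field)
    # Assumption: an empty string counts as "no value".
    present = value is not None and value != ""

    if mode == "isNull":
        return not present
    if mode == "isNotNull":
        return present
    if mode == "in":
        return present and value in values
    raise ValueError(f"unknown mode: {mode}")


records = [
    {"id": 1, "sex": "FEMALE"},
    {"id": 2, "sex": None},
    {"id": 3, "sex": "MALE"},
]

# hasSex is simply the "isNotNull" mode applied to the sex field.
with_sex = [r["id"] for r in records if matches(r, "sex", "isNotNull")]
without_sex = [r["id"] for r in records if matches(r, "sex", "isNull")]
females = [r["id"] for r in records if matches(r, "sex", "in", ("FEMALE",))]
```

Because the same three modes apply to any field, no per-term hasX flag is needed.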

tucotuco commented 3 years ago

Thanks Matt. I have commented on each of those new issues.

I wholeheartedly support the proposal to handle hasX generically in the way you described. It increases the capabilities greatly and beyond the terms we targeted for VertNet.