Open tucotuco opened 4 years ago
`hasSex` and `hasLifeStage` would be unusual (not impossible) for GBIF to add. Generally, GBIF flag things that can't be interpreted, but otherwise, for fields that are interpreted, we either populate the field or nullify it (the verbatim values are always available). As you know, we are working on open vocabularies for `Sex` and `LifeStage`.
Would it be acceptable to explore options where they are treated consistently with other fields, where `sex` and `life stage` are either populated or flagged as `Not parsable`? The benefit would be that the API remains intuitive to users.
I don't know about Vertnet itself (it may have its own needs) but upstream we can update the output format pretty easily.
These flags are just two among many. The use of flags like these is very intuitive, whereas the proposed alternative isn't quite so. So, at this point I would say it is not the same, nor could the same results be retrieved, but will discuss further in our group.
From a different chat with @tucotuco:
The [more complete] list of "additional" index fields includes the following: 'keyname', 'haslicense', 'vntype', 'rank', 'mappable', 'hashid', 'hastypestatus', 'wascaptive', 'wasinvasive', 'hastissue', 'hasmedia', 'isfossil', 'haslength', 'haslifestage', 'hasmass', 'hassex', 'lengthinmm', 'lengthtype', 'massing', 'lengthunitsinferred', 'massunitsinferred', 'underivedlifestage', 'underivedsex', 'isarch'
The core of the VertNet record-level harvest processing is here: https://github.com/VertNet/post-harvest-processor/blob/master/lib/harvest_record_processor.py#L119
Hi all! I am mapping data to start writing an interpretation, so @tucotuco please correct me if I'm wrong:
1) isArchaeological
Original function
    def is_archaeological(rec):
        """ Check if a record represents an archaeological specimen.
        parameters:
            rec - dictionary to search (required)
        returns:
            True if the dictionary represents an archaeological specimen, otherwise False.
        """
        if rec.has_key('networks'):
            if 'arch' in rec['networks'].lower():
                return 1
        return 0
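For the port, a Python 3 rendering of the same check might look like the sketch below. It assumes the record is a plain dict and that a missing or empty `networks` value should count as "not archaeological"; returning a real boolean instead of 1/0 is my choice, made to match the docstring.

```python
def is_archaeological(rec: dict) -> bool:
    """Return True if the record's 'networks' value mentions 'arch'."""
    # dict.has_key() no longer exists in Python 3; .get() also covers a missing key.
    networks = rec.get("networks") or ""
    # Case-insensitive substring test, as in the original function.
    return "arch" in str(networks).lower()
```

This keeps the original substring semantics, so any network name containing "arch" matches.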
I didn't find `networks` or `arch` in a record fragment. For the original record, I used isArchaeological on the VertNet portal and the GBIF fragment.
What data do I need to use from the DwC-A record for the isArchaeological flag? The http://rs.tdwg.org/chrono/terms/ChronometricAge extension? "institutionCode": "FLARCH"? GBIF registry data?
2) hasTissue
    "extensions": {
      "http://data.ggbn.org/schemas/ggbn/terms/MaterialSample": [
        {
          "materialSampleType": "tissue"
        }
      ]
    }
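As a rough sketch of how that check could read, assuming the record arrives as a dict shaped like the fragment above. The function name and the exact-match rule on "tissue" are assumptions for illustration, not the VertNet logic.

```python
MATERIAL_SAMPLE = "http://data.ggbn.org/schemas/ggbn/terms/MaterialSample"


def has_tissue(record: dict) -> bool:
    """True if any MaterialSample extension row declares a tissue sample."""
    rows = record.get("extensions", {}).get(MATERIAL_SAMPLE, [])
    # Assumption: only the literal value "tissue" (any case) counts.
    return any(
        (row.get("materialSampleType") or "").strip().lower() == "tissue"
        for row in rows
    )
```

Records without the extension simply return False, so the check is safe to run on every record.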
3) hasSex
`http://rs.tdwg.org/dwc/terms/sex`
or
Use [sex_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/sex_parser.py) for a value in dwca dynamicProperties/dynamicProperties/fieldNotes
4) hasLifeStage
`http://rs.tdwg.org/dwc/terms/lifeStage`
or
Use [life_stage_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/life_stage_parser.py) for a value in dwca dynamicProperties/dynamicProperties/fieldNotes
5) hasLength
has **length**
6) hasWeight
has **mass**
7) length
Use [total_length_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/total_length_parser.py) for a value in dwca dynamicProperties/dynamicProperties/fieldNotes
8) mass
Use [body_mass_parser.py](https://github.com/VertNet/post-harvest-processor/blob/master/lib/trait_parsers/body_mass_parser.py) for a value in dwca dynamicProperties/dynamicProperties/fieldNotes
@muttcg In response to the numbered list in the previous comment:
@tucotuco Thank you,
1) If a dataset contains only archaeological records, it will be possible to use the GBIF Registry and add a machine tag. But if a dataset's records are mixed, the flag must come from something inside the particular record, as you mentioned: basisOfRecord (maybe a custom term, etc.).
2) Thanks, now I see the preparations term. However, [The Denver Botanic Gardens' Tissue and DNA Bank](https://www.gbif.org/dataset/5e4a4046-2946-465d-b15a-29ad9d70238d) records don't have the preparations term, but they do have the [Preparation extension](http://data.ggbn.org/schemas/ggbn/terms/Preparation). Should the extension be used instead of the term in this case?
@MortenHofft I am planning to add this structure into the ES schema:

    "dynamicProperties": {
      "type": "object",
      "properties": {
        "hasTissue": {"type": "boolean"},
        "mass": {"type": ""},
        ....etc
      }
    },
I'm not sure what the plan is just from reading this issue.
Perhaps it is worth capturing more details, like: `sex` vs `dynamicProperties.sex` vs `MeasurementOrFacts.sex`, what flags we will add, and what the vocabularies are, if any.
`dynamicProperties` seems confusing, since that field is already in the standard and is used for storing other data (as well), and secondly we already have a `sex` field.
API
We already have `hasCoordinates`, so I suppose `hasSex` isn't all that alien. But we could also just ignore APIv1 for now and focus on getting it into the index and having it exposed in downloads and the hosted portals (both use the predicate query format, so `sex:isNotNull` instead of `hasSex`).
Index/response structure
I'm not keen on putting it all into a `dynamicProperties` field.
- `sex`, `lifeStage`: those seem to fit our existing dwc terms perfectly. Our normal approach is to put interpreted values there, possibly adding a flag if it is derived, and if needed add an option to search verbatim values.
- `lengthinmm`, `lengthtype` and other fields without a dwcTerm equivalent: those seem to fit well with `MeasurementOrFacts`. Could we start indexing that extension in general and transform properties like `lengthinmm`, deduced from other fields, to that extension? Or if that isn't possible/appropriate, then group them, e.g. `traits.lengthinmm`.
- `MeasurementOrFacts` to avoid unmanageable data volumes.
@MortenHofft Would you like me to provide definitions for all of the terms in the list? Beyond that, is there anything else that I can provide at this point?
Thanks @tucotuco
@muttcg is currently porting the interpretations from Python into Java, so we're able to process the data to the current VertNet spec. After this, we'll start looking at how that should best be surfaced in the Elasticsearch index, in the GraphQL API (which powers the hosted portal and will power the future GBIF.org), and also in the V1 API and downloads.
When we get to that stage we may have some questions but will probably start an issue on each feature so there is a clear thread and discussion to reference in the commits.
Thanks. We really appreciate the effort.
Morten, Nik and I discussed this. This is the proposed result, but we'll continue the discussion on separate issues, so difficulties with one term don't block progress with others.
`dwc:dynamicProperties` is difficult to parse, and very difficult for users to use. Indexing the length, tissue and weight will create demand from other publishers for a way to share their own measurement data, so we should support the preferred way (using DWC core terms and extensions) from the start, and parse `dwc:dynamicProperties` as a fallback option.
We should take the same approach as we do for all GBIF interpretation: interpret data values to a common result (i.e. normalizing coordinate systems, metric units) to give comparable values across all occurrences.
isArchaeological
This is on hold, pending a new Basis of Record.
`hasTissue`, `hasSex`, `hasLifeStage`, `hasLength`, `hasWeight`
This type of query can already be satisfied using GBIF download predicates, e.g. `{ "type":"isNotNull", "parameter":"SEX" }`.
The VertNet hosted portal (i.e. GBIF react components) can make the appropriate GraphQL query for "has sex" filter, and pass on a suitable download predicate for GBIF downloads.
I don't think there's anything to be done in Pipelines for this, but there are small tasks for the hosted portal / web components.
NB possible implications of the FAQ on `hasTissue` on VertNet: http://vertnet.org/resources/help.html#t-tab3
→ Morten will create an issue in gbif-web (if needed).
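To illustrate how a portal could turn "has sex and has life stage" filters into a download predicate, here is a small sketch that only builds the JSON. The `isNotNull` shape comes from the example above; the `and` combinator with a `predicates` list and the `LIFE_STAGE` parameter name are my understanding of the download predicate format and should be checked against the API documentation. The helper names are made up for this example.

```python
import json


def is_not_null(parameter: str) -> dict:
    """Predicate matching records where the given parameter has a value."""
    return {"type": "isNotNull", "parameter": parameter}


def all_of(*predicates: dict) -> dict:
    """Combine predicates so that every one of them must match."""
    return {"type": "and", "predicates": list(predicates)}


# "hasSex AND hasLifeStage" expressed with generic predicates.
predicate = all_of(is_not_null("SEX"), is_not_null("LIFE_STAGE"))
print(json.dumps(predicate, indent=2))
```

No request is sent here; the resulting dict would go into the predicate part of a download request body.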
We already interpret `dwc:sex` and `dwc:lifeStage`. If these fields are not present, then we will inspect `dwc:dynamicProperties` using the VertNet parser, and run the extracted value through interpretation. This will probably require some additional values to be added to our interpretation dictionaries.
→ Nik will create an issue for this in pipelines.
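The fallback order could be sketched roughly as follows. The regular expression is a hypothetical stand-in for the VertNet sex parser (the real parser handles far more patterns), and the record is assumed to be a dict keyed by simple term names. The returned flag records whether the value was derived, so it can be surfaced later.

```python
import re

# Hypothetical stand-in for the VertNet parser: matches "sex=<word>" or "sex: <word>".
_SEX_PATTERN = re.compile(r"\bsex\s*[=:]\s*([A-Za-z]+)", re.IGNORECASE)


def extract_sex(record: dict):
    """Return (verbatim_value, derived) using dwc:sex first, then dynamicProperties."""
    explicit = (record.get("sex") or "").strip()
    if explicit:
        return explicit, False  # supplied by the publisher
    match = _SEX_PATTERN.search(record.get("dynamicProperties") or "")
    if match:
        return match.group(1), True  # derived from dynamicProperties
    return None, False
```

Whatever comes back would then go through the regular sex interpretation, exactly as a publisher-supplied value would.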
We can interpret `dwc:preparations`, which is not currently in our index. It will require a new parser and a new vocabulary. Initially, we need only support the values used by VertNet.
If `dwc:preparations` is empty, then we can look in `dwc:dynamicProperties`.
Note again the `hasTissue` FAQ on VertNet: http://vertnet.org/resources/help.html#t-tab3
→ Nik will create an issue for this in pipelines.
The preferred way to share these values will be through the MeasurementOrFact extension.
We will need a vocabulary and parser for `mof:measurementType` (to begin with, with at least the values required for VertNet), `mof:measurementUnit` (grams and metres? kilograms and metres? g and mm? Whichever, we'll need to handle a wide range of decimal values), `mof:measurementValue` (interpreted according to the unit).
If the `MeasurementOrFact` extension is not present, we can look into `dwc:dynamicProperties` for data.
This is a larger task. We'll need an additional extension in DWCA downloads, and a way to specify query parameters/predicates using the extension (e.g. `"parameter":"MEASUREMENT_OR_FACT:MEASUREMENT_TYPE"` or `...search?MeasurementOrFact:MeasurementType=LENGTH`, TBD).
→ Nik will create issues for this (pipelines + gbif-api).
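To make the unit question concrete, here is a minimal sketch of normalizing a MeasurementOrFact row to a common unit. Choosing mm and g as targets is an assumption (the units are still undecided above), the conversion tables are deliberately tiny, and the function name is hypothetical.

```python
# Assumed target units: millimetres for lengths, grams for masses.
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "in": 25.4}
TO_G = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


def normalize_measurement(mof: dict) -> dict:
    """Return a copy of the row with value and unit converted to mm or g."""
    unit = (mof.get("measurementUnit") or "").strip().lower()
    if unit in TO_MM:
        value = float(mof["measurementValue"]) * TO_MM[unit]
        return {**mof, "measurementValue": value, "measurementUnit": "mm"}
    if unit in TO_G:
        value = float(mof["measurementValue"]) * TO_G[unit]
        return {**mof, "measurementValue": value, "measurementUnit": "g"}
    # Unknown unit: leave the row untouched rather than guess.
    return mof
```

For example, a row with value "12.5" and unit "cm" comes back as 125.0 mm, while a row with an unrecognized unit is returned unchanged.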
@timrobertson100 @MattBlissett @MortenHofft I finished the initial, simple version:
1) hasTissue: ES term preparation exists (parse field DwcTerm.preparation -> if hasTissue -> add original value into index/hdfs)
2) hasSex: ES term sex exists (if DwcTerm.sex is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.sex parser)
3) hasLifeStage: ES term lifeStage exists (if DwcTerm.lifeStage is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.lifeStage parser)
4) length and hasLength (parse DwcTerm.dynamicProperties -> convert values to MeasurementAndFacts extension -> add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementvalue, DwcTerm.measurementunit) array into index/hdfs)
5) mass and hasWeight (parse DwcTerm.dynamicProperties -> convert values to MeasurementAndFacts extension -> add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementvalue, DwcTerm.measurementunit) array into index/hdfs)
Questions:
1) length and hasLength -> should I use "length" instead of vertnet parser result "total length"?
2) mass and hasWeight -> should I use "mass" instead of vertnet parser result "total length", "head-body length", "fork length", "standard length", "snout-vent length"?
3) hasTissue -> does the logic look ok?
- hasTissue: es term preparation exists (Parse field DwcTerm.preparation -> if hasTissue -> add original value into index/hdfs)
This is only setting interpreted dwc:preparations if verbatim dwc:preparations matches one of the VertNet tissue types ("tiss", "blood" etc).
We don't need that filter -- people might want to search for other preparations. We should split the value on `|` etc, and store them in an array.
Later, we can interpret the values ("tiss" → "tissue" etc), but that requires a vocabulary first.
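A sketch of the splitting step, under the assumption that "`|` etc" means pipe and semicolon delimiters; the actual delimiter set is still to be decided, and the function name is made up here. The values are kept verbatim apart from trimming whitespace.

```python
import re

# Assumed delimiters: pipe and semicolon, with optional surrounding whitespace.
_DELIMITERS = re.compile(r"\s*[|;]\s*")


def split_preparations(verbatim):
    """Split a verbatim preparations string into a list of trimmed, non-empty values."""
    if not verbatim:
        return []
    parts = (part.strip() for part in _DELIMITERS.split(verbatim))
    return [part for part in parts if part]
```

Vocabulary lookup ("tiss" to "tissue" and so on) can then run per array element once the vocabulary exists.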
- hasSex: es term sex exists (If DwcTerm.sex is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.sex parser)
This looks fine.
- hasLifeStage: es term lifeStage exists (If DwcTerm.lifeStage is empty -> parse DwcTerm.dynamicProperties -> apply regular DwcTerm.lifeStage parser)
This also looks fine.
length and hasLength (parse DwcTerm.dynamicProperties -> convert values to MeasurementAndFacts extension -> add MeasurementAndFacts (3 fields: DwcTerm.measurementType, DwcTerm.measurementvalue, DwcTerm.measurementunit) array into index/hdfs):
length and hasLength -> should I use "length" instead of vertnet parser result "total length"?
mass and hasWeight -> should I use "mass" instead of vertnet parser result "total length", "head-body length", "fork length", "standard length", "snout-vent length"?
For the moment, with the data going no further than ES and Avro on HDFS, this is not a problem.
I don't know what our final (API, search, downloads) behaviour should be, and we need to decide that.
There is a lot going on in this thread. Some of it is of concern from the perspective of replicating VertNet capabilities. Is it best to wait for the issues to be separated out and then comment on them, or now before things proceed much further?
Thanks John; there are several overlapping topics, and I've tried to split them up.
Extracting data from dynamicProperties:
There's also the question about `hasSex` and so on as query filters. Our proposal is to handle these like almost all existing terms in the GBIF API, i.e. to support queries for null, not null, and a list of values. I think that discussion can continue here if necessary, or else on https://github.com/gbif/hp-vertnet-plus/
Thanks Matt. I have commented on each of those new issues.
I wholeheartedly support the proposal to handle hasX generically in the way you described. It increases the capabilities greatly and beyond the terms we targeted for VertNet.
@timrobertson100 asked me to post a list of data indexes we use in VertNet that are not possible yet in GBIF. The reasoning is that there are ever fewer reasons to maintain a portal based on technology other than that used by GBIF. One of the biggest remaining reasons is that we support the following search indexes, and our community values these capabilities:
isArchaeological - indicates if the record is based on zooarchaeological or archaeobotanical material. This is something that could be solved by the introduction of a new basisOfRecord value.
hasTissue - indicates that the content of the preparations field can be interpreted to infer the existence of material sample(s) that can be used for DNA sequencing.
hasSex - indicates that a value has been given for the sex field.
hasLifeStage - indicates that a value has been given for the lifeStage field.
hasLength - indicates that the Traiter code (which processes occurrenceRemarks, dynamicProperties, and fieldNotes, https://github.com/rafelafrance/traiter/network/members) was able to extract a length measurement with units.
hasWeight - indicates that the Traiter code was able to extract a weight measurement with units.
length - search for a combination of length type (with controlled vocabulary coming from Traiter), and a range for values in mm
mass - search for values in a range from values extracted by Traiter.
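As an illustration of the `length` search described above, a filter over already-extracted trait values might look like this. The field names `lengthtype` and `lengthinmm` are taken from the list of additional index fields; the record-as-dict shape and the function name are assumptions for this sketch.

```python
def matches_length(rec: dict, length_type: str, min_mm: float, max_mm: float) -> bool:
    """True if the record has the given length type with a value inside [min_mm, max_mm]."""
    if rec.get("lengthtype") != length_type:
        return False
    value = rec.get("lengthinmm")
    # Records without an extracted length never match a range query.
    return value is not None and min_mm <= float(value) <= max_mm


records = [
    {"lengthtype": "total length", "lengthinmm": 152.0},
    {"lengthtype": "fork length", "lengthinmm": 150.0},
    {"lengthtype": "total length"},
]
hits = [r for r in records if matches_length(r, "total length", 100, 200)]
```

Here only the first record matches: the second has a different length type and the third has no extracted value.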
Other important considerations are:
1) VertNet as a community has a strong identity, and it would therefore be very useful to instantiate a portal for VertNet that is a skin and filter on all of GBIF.
2) VertNet has had the luxury to implement innovations rapidly and it would be great to be able to continue to innovate in an agile scenario where data are on GBIF infrastructure.
3) VertNet has been very active in fostering data publishing in the zooarchaeological community, and these activities are on the cusp of exploding.
4) Along with 3), VertNet has developed and maintained (and will soon propose as a TDWG standard - Task Group nearly finished with activities) the ChronometricAge Extension. It would be of great value to be able to support indexing on ChronometricDate attributes.
There are many other issues of concern, but the purpose of creating this issue is mostly to lay out the kinds of terms that the VertNet community has come to value. VertNet colleagues merit mention: @rafelafrance, @dbloom, @melecoq, @pzermoglio, @robgur, @rondlg