gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Add interpretation for DWC term - preparations #474

Open muttcg opened 3 years ago

muttcg commented 3 years ago

As part of the VertNet feature, we need to interpret the preparations field and add it to the index and HDFS schemas.

Use VertNet feature branch

MattBlissett commented 3 years ago

hasTissue - indicates that the content of the preparations field can be interpreted to infer the existence of material sample(s) that can be used for DNA sequencing.

We don't yet interpret the DWC preparations term; this issue is to add it.

VertNet's query "is there tissue that can be used for DNA sequencing" is then an OR-query on preparations for the normalized values VertNet uses: "TISSUE", "BLOOD" etc (not "tiss" as that will become "TISSUE"). (The query excludes preparations like "fossil" or "photograph".)
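A minimal sketch of that OR-check, assuming the preparations have already been normalized and split into a list. The class and method names are hypothetical, and the set only contains the two values named above ("TISSUE", "BLOOD"); VertNet's real list is longer.

```java
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;

// Hypothetical helper, not part of gbif/pipelines.
final class TissueCheck {

  // Only the two values named in this thread; the full VertNet list is not reproduced here.
  private static final Set<String> TISSUE_VALUES = Set.of("TISSUE", "BLOOD");

  private TissueCheck() {}

  /** Returns true if at least one normalized preparation is a tissue value (OR semantics). */
  static boolean hasTissue(List<String> normalizedPreparations) {
    if (normalizedPreparations == null) {
      return false;
    }
    return normalizedPreparations.stream()
        .filter(Objects::nonNull)
        .map(value -> value.trim().toUpperCase(Locale.ROOT))
        .anyMatch(TISSUE_VALUES::contains);
  }
}
```

A record with preparations "SKULL" and "BLOOD" would match, while one with only "FOSSIL" and "PHOTOGRAPH" would not. An un-normalized "tiss" does not match either, which is why normalization has to happen first.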

MattBlissett commented 3 years ago

Implementation in #477:

hasTissue: es term preparation exists (Parse field DwcTerm.preparation -> if hasTissue -> add original value into index/hdfs)

This is only setting interpreted dwc:preparations if verbatim dwc:preparations matches one of the VertNet tissue types ("tiss", "blood" etc).

We don't need that filter -- people might want to search for other preparations. We should split the value on | etc, and store them in an array.

Later, we can interpret the values ("tiss" → "tissue" etc), but that requires a vocabulary first.
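A rough sketch of the splitting step, with no interpretation beyond trimming. The class name is hypothetical, and the delimiter set is an assumption: only "|" is named above, and ";" is added here as a guess at the "etc".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the code from #477.
final class PreparationsSplitter {

  // Assumed delimiters: "|" comes from the discussion, ";" is a guess.
  private static final Pattern DELIMITERS = Pattern.compile("[|;]");

  private PreparationsSplitter() {}

  /** Splits a verbatim dwc:preparations value into trimmed, non-empty items. */
  static List<String> split(String verbatim) {
    List<String> items = new ArrayList<>();
    if (verbatim == null) {
      return items;
    }
    for (String part : DELIMITERS.split(verbatim)) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        items.add(trimmed);
      }
    }
    return items;
  }
}
```

The values are kept verbatim (original case, original language), so any preparation stays searchable until a vocabulary exists.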

tucotuco commented 3 years ago

I don't understand whether this means that tissues will be fully detected. For example, if preparations contains "piel | cranio | riñon" and a preparations vocabulary says that "riñon" is "kidney", that will be great, but how will we establish that this signifies a viable tissue? It will require a second vocabulary that says which standardized preparation values constitute tissues, and then we either add that to the list or use a separate "hasTissue" index in the way we did. I'm just trying to confirm that the "hasTissue" concept will remain functional under the proposed scenario.
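To make the two-lookup idea concrete, here is a sketch under stated assumptions: one vocabulary maps verbatim values to standardized ones, and a second list says which standardized values count as tissue. Every entry below is invented for illustration (including mapping "riñon" to KIDNEY, its usual translation, and treating KIDNEY as tissue); neither vocabulary exists yet, and the class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical two-step interpreter; all vocabulary entries are illustrative only.
final class PreparationsInterpreter {

  // Lookup 1: verbatim value (lower-cased) -> standardized value.
  // "ri\u00f1on" is the Spanish example from the thread, written with a Unicode escape.
  private static final Map<String, String> VOCABULARY = Map.of(
      "piel", "SKIN",
      "cranio", "SKULL",
      "ri\u00f1on", "KIDNEY",
      "tiss", "TISSUE");

  // Lookup 2: which standardized values are treated as viable tissue (assumed list).
  private static final Set<String> TISSUE_VALUES = Set.of("TISSUE", "BLOOD", "KIDNEY");

  private PreparationsInterpreter() {}

  /** Splits on "|" and maps each item through the vocabulary; unknown items stay verbatim. */
  static List<String> normalize(String verbatim) {
    List<String> result = new ArrayList<>();
    if (verbatim == null) {
      return result;
    }
    for (String part : verbatim.split("\\|")) {
      String trimmed = part.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      String key = trimmed.toLowerCase(Locale.ROOT);
      result.add(VOCABULARY.getOrDefault(key, trimmed));
    }
    return result;
  }

  /** True if any standardized value appears in the tissue list. */
  static boolean hasTissue(String verbatim) {
    return normalize(verbatim).stream().anyMatch(TISSUE_VALUES::contains);
  }
}
```

With these made-up entries, "piel | cranio | riñon" would normalize to SKIN, SKULL, KIDNEY and count as having tissue, whereas "piel | cranio" would not. Whether the tissue flag is derived at query time like this or stored as its own index field is the open question.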

timrobertson100 commented 3 years ago

My recommendation would be to start with adding preparations to the index as a multivalued field of Strings only (no interpretation other than splitting into the array of values). Once in operation, we can build a vocabulary for that to normalize the terms.

As a second, future step I propose we consider the VertNet wish for hasTissue in a broader sense. It may be something to implement, but I feel it would be more useful if we 1) categorize the types of evidence available to support the assertion of occurrence and 2) capture what means there are for a consumer to verify the identification. This could include whether a specimen exists, whether genetic material is available, what media are available that can be reviewed etc. Determining what evidence is available, and whether the identification can be verified, is currently difficult because that information is scattered inconsistently across DwC terms.

dhobern commented 3 years ago

My approximate view is that hasTissue/hasPreparations is less important than stillExistsInSomePhysicalFormThatCouldInPrincipleBeExaminedOrStudiedOrSequencedEtc. In other words - is there a specimen or some part of a specimen that remains and can be studied?

Not sure if that helps in any way ...

tucotuco commented 3 years ago

Whereas I understand and applaud the broader vision, the need to determine which specimens have viable tissues arose from a concrete need within the community actually sharing their data. We implemented it as an elegant and simple solution to a request to create a Tissues portal. For these reasons it was and remains an important capability of the VertNet portal, one that we would really like to replicate if we are going to abandon that portal for a hosted one.
