Semantic annotation - Githubissues

clnsmth commented 5 years ago

What are the different ways in which EMLassemblyline can integrate with semantic annotation and what are the respective strengths and weaknesses of each?

earnaud commented 4 years ago

Hi! My colleague (@yvanlebras) and I made some progress on advanced metadata tools while participating in the Ocean Hackathon 2019. The results are available in this repository: https://github.com/pole-national-donnees-biodiversite/OB1.metadata . An early Shiny app lets users input files, detect a wide range of metadata fields, and browse ontology terms. I am also looking into what the EML and emld package teams are doing, to get some clues on how semantics could be integrated into EAL.

clnsmth commented 4 years ago

Thanks for looking into this @earnaud.

clnsmth commented 4 years ago

Implementation option 1 ...

The workflow:

1. Template all metadata (i.e. the normal EMLassemblyline process stopping at make_eml()).
2. Run template_annotations() to gather all EML elements within the templates that can be annotated and write to a long table pre-populated with predicate and object labels and URIs (to the fullest extent possible).
3. Complete the annotations.txt template.
4. Run make_eml().

All annotations will be listed under the /eml/annotations element, which is a simpler implementation than placing them directly under the corresponding subject elements.

Some benefits:

1. template_annotations() is an optional step in the EMLassemblyline process rather than a refactoring of the existing workflow.
2. The long format of annotations.txt makes it easy for users to add annotations.
3. Some existing templates are a mix of elements that can and cannot be annotated, so trying to squeeze in annotation fields would require row-specific logic that may confuse the user (e.g. personnel.txt).
4. We could implement an option to annotate EML 2.1 records where template_annotations() reads the EML, writes the annotatable elements to annotations.txt, reads the user-completed template, then runs make_eml() with a new code block that inserts the annotations and writes to .xml.

Some issues:

Contextual information is lost by annotating the subjects outside of their parent template. However, this could be mitigated by creating a “context” field alongside the element IDs (likely UUIDs) with values that are a composite of the corresponding element and value (for folks that are familiar with EML), or some translated version thereof (for those not familiar with EML, which is what we likely want to support).

atn38 commented 4 years ago

It's been a while since I've used EMLassemblyline, but I'd advocate for placing annotations in context, especially at the attribute level. Is it possible/feasible to amend the attribute table template to include annotations? E.g., add two new columns for annotation label and URI, while the object property is assumed to be "isAMeasurementOf". You only get one annotation per attribute this way, but perhaps that's enough.
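A minimal sketch of the two-column idea in base R, to make the trade-off concrete. The column names annotation_label and annotation_uri, the example attributes, and the example.org URI are all hypothetical and are not part of any EMLassemblyline template; the predicate label is the one assumed in the comment above.

```r
# Hypothetical attributes template with two extra annotation columns
attributes <- data.frame(
  attributeName = c("site", "air_temp"),
  annotation_label = c("", "air temperature"),
  annotation_uri = c("", "http://example.org/ontology/air_temperature"),
  stringsAsFactors = FALSE
)

# Keep only attributes that were actually annotated
annotated <- attributes[nzchar(attributes$annotation_uri), ]

# Expand to one subject-predicate-object row per annotated attribute.
# The predicate is fixed, so each attribute gets at most one annotation.
long <- data.frame(
  subject = annotated$attributeName,
  predicate_label = rep("isAMeasurementOf", nrow(annotated)),
  object_label = annotated$annotation_label,
  object_uri = annotated$annotation_uri,
  stringsAsFactors = FALSE
)
```

Because the predicate is implied by the column rather than stored per row, a second annotation for the same attribute has nowhere to go, which is the restriction noted above.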

yvanlebras commented 4 years ago

Hi Colin, everyone! I think it makes sense to write annotations to a separate annotations.txt file while keeping the context. In our case we will provide a Shiny app to create such links, so (if I am not wrong, @earnaud ;) ) it is not so important for us whether this EMLAL functionality is directly tied to a specific context such as the attribute level: our user interface will let the user add a semantic annotation while describing the attributes (and yes, semantic annotation is definitely of particular interest at the attribute level). But for users working with EMLAL from the R command line, I understand this could be quite difficult... Is there a way in EMLAL to have a hybrid approach (i.e. a separate annotations.txt file that is pre-filled by previous EMLAL steps)?

earnaud commented 4 years ago

Hi all, I think I like the way you propose to include annotations in templates, @clnsmth. In my experience with MetaShARK, I end up with a list structured in three levels: 1/ root, 2/ EAL modules and 3/ EML contents. The third level is generally a vector or table, and it seems easy to me to add an optional last element (item or column) organized as follows: ontology::term. If this element is empty, which is easy to check, make_eml() could ignore it. In MetaShARK, adding an annotation module to each EAL step would be a minor update.
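A rough base R sketch of how such an optional ontology::term element could be read, skipping empty entries. The function name parse_annotation_field and the example values are hypothetical; this is not MetaShARK or EMLassemblyline code.

```r
# Split "ontology::term" strings into their two parts, ignoring empty or NA entries
parse_annotation_field <- function(x) {
  x <- trimws(x)
  filled <- !is.na(x) & nzchar(x)
  parts <- strsplit(x[filled], "::", fixed = TRUE)
  data.frame(
    index = which(filled),                               # position in the input
    ontology = vapply(parts, `[`, character(1), 1),      # text before "::"
    term = vapply(parts, `[`, character(1), 2),          # text after "::" (NA if missing)
    stringsAsFactors = FALSE
  )
}

# Illustrative input: two annotated elements, one empty, one missing
parsed <- parse_annotation_field(c("ENVO::lake", "", NA, "NCBITaxon::Mammalia"))
```

Entries without a "::" separator end up with an NA term, which a caller could treat as malformed.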

clnsmth commented 4 years ago

> It's been a while since I've used EMLassemblyline, but I'd advocate for placing annotations in context, especially at the attribute level. Is it possible/feasible to amend the attribute table template to include annotations? E.g., add two new columns for annotation label and URI, while the object property is assumed to be "isAMeasurementOf". You only get one annotation per attribute this way, but perhaps that's enough.

Thanks for these comments @atn38. Yes, I totally agree with the importance of retaining context and yes, additional fields could be added to the attributes template to accommodate annotation. However, as you mention, this would only enable one annotation per attribute, which is a restriction we'll have to weigh against the long table implementation.

Another benefit of a single annotations template is that the user would know exactly where to go for reading and editing this content. Not all elements support annotation, and the corresponding mix among templates may cause some annotatable elements to be overlooked. Additionally, the logic issue of point 3 in the benefits listed above is problematic.

> Hi Colin, everyone! I think it makes sense to write annotations to a separate annotations.txt file while keeping the context. In our case we will provide a Shiny app to create such links, so (if I am not wrong, @earnaud ;) ) it is not so important for us whether this EMLAL functionality is directly tied to a specific context such as the attribute level: our user interface will let the user add a semantic annotation while describing the attributes (and yes, semantic annotation is definitely of particular interest at the attribute level). But for users working with EMLAL from the R command line, I understand this could be quite difficult... Is there a way in EMLAL to have a hybrid approach (i.e. a separate annotations.txt file that is pre-filled by previous EMLAL steps)?

Thanks for this update and questions @yvanlebras. Yes, the annotations.txt file or the corresponding data frame could be created independently of a template_attributes() function, thereby supporting annotation through your Shiny app.

> I think I like the way you propose to include annotations in templates, @clnsmth. In my experience with MetaShARK, I end up with a list structured in three levels: 1/ root, 2/ EAL modules and 3/ EML contents. The third level is generally a vector or table, and it seems easy to me to add an optional last element (item or column) organized as follows: ontology::term. If this element is empty, which is easy to check, make_eml() could ignore it. In MetaShARK, adding an annotation module to each EAL step would be a minor update.

This all sounds great @earnaud! We can add a check within make_eml() to handle empty/non-empty elements, or write some adapter code to translate these contents to an annotations.txt template.

A clarifying question: Will the single annotations.txt template work with MetaShARK? It sounds like it could but want to make sure.

clnsmth commented 4 years ago

The annotations.txt template would be a long table with the fields:

- id: Identifier of the subject element that is unique within the scope of the EML record.
- context: Contextual information to help the user understand which element they are annotating.
- predicate_label: The human understandable predicate label (e.g. "is about").
- predicate_uri: The machine dereferenceable predicate URI (e.g. "http://purl.obolibrary.org/obo/IAO_0000136").
- object_label: The human understandable object label (e.g. "Mammalia").
- object_uri: The machine dereferenceable object URI (e.g. "http://purl.obolibrary.org/obo/NCBITaxon_40674").
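As a sketch of what such a long table might look like in practice, the base R snippet below builds two rows with the six fields, flags the row whose object fields are still empty, and round-trips the table through a tab-delimited file. The id and context values are invented for illustration, and the tab-delimited layout is an assumption; only the column names and the example labels and URIs come from the list above.

```r
# Two rows for the same (hypothetical) subject element: one complete, one awaiting an object
annotations <- data.frame(
  id = c("a1b2", "a1b2"),
  context = c("dataset", "dataset"),
  predicate_label = c("is about", "is about"),
  predicate_uri = rep("http://purl.obolibrary.org/obo/IAO_0000136", 2),
  object_label = c("Mammalia", NA),
  object_uri = c("http://purl.obolibrary.org/obo/NCBITaxon_40674", NA),
  stringsAsFactors = FALSE
)

# A row is a complete annotation only when both object fields are filled in
is_complete <- !is.na(annotations$object_label) & !is.na(annotations$object_uri)

# Write to a temporary tab-delimited file and read it back
out <- file.path(tempdir(), "annotations.txt")
write.table(annotations, out, sep = "\t", row.names = FALSE, quote = FALSE, na = "")
round_trip <- read.delim(out, stringsAsFactors = FALSE, na.strings = "")
```

Adding a second annotation to the same subject is then just a matter of repeating the id and context on a new row.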

mobb commented 4 years ago

Before you go much further in planning code for EMLassemblyline, it would be a good idea to see what others have learned about annotating EML and to plan which use cases EMLassemblyline should support. The work I know about is by the Arctic Data Center (https://arcticdata.io/), which has annotated 100s of datasets. See this webinar for an overview: https://www.dataone.org/webinars/fair%E2%80%99er-data-through-semantics-nsf%E2%80%99s-dataone-and-arctic-data-center

The materials (code, spreadsheets, process notes) for this are currently in a private git repo; I am working with the owners to get it more public.

earnaud commented 4 years ago

Hi, regarding data package releases, I must highlight the fantastic work of one of our temporary staff on the French National Biodiversity Data Hub (see https://openstack-192-168-100-101.genouest.org/metacatui, in French), who described 60+ datasets with EMLassemblyline. Also, I do not see any problem with planning further code for EAL, since its functions are up to date with current EML 2.2.0 features (at this point). The only problem is support for non-tabular files (and we're thinking about it: https://github.com/pole-national-donnees-biodiversite/OB1.metadata)

I apologize in advance that these links are in French.

clnsmth commented 4 years ago

Agreed @mobb. The base level implementation will only focus on annotating new EML created in the EMLassemblyline process. Annotating existing EML will definitely be informed by the expertise and experience of the Arctic Data Center.

Agreed @earnaud. We'll continue implementing annotation within EMLassemblyline and work on the problem of non-tabular files. And ... Google Translate does a fantastic job of bridging the languages used by our respective communities : )

clnsmth commented 4 years ago

OK folks, a working version of this enhancement is available on branch fix_31b. Comments are welcome and much appreciated!

remotes::install_github("EDIorg/EMLassemblyline", ref = "fix_31b")

The implementation supports two use cases:

New EML ... created by the EMLassemblyline workflow

1. Complete all metadata templates for your dataset (as usual).
2. Run template_annotations() to create the annotations template.
   - The annotations template (annotations.txt) reports the annotatable elements within your metadata and assigns default predicate annotations. You’ll have to add object annotations from ontologies of your choosing. You can remove annotations by deleting rows and add annotations by copying a subject's row, pasting it to a new line, then modifying the object annotation fields.
   - Default annotations can be changed by the user.
   - Instructions for creating annotations.txt from scratch are included in the function docs (for users gathering annotations in other ways).
   - Recurring nodes (e.g. ResponsibleParty) only require one set of annotations within annotations.txt.
3. Run make_eml().

Old EML ... created in other ways

1. Run template_annotations() for your EML file.
2. Run annotate_eml() to get an annotated revision of your EML file.

Note: All annotated elements are assigned ids and their annotations are placed both immediately under the parent element (subject) and within the /eml/annotations node through id+reference pairs. This redundant approach supports variation in where EML metadata consumers prefer to harvest this information and supports annotation of EML elements requiring id+reference pairs.
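A minimal sketch of the two placements described in the note, built as plain strings in base R. The element names (annotation, propertyURI, valueURI) and the references attribute follow the EML 2.2 schema; the helper name build_annotation and the id dataset-01 are hypothetical, and no XML escaping is done, so this is an illustration and not what annotate_eml() does internally.

```r
# Build an <annotation> element as a string.
# `references` is only set when the annotation lives under /eml/annotations
# and has to point back at the id of its subject element.
build_annotation <- function(predicate_label, predicate_uri,
                             object_label, object_uri, references = NULL) {
  open_tag <- if (is.null(references)) {
    "<annotation>"
  } else {
    sprintf('<annotation references="%s">', references)
  }
  paste(
    open_tag,
    sprintf('  <propertyURI label="%s">%s</propertyURI>', predicate_label, predicate_uri),
    sprintf('  <valueURI label="%s">%s</valueURI>', object_label, object_uri),
    "</annotation>",
    sep = "\n"
  )
}

# Placement 1: directly under the subject element (which must carry an id)
inline <- build_annotation(
  "is about", "http://purl.obolibrary.org/obo/IAO_0000136",
  "Mammalia", "http://purl.obolibrary.org/obo/NCBITaxon_40674"
)

# Placement 2: under /eml/annotations, linked to the subject by its id
by_reference <- build_annotation(
  "is about", "http://purl.obolibrary.org/obo/IAO_0000136",
  "Mammalia", "http://purl.obolibrary.org/obo/NCBITaxon_40674",
  references = "dataset-01"
)
```

The same triple appears in both strings; only the opening tag differs, which is what makes the redundant approach cheap to generate.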

To do:

Validate user supplied predicate and object URIs listed in annotations.txt to ensure the annotations are resolvable. Implement this in validate_templates().
Update vignettes
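As a first step toward the URI validation item, the sketch below only checks that the URI fields look like http(s) URIs. It does not test whether they resolve, which would need an HTTP request and is left out here. The function names looks_like_uri and find_bad_rows are hypothetical and are not part of validate_templates().

```r
# TRUE when a value is a non-missing string shaped like an http(s) URI
looks_like_uri <- function(x) {
  !is.na(x) & grepl("^https?://[^[:space:]]+$", x)
}

# Return the row numbers whose predicate or object URI fails the syntactic check
find_bad_rows <- function(annotations) {
  ok <- looks_like_uri(annotations$predicate_uri) &
    looks_like_uri(annotations$object_uri)
  which(!ok)
}

# Illustrative table: row 1 is fine, row 2 has a label where a URI belongs,
# row 3 is missing its predicate URI
example <- data.frame(
  predicate_uri = c("http://purl.obolibrary.org/obo/IAO_0000136",
                    "http://purl.obolibrary.org/obo/IAO_0000136",
                    NA),
  object_uri = c("http://purl.obolibrary.org/obo/NCBITaxon_40674",
                 "Mammalia",
                 "http://purl.obolibrary.org/obo/NCBITaxon_40674"),
  stringsAsFactors = FALSE
)
bad <- find_bad_rows(example)
```

A syntactic pass like this is cheap enough to run on every template read, with the slower resolvability check reserved for rows that pass it.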

clnsmth commented 4 years ago

Extend annotation support to:

/eml/dataset/coverage/geographicCoverage
/eml/dataset/coverage/taxonomicCoverage

earnaud commented 3 years ago

Hi Colin,

I cannot find the annotation-related functions in EAL v2.5.0. Is this expected?

EDIT

There seems to be a problem with the export of this function. Even on the "development" branch, although I can access the documentation for template_annotations(), I get this output:

> ?EMLassemblyline::template_annotations
Warning messages:
1: In mget(objectNames, envir = ns, inherits = TRUE) :
  internal error -3 in R_decompress1
2: In mget(objectNames, envir = ns, inherits = TRUE) :
  restarting interrupted promise evaluation
3: In mget(objectNames, envir = ns, inherits = TRUE) :
  internal error -3 in R_decompress1

Similarly:

> EMLassemblyline::template_annotations(path = "~/dataPackagesOutput/emlAssemblyLine/bdd_kalila_emldp/bdd_kalila/metadata_templates/", data.path = "~/dataPackagesOutput/emlAssemblyLine/bdd_kalila_emldp/bdd_kalila/data_objects/", data.table = dir("~/dataPackagesOutput/emlAssemblyLine/bdd_kalila_emldp/bdd_kalila/data_objects/"))
Error: 'template_annotations' is not an exported object from 'namespace:EMLassemblyline'

clnsmth commented 3 years ago

It's in the development branch, which is currently at v 2.19.0. Support for annotations didn't exist at 2.5.0.
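For anyone comparing version strings like these, base R's numeric_version() compares each component as a number, so the ordering can be checked directly. The variable names below are just for illustration; utils::packageVersion("EMLassemblyline") would return the installed version in the same comparable form.

```r
# Each dot-separated component is compared numerically, not as text
dev_is_newer <- numeric_version("2.19.0") > numeric_version("2.5.0")
misread_is_newer <- numeric_version("2.1.9") > numeric_version("2.5.0")
```

Here dev_is_newer is TRUE and misread_is_newer is FALSE, since 19 > 5 but 1 < 5 in the second component.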

earnaud commented 3 years ago

Whoa... I managed to confuse "2.19.0" with "2.1.9". Sorry about that :/

clnsmth commented 3 years ago

Is template_annotations() working for you @earnaud?

earnaud commented 3 years ago

It worked in command-line mode. I will let you know once I have tried it within a MetaShARK workflow.

clnsmth commented 3 years ago

Extend support to /eml/dataset/coverage/geographicCoverage

mobb commented 3 years ago

Note on annotation: The EML parser requires that the parent element of an annotation have an id attribute. E.g., if you include this path: /eml:eml/dataset/annotation, you must also include /eml:eml/dataset/@id

Not sure yet what that attribute's content should be (<dataset id="">). To me, what makes the most sense is that it is the same as the packageId, e.g. <dataset system="https://pasta.edirepository.org" id="edi.437.2">. However, if the dataset id and packageId are the same string, it violates another requirement of the EML parser. Opened up a Slack chat on this today.

Follow-up after the Slack discussion: this might be better: <dataset system="https://pasta.edirepository.org" id="edi.437">. It's not a good idea to relax the parser (will not go into reasons here).

This comment was added because the EAL code for annotations will need to add that parent element ID, anywhere there is an annotation.
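The follow-up suggestion can be sketched in base R by dropping the trailing revision number from the packageId, assuming the scope.identifier.revision form used in the example above. The variable names are hypothetical, and this is not how EMLassemblyline assigns ids.

```r
# packageId from the example above
package_id <- "edi.437.2"

# Strip the final ".<revision>" so the dataset id differs from the packageId
dataset_id <- sub("\\.[0-9]+$", "", package_id)

# Opening tag carrying the derived id
dataset_tag <- sprintf(
  '<dataset system="https://pasta.edirepository.org" id="%s">',
  dataset_id
)
```

Any id that is unique within the EML record and not equal to the packageId would satisfy the two parser requirements mentioned; this one just keeps a visible link to the package.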

yvanlebras commented 3 years ago

Thank you Margaret! Amazing, as I was working yesterday on adding annotations to existing data packages from our repository and was thinking about exactly the same kind of thing concerning adding an annotation to a dataset element... I have to continue my tests to better understand the implementation of such annotations, but your point is very important...

yvanlebras commented 3 years ago

OK, so I finally have a successful test EML with one annotation tag at the attribute level plus one annotation tag at the dataset level. It was not so easy to figure out that, if I understood correctly, you need to add an id to an annotation tag if there is at least one existing annotation tag somewhere in the EML... And this is in addition to the fact that you need an id on the "parent" element of an annotation tag. Here, https://data.test.pndb.fr/view/urn:uuid:c35a5384-331c-4776-be69-e30263851bdf , I "just" added an arbitrarily chosen id for the dataset tag, dataset-01, not related to the packageId.

clnsmth commented 3 years ago

template_annotations() creates unique ids for use within a dataset and annotate_eml() adds them to the EML. You could modify the ids listed in the annotations.txt template, but it will work fine if you don't.

clnsmth commented 2 years ago

This feature is implemented and available in the master branch.

yvanlebras commented 2 years ago

Amaaaazing! Thank you so much Colin! Looking forward to seeing the MetaShARK implementation @earnaud ;)

EDIorg / EMLassemblyline

Semantic annotation #31