dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Add support for document-level key-value metadata #1156

Open reckart opened 7 years ago

reckart commented 7 years ago

Add support for document-level key-value metadata. I imagine something like this:

=== Variant 1

// Simplest option only allowing String key-value pairs

MetaDataEntry extends Annotation {
  String: key
  String: value
}

=== Variant 2

// Option allowing only basic typed key-value pairs, with all values represented as strings
// The type would be set only if the value is not a string, e.g. to `int` or `bool`

MetaDataEntry extends Annotation  {
  String: key
  String: value
  String: type
}
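
To make the Variant 2 encoding concrete, here is a minimal plain-Java sketch of how a consumer might decode such an entry. It is not UIMA code: `TypedEntrySketch` and `decode()` are hypothetical names, and only the `int` and `bool` labels mentioned above are handled. A `null` type is assumed to mean "plain string".

```java
import java.util.Objects;

/** Plain-Java stand-in for the Variant 2 entry; hypothetical, not a UIMA type. */
final class TypedEntrySketch {
    final String key;
    final String value;
    final String type; // assumption: null means the value is a plain string

    TypedEntrySketch(String key, String value, String type) {
        this.key = Objects.requireNonNull(key, "key");
        this.value = value;
        this.type = type;
    }

    /** Decodes the string-encoded value according to the type label. */
    Object decode() {
        if (type == null) {
            return value;
        }
        switch (type) {
            case "int":
                return Integer.valueOf(value);
            case "bool":
                return Boolean.valueOf(value);
            default:
                // Unknown labels are rejected here; a real consumer might fall back to the raw string.
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }
}
```

The trade-off this illustrates: the type system stays tiny, but every reader has to agree on the set of type labels and on how each one is parsed.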

=== Variant 3

// Rather have everything in one FS; either value or ref would be set, but not both
// If ref is set, then values would be retrieved from the linked FS (key-values again)

MetaDataEntry extends Annotation  {
  String: key
  String: value
  FeatureStructure: ref
  String: type
}
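
The "either value or ref, but not both" rule of Variant 3 would have to be enforced by convention, since the type system itself cannot express it. A small plain-Java sketch of such a check follows. `RefEntrySketch` and `isWellFormed()` are hypothetical names, the linked FS is modeled as a plain list of further entries, the `type` feature is left out for brevity, and the rule is read here as "exactly one of the two is set".

```java
import java.util.List;

/** Plain-Java stand-in for the Variant 3 entry; hypothetical, not a UIMA type. */
final class RefEntrySketch {
    final String key;
    final String value;             // set for primitive entries
    final List<RefEntrySketch> ref; // stand-in for the linked FS holding further key-values

    RefEntrySketch(String key, String value, List<RefEntrySketch> ref) {
        this.key = key;
        this.value = value;
        this.ref = ref;
    }

    /** True if exactly one of value and ref is set (assumed reading of the rule). */
    boolean isWellFormed() {
        return (value != null) != (ref != null);
    }
}
```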

=== Variant 4

// Full support for all kinds of structures, even nested entries - basically "schemaless"

MetaDataEntry extends Annotation {
  String: key
}

PrimitiveMetaDataEntry extends MetaDataEntry  {
  String: value
  String: type
}

MetaDataEntryGroup extends MetaDataEntry  {
  MetaDataEntry[]: items
}
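
The nested structure of Variant 4 can be sketched in plain Java as follows. This is only an illustration of the shape, not UIMA code: the class names mirror the proposal, while `NestedMetaDataSketch`, `flatten()`, and the dotted-path convention are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the Variant 4 hierarchy; hypothetical, not UIMA types. */
final class NestedMetaDataSketch {
    static abstract class MetaDataEntry {
        final String key;
        MetaDataEntry(String key) { this.key = key; }
    }

    static final class PrimitiveMetaDataEntry extends MetaDataEntry {
        final String value;
        final String type;
        PrimitiveMetaDataEntry(String key, String value, String type) {
            super(key);
            this.value = value;
            this.type = type;
        }
    }

    static final class MetaDataEntryGroup extends MetaDataEntry {
        final List<MetaDataEntry> items = new ArrayList<>();
        MetaDataEntryGroup(String key, MetaDataEntry... items) {
            super(key);
            this.items.addAll(Arrays.asList(items));
        }
    }

    /** Flattens a tree into dotted paths (an assumed convention), keeping insertion order. */
    static Map<String, String> flatten(MetaDataEntry root) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(MetaDataEntry entry, String prefix, Map<String, String> out) {
        String path = prefix.isEmpty() ? entry.key : prefix + "." + entry.key;
        if (entry instanceof PrimitiveMetaDataEntry) {
            out.put(path, ((PrimitiveMetaDataEntry) entry).value);
        } else if (entry instanceof MetaDataEntryGroup) {
            for (MetaDataEntry child : ((MetaDataEntryGroup) entry).items) {
                collect(child, path, out);
            }
        }
    }

    public static void main(String[] args) {
        MetaDataEntry root = new MetaDataEntryGroup("doc",
                new PrimitiveMetaDataEntry("title", "Example", null),
                new MetaDataEntryGroup("source",
                        new PrimitiveMetaDataEntry("year", "2018", "int")));
        System.out.println(flatten(root)); // prints {doc.title=Example, doc.source.year=2018}
    }
}
```

The recursion in `collect` is the cost of going "schemaless": consumers can no longer read a flat list and have to walk the tree to find a value.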

Instead of adding each MetaDataEntry to a view, the entries could be added to a list of MetaDataEntry on DocumentMetaData:

DocumentMetaData extends DocumentAnnotation {
   // ... all the stuff we already have in DocumentMetaData ...
   MetaDataEntry[]: entries
}

An alternative to extending Annotation would be to extend TOP and to add the entries only to DocumentMetaData, not to the CAS view directly. The MetaDataEntry could then not be retrieved via the annotation index or via offsets, but the offsets are expected to always cover the whole document anyway. Whole-document offsets are in fact the weak point of extending Annotation: if the entries are added before the text is materialized, the respective code has to know that all MetaDataEntry annotations must be updated in the end to match the materialized text. UIMA handles this automatically for us for the DocumentAnnotation.

reckart commented 7 years ago

@jgrivolla

reckart commented 7 years ago

At the moment, I kind of tend towards: