RDF is used by the knowledge graph community, and in my (limited) experience some practitioners have an almost religious opposition to UML datatypes, despite the fact that the explicit purpose of RDF collections and JSON-LD value objects is to define datatypes (structures that are NOT graph nodes).
Protege is also a graph explorer.
Since XSD and JSON Schema and CDDL and SQL are all schema languages, can you provide pointers to projects or tools that use RDF to define datatypes (which are inherently DAGs that can optionally define directed but cyclic references), not undirected knowledge graphs?
For "RDF collections and JSON-LD value objects", that depends on your definition of a graph node.
I haven't encountered any opposition to datatype definitions. RDF uses many XSD datatypes as primitives by default and explicitly defines a mechanism to create complex user-defined datatypes within OWL.
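For instance, a minimal Turtle sketch (the ex: names are hypothetical) of XSD primitives used as literal datatypes on ordinary RDF triples:

```turtle
@prefix ex:  <http://example.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# XSD datatypes appear directly as typed literals in RDF.
ex:meeting1 ex:memberCount "7"^^xsd:integer ;
            ex:scheduledOn "2023-05-01"^^xsd:date ;
            ex:isRecurring "true"^^xsd:boolean .
```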
All Semantic Web / RDF graphs are directed knowledge graphs, so I'm puzzled by your reference to undirected graphs. In fact, that is why OWL embodies property axioms such as Transitive, Symmetric, Asymmetric, Functional, and InverseFunctional: to provide inferencing control over directional needs.
As for tools, Protege allows for such definitions, as does any RDF-based ontology tool. All Semantic Web coursework I've reviewed explicitly covers the development of user-defined datatypes. From "A Semantic Web Primer", I quote:
> Strictly speaking, the use of any externally defined datatyping scheme is allowed in RDF documents, but in practice, the most widely used datatyping scheme will be the one by XML Schema. XML Schema predefines a large range of datatypes, including booleans, integers and floating point numbers, times and dates.
So I'm not entirely sure why you have such a perception.
There are two concepts of type: object types and literal datatypes. We may be talking past each other, so I hope to clarify. For object types, you are free to describe classes and class restrictions to refine your concept of a class, which may contain many sub-elements...for example, a class that describes a meeting that must have 5 to 10 members, 2 to 3 of which must be doctors (belong to the doctor class), 2 to 6 of which must be patients, and 1 of whom must be a mediator. Things of this nature are easily definable and should be for your use case. For literal datatypes, you are free to define content definitions for those strings using owl:withRestrictions and xsd:pattern. For example, see https://lists.w3.org/Archives/Public/semantic-web/2021Mar/0069.html Again, you are free to do so for your use case. Anybody who says otherwise is a raging idiot.
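To make that concrete, here is a minimal Turtle sketch (the ex: names are hypothetical) of both kinds of definition:

```turtle
@prefix ex:   <http://example.org/ns#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# A user-defined literal datatype: a string constrained by a regex facet.
ex:ZipCode a rdfs:Datatype ;
    owl:equivalentClass [
        a rdfs:Datatype ;
        owl:onDatatype xsd:string ;
        owl:withRestrictions ( [ xsd:pattern "[0-9]{5}(-[0-9]{4})?" ] )
    ] .

# An object-type definition: a Meeting with 5 to 10 members,
# at least 2 of whom must be Doctors.
ex:Meeting a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ; owl:onProperty ex:hasMember ;
          owl:minCardinality "5"^^xsd:nonNegativeInteger ],
        [ a owl:Restriction ; owl:onProperty ex:hasMember ;
          owl:maxCardinality "10"^^xsd:nonNegativeInteger ],
        [ a owl:Restriction ; owl:onProperty ex:hasMember ;
          owl:onClass ex:Doctor ;
          owl:minQualifiedCardinality "2"^^xsd:nonNegativeInteger ] .
```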
Also, the TypeDB article is a "reinvention of the wheel" already present in the Semantic Web standard. I'm not entirely sure why the author makes such statements about table structures and binary edge limitations and then proceeds to demonstrate TypeDB solutions using binary edge examples. I find that extremely comical.
The entire discussion on TypeDB is only a partial solution that duplicates much of the Semantic Web standard that has existed for years. You can take any of the examples and translate them into an RDF solution and get better answers because of, for example, namespaces.
See the following three references for more info: https://www.ontotext.com/knowledge-hub/ https://cambridgesemantics.com/blog/semantic-university/ https://allegrograph.com/article/news-and-events/conferences-and-seminars/ Yes, they are commercial tools and competitors but, since they all use the same standard, they are interchangeable. There are many open source solutions as well, for example Jena, Fuseki, RDF4J, and RDFLib. So you are not locked into one company's or organization's idea of what the solution looks like, as they all use a common (and only) standard for knowledge graphs.
Thanks for the references. The XML Schema quote lists only primitives (booleans, integers and floating point numbers) and scalars (times and dates). What it doesn't list is collections (maps and arrays) with UML collection properties (unique and ordered).
"All Semantic Web / RDF graphs are directed knowledge graphs" - yes, the edges have directed names, but the relationships are symmetric - if a car is composed of wheels then wheels are components of cars. In data structures you make a choice - a car has an array of components property, or components (wheel) have an array of cars property. You can take an "undirected" RDF graph, pick any node and drag it to the top (make it a root), and all other connected nodes will hang from it in a data structure DAG. Of course some roots are much more practical, but the RDF graph doesn't identify them, nor does it distinguish values from references. Data structures must be acyclic, but if cars have an array of wheels, then wheels can have references back to cars, or vice versa. Long cycles are harder to spot but just as toxic, so they must be broken with references before data can be serialized.
Collections are an integral part of RDF. There are general containers (rdfs:Container), such as rdf:Seq, rdf:Bag, and rdf:Alt, so unique and ordered are covered...and you can define your own. It also does actual collections (linked lists) using rdf:List, rdf:first, rdf:rest, and rdf:nil; see the sketch below. This is integral to the OWL 2 standard, as it uses lists extensively to define ontologies.
See https://www.w3.org/TR/rdf-schema/
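For illustration, a small Turtle sketch (the ex: names are hypothetical) of both container styles:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/ns#> .

# An ordered container: rdf:Seq numbers its members.
ex:playlist a rdf:Seq ;
    rdf:_1 ex:trackA ;
    rdf:_2 ex:trackB .

# A closed, ordered collection: an rdf:List (linked list).
# Turtle's ( ... ) syntax expands to rdf:first / rdf:rest / rdf:nil.
ex:car1 ex:hasComponents ( ex:wheel1 ex:wheel2 ex:wheel3 ex:wheel4 ) .
```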
Everything you state about RDF is in error. No, the edges are not symmetric unless you define them so in your ontology...it's up to you. And, no, RDF does not require a DAG and never has. Data structures are not required to be DAGs. Ontology definition is by its very nature not a DAG. And, yes, it does distinguish between a reference (a resource) and a value (a literal).
Also, you don't pick nodes and drag them to the top. Nodes are just nodes, and you may organize them however you like. The query mechanism, SPARQL, requires that you use the directed graph connections you specify in the data the way you define them. If you have defined inferred symmetry, then you can query as such.
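As a minimal Turtle sketch of that point (the ex: names are hypothetical): edges are one-way unless the ontology declares an inverse or symmetry, and a reasoner materializes the rest.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/ns#> .

# Direction is whatever the ontology declares.
ex:hasComponent owl:inverseOf ex:componentOf .   # car -> wheel, and back
ex:adjacentTo a owl:SymmetricProperty .          # symmetric only if declared

ex:car1 ex:hasComponent ex:wheel1 .
# A reasoner may then infer: ex:wheel1 ex:componentOf ex:car1 .
```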
I'm not sure where you got your information but it is patently wrong on every point.
Your concept of data is patently wrong. Show an example of JSON objects and arrays where Node A contains Node B and Node B contains Node A.
JSON is NOT RDF!
Duh. But since NIEM is about data, and JSON and XML and SQL and CBOR and Protobuf and Avro and SBE are data, you are commenting on the wrong repo. https://www.niem.gov/sites/default/files/inline-images/Metamodel-Screenshot6.jpg
You are confusing a transport format with the actual underlying data representation...a common misconception. "JSON and XML and SQL and CBOR and Protobuf and Avro and SBE" are file formats that contain data. For instance, while XML and JSON are tree-structured formats, the contained RDF data using either format is, in general, not a tree structure. The format may be a DAG; the data is not. The contained graph data is simply expressed using these formats to transmit the graph. It is then loaded into a system supporting that graph standard, which applies the standard's methodologies...as mentioned previously.
Then, the contained graph is not the document structure. For instance, JSON-LD alone can represent the same RDF graph in about 6 different ways. Each file format can represent the exact same graph in different ways depending on the producer. Surely you are not suggesting a metamodel wholly depends on a file format's structure. There is nothing meta about that approach. Are you saying NIEM is about file formats?
Clearly, this is an open issue...accidentally closed.
https://www.sciencedirect.com/topics/computer-science/information-engineering includes: "Information Engineering recognized that data has inherent properties of its own, independent of how it is used.", and ER design methodology includes conceptual, logical and physical data models. One form of physical data is the "document", a sequence of bytes, and much communication makes use of PDF documents, web (HTML) documents, image (PNG, JPG, ...) documents, and "messages" (documents or "protocol data units" (PDUs) that are exchanged between systems).
Since information assurance (my field) requires data integrity (the ability to compute a hash or signature over an instance of a datatype), I'm going with https://www.w3.org/TR/xmlschema11-1/, whose purpose is "describing the structure and constraining the contents of XML documents".
Yes, there is conceptual and logical data. I'm referring to physical data, protocol data units, documents, or messages - things that can be hashed and signed. My position is that the NIEM CMF needs to support data integrity, which implies that it must be possible to classify any sequence of bytes as either a valid instance or not of a CMF datatype. All such byte sequences, regardless of data format, can then be checked for integrity as well as validity.
And referring to my first comment on religious objections to using RDF collections or JSON-LD value objects to define datatypes/messages/documents, I'd love to be shown real examples of how they are actually used that way. They exist because documents (datatypes that are not part of any graph) obviously exist, but in my experience practitioners eschew them.
"the contained RDF data using either [JSON or XML] format is [a graph]"
My point is that documents in general do not contain RDF data. An Image datatype, for example, has metadata (timestamp, exposure information, geotag, etc) and data (a 2d array of pixels with dimensions width and height). The geotag has latitude, longitude and geodetic datum.
There is no RDF data in an Image document, there is just the information defined by the datatype. Image contains Metadata contains Geotag contains Latitude - that's a directed graph. An RDF collection (first, rest, nil) can define Image, but show me where anyone has actually defined Image using an RDF collection. JSON schema and XSD are routinely used to define datatypes like Image.
RDF people usually say "image is a picture of the Mona Lisa by da Vinci on display at the Louvre", but they don't define what the value of pixel [1428, 824] is, and they can't use an Image document in RDF format to display the Mona Lisa on your screen.
I completely agree with the data integrity issue for instances of a data model / format. On review, we are evidently discussing two very different issues. So, apologies. We were indeed speaking past each other. "NIEM-Metamodel" means something different to me. I've broken this down into three areas:
My Issue: My concern is about an effective metamodel for NIEM itself--a NIEM Metamodel...the ontology of NIEM. A NIEM ontology can be used directly in an RDF store to manage NIEM data--format agnostic--and transform it in any way desired. Queries on such a store don't place any emphasis on the format of the data per se; you just get a result set. Results could be keyed to document node types that could serve as roots for a specific output conversion. Since the data stores are designed to include federated and public services, you don't strictly need an elaborate file schema to represent the data...any result a user is interested in will suffice.
For an example ontology, see: https://github.com/oasis-tcs/tac-ontology While this is more about the substructures, the idea of the formatted structure is bound in a STIX bundle. Things found in NIEM data can relate to things found in STIX data...there are ontological equivalences. Then, each may enhance and support the other.
Your Issue: As I read it, you're concerned about how a metamodel may represent specific transport formats and then use it to validate such a file--a parsing problem. As a general-purpose modeling system, an RDF-based ontology can do this. It can describe all the components of a file format: all the fields of a file, its descriptors, byte sizes, variable lengths, whatever sub-components, and create a composite class containing those items and its structure as "restrictions". Additionally, you can describe a mechanism for verification and validation for such a format. It is no different from describing any other class--an abstraction for a structure. You can use that ontology to project concrete instances from queries on the store. In fact, you would likely design this as a set of queries used to produce bespoke, compliant documents of whatever format with an exporter. An importer could do the same.
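A minimal Turtle sketch of that idea (all ex: names are hypothetical, not an established vocabulary):

```turtle
@prefix ex:   <http://example.org/fmt#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Describe the fields of a binary image header as a composite class.
ex:ImageHeader a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ; owl:onProperty ex:hasField ;
          owl:minCardinality "2"^^xsd:nonNegativeInteger ] .

ex:widthField a ex:Field ;
    ex:byteOffset "16"^^xsd:unsignedInt ;
    ex:byteLength "4"^^xsd:unsignedInt ;
    ex:fieldDatatype xsd:unsignedInt .

ex:heightField a ex:Field ;
    ex:byteOffset "20"^^xsd:unsignedInt ;
    ex:byteLength "4"^^xsd:unsignedInt ;
    ex:fieldDatatype xsd:unsignedInt .
```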
Alternatively, or in addition, you could store a native file schema as a blob or string (an XSD or JSON Schema) and apply it as needed using the native tools, and even store the configuration for how that is done. In this way, RDF augments and is augmented by native tools.
I think this is where you get that "religious" resistance to doing such a thing. They are likely saying, "Just get the data directly and skip the whole file format indirection...structure is embedded in the graph...more importantly, is the data actually trustworthy? Anyone can create any physical format using whatever and no one is required to use it."
Admittedly, storing a file with its structure in RDF format is not efficient or practical for a number of reasons. Duplicating the value of pixel [1428, 824] in RDF is redundant if the file is maintained somewhere accessible. A process can get the value of pixel [1428, 824] using an RDF store to direct and process a known image location. Storing the file as a blob (base64)...maybe, if an external file location is unreliable. Definitely store metadata not readily available from the file directly, such as MD5 or SHA256 hashes, etc.; see the sketch below. Just as it is not practical to compile software directly from UML. However, use cases vary. Use as needed.
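A small Turtle sketch of that metadata pattern (the ex: names and the digest value are hypothetical placeholders):

```turtle
@prefix ex:  <http://example.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Point at the file and keep integrity metadata in the graph.
ex:image42
    ex:fileLocation "https://files.example.org/image42.png"^^xsd:anyURI ;
    ex:sha256       "d2b2...0f0a" ;   # placeholder digest of the file bytes
    ex:mediaType    "image/png" ;
    ex:validatedBy  ex:imageJsonSchema .

# Optionally embed the native schema itself as a literal blob.
ex:imageJsonSchema ex:schemaText """{ "type": "object", ... }""" .
```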
Joint Issue: For representing NIEM data in an RDF graph format and validating the structured components (sub-graphs) in a graph, you use SHACL: https://www.w3.org/TR/shacl/ The examples there should give you a good grounding for the validation of RDF data structure shape and content. This is generally used on ingestion to validate that data going into the store conforms to the desired structures.
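For a flavor of SHACL, a minimal shape in Turtle (the ex: names are hypothetical), reusing your Geotag example:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <http://example.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Every ex:Geotag must have exactly one in-range decimal latitude.
ex:GeotagShape a sh:NodeShape ;
    sh:targetClass ex:Geotag ;
    sh:property [
        sh:path ex:latitude ;
        sh:datatype xsd:decimal ;
        sh:minInclusive -90.0 ;
        sh:maxInclusive 90.0 ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
```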
While searching on these issues, I came across this interesting OWL-to-UML paper: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.7742&rep=rep1&type=pdf
OWL-based RDF. Use a tool like Protege (from Stanford University) to prove it. All other (meta)model formats can be generated from it.
Isn't it about time that NIEM lives up to its byline "...a community-driven, standards-based approach to exchanging information" by joining a community that actually does...linked open data?