Closed: jakebeal closed this issue 3 years ago
I'm afraid I don't know how SBOL has approached this.
I don't believe that SBOL is a good parallel, anyway. A data repository for information about designs is a relatively good example of a case where the open world makes sense: you want the information that's available and you don't necessarily want to get nothing about a design (or other entity) just because the person supplying that information doesn't have all the information.
On the other hand, when you are specifying a protocol or, even more, an API, you have quite strong preferences about what has to be there, and since this will lead to execution, the trade-off is different -- you're probably better off with no protocol than a protocol that could do something weird and potentially damage your lab equipment.
Let me suggest that we may want to think about this not in terms of strictly open / strictly closed, but in terms of some intermediate (and readily checkable) levels of closure.
Here is how I am thinking about this, from most open to least open, in terms of OPIL (PAML will be a bit more complex, but follows similar principles):

1. Entirely open world: given an `ExecutionRequest`, it's valid to freely mix in some other triples specifying another fragment of the `ExecutionRequest`, plus some shreds of the `ProtocolInterface` that it makes use of. Highly undesirable, for the reasons that you specify.
2. If I have an `ExecutionRequest`, then I assume that I have all of the information about that `ExecutionRequest`, plus all of its associated `ParameterValue`, `SampleSet`, and `Measurement` objects, their children, etc. I might not have the `ProtocolInterface` to which it refers, and I know nothing about whether other `ProtocolInterface` or `ExecutionRequest` objects exist. This lets me reason safely about the `ExecutionRequest`, but it's possible the `ProtocolInterface` will change on me.
3. I have the `ProtocolInterface` as well, and all of its children. Now the `ProtocolInterface` is guaranteed to be known, but it always has to be carried around redundantly, and we can still have a conflict when the lab sees that the `ProtocolInterface` that it has been sent isn't identical to the copy that it is holding internally.
4. Entirely closed world: if I don't see an `ExecutionRequest` or `ProtocolInterface`, then it doesn't exist. This seems overly strict, and I'm not sure of the actual value over level 3.

Currently, OPIL assumes level 2 closure, which is what SBOL uses as well. SBOL previously assumed level 3 closure, but weakened to level 2 once the things that people were talking about started to become large. Since OPIL is pretty small, I think we could readily do either level 2 or level 3, but I think the benefits of level 3 can also be achieved by other more lightweight methods like version UIDs. Level 3 is also likely to pose problems when moving from OPIL to PAML, because it prohibits dynamic binding in an execution environment, which is a major benefit that we are looking for.
This makes sense to me, except what does it mean to say "I assume I have got"? That is why we need a notion of "document" or other notion of "context." Closure must be relative to some context.
What is the protocol specification relative to which this assumption is to be made?
With such a notion, we would be able to make claims about completeness that could be evaluated as either true or false.
Note that this could be a relatively flexible notion of context -- the context could be defined simply as "a set of triples," or some notion that is equivalent to that.
Given that, we could capture the notion of closure explicitly, without relying on informal terminology. E.g., if we had a set of assertions, we could potentially even compute the closure, adding `maxCardinality` assertions where we previously had only `minCardinality`.
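As a toy sketch of that computation (the assertion encoding here is illustrative, not OWL or OPIL syntax), one can count the asserted values of a property per subject and emit matching min/max cardinality assertions:

```python
from collections import Counter

# Toy sketch of "computing the closure": for each subject, count the asserted
# values of a property and add matching min/max cardinality assertions, so a
# class-level minCardinality constraint is tightened to an exact count.
# The assertion encoding below is illustrative, not OWL or OPIL syntax.

def compute_cardinality_closure(triples, prop):
    counts = Counter(s for s, p, _ in triples if p == prop)
    closure = set()
    for subject, n in counts.items():
        closure.add((subject, "minCardinality/" + prop, n))
        closure.add((subject, "maxCardinality/" + prop, n))
    return closure

triples = {("proto", "foo", "a"), ("proto", "foo", "b")}
print(sorted(compute_cardinality_closure(triples, "foo")))
# [('proto', 'maxCardinality/foo', 2), ('proto', 'minCardinality/foo', 2)]
```

A two-valued `foo` on `proto` closes to min = max = 2, which still satisfies a class constraint such as min-cardinality 1 / max-cardinality 3 while pinning down the instance.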
We might be able to define a computable model/inferential closure for a set of OPIL assertions, too, which would impose a simple definition of consistency (a context would be labeled as inconsistent if no closure could be computed).
Note that such a notion of context might have to exclude statements. E.g., if a "cell line" was defined in terms of a "parent" relation, with a minimum cardinality constraint on the parent relation, such a definition would have to be excluded from the context, although mention of the cell line in the context could still be permitted, because together with reasonable additional conditions (the parent relation is antireflexive, e.g.) it would force any model to be infinite in size.
Actually, it occurs to me that we might be able to permit open world facts in the context, if we had a notion of cardinality that extended OWL's notion. But that would get complicated quickly, so probably not a good idea.
I'm not sure I'm following you on how you're thinking to compute the closure.
The notion of context that I know how to implement readily is the one based on TopLevel + children. In this approach, any context (file, serialization, object in memory, etc.) that contains a TopLevel definition is guaranteed to contain all information about that TopLevel and its children. Any other object that it references may or may not be accessible.
OK, so there are two issues I believe:
We can define a context in a way that is unambiguous as "A container holding RDF triples (or equivalent data structures) describing a TopLevel." Note that while this is unambiguous enough, it is also flexible enough to cover all of the possibilities in your examples above. This probably would require a little more elaboration to say what is excluded from a document -- presumably if you reference an entity in, e.g., OPIL, you don't want to define your document as including all the facts you know about that entity. Such a context can be either well-formed or ill-formed, depending on whether it meets OPIL's constraints.

I don't believe the notion of "TopLevel + children" satisfies the definition of context, because it doesn't say what that thing is. Some set of statements about a TopLevel + children could be incomplete, and could be added to in order to get something that is complete. It's only that "something" that can be said to be complete or not. OWL is a dialect of logic, so it defines a notion of consistency, but not a notion of document, or completeness or incompleteness of description. As I pointed out earlier, it is designed to accommodate sets of statements all of whose models are infinite, and so couldn't be complete in your sense. I feel like there is something about the notion of "document" that triggers resistance on your part, but I don't know what it is.
OWL is a conventional, monotonic logic. This means that it cannot accommodate closure constraints unaided. Any closure constraint is necessarily non-monotonic. Such non-monotonic constraints include negation by failure, and cardinality inferences like "the only property values are the ones that are in my database". There are at least two ways one could deal with this:
1. Stay within OWL, asserting the tightened cardinalities directly. Suppose a class constrains a property `foo` with min-cardinality 1 and max-cardinality 3. If you have 2 values for `foo` for an instance `proto`, you just assert them as before, but now you also assert `max-cardinality proto = min-cardinality proto = 2`, which satisfies the OWL class constraints and tightens them in a way that reflects the closure you want.
2. Go beyond OWL, defining relationships such as `opil:minCardinality` and `opil:maxCardinality` that entail the corresponding OWL relationships, but that are also the relationships to be considered in the document closure computation. Otherwise, a defined closure computation could have undesirable conclusions if you were to add other cardinality assertions to a document that you want to have conventional force (e.g., every strain has at least one DNA sequence).

I like the way that you are developing this.
On point 1, I resist the word "document" primarily because I've seen it lead to a lot of confusion in the past: a lot of people then start thinking we're talking about serialization formats and assuming very strong closure, up at level 4 ("entirely closed world"). If we called it something like "model" or "description", I'd be happy with that, or any other better suggestion.
I believe that "well-formed or ill-formed" fits well with the concept of "validation rules" that has been used with SBOL. Notably, SBOL formally distinguishes four classes of rules:
This approach for reasoning has been pretty successful for us in practice, whether or not it can be formalized in OWL. Thus, I think I'm leaning toward the beyond-OWL inference option, since I believe that's what we're on track to implement in the library.
@jakebeal I can see why you don't like the word "document." I don't really like "model," since it crashes into related concepts (like the model of the OWL statements). (It's models all the way down!) I guess we can't use "Protocol Specification," because OPIL covers both the protocol specification and its use. So we should use one term for the protocol specification, and one for a request (which will be relative to a protocol specification). If the protocol specification (or whatever we call it) is a meaningful unit, then it should have its own designator (probably an IRI), rather than using the URI of the `ProtocolInterface` entity. Ideally this would not be necessary, but it's better safe than sorry, and I note that OWL has a notion of "ontology" that is separate from any entity contained therein. That would potentially allow for extending the definition of well-formedness from a single `ProtocolInterface` to a set thereof that is grouped together into whatever this ontology-like OPIL entity is. So now we need three names. Stop me before I ontologize again!
The validation rules notion makes sense to me, too. I think the notion can be formalized, but only by adding a notion of a "document" (whatever name we come up with) to the OWL background.
OWL has rules that tell us what is and isn't entailed by a set of statements, but beyond syntactic limitations and inconsistency, it can't capture what one is and is not permitted to say. You want to specify both what one must say (e.g., for completeness) and what one must not say (things that might be legitimate, but that would damage the plans you have for processing OPIL over and above just gazing upon it and marveling).
As we're finalizing SBOL 3.0.1, I've added a section describing the closure assumptions and their implications for document structure. @rpgoldman, can you please take a look and see if you see anything that needs to be adjusted in the language? Once it's finalized and released, I'll pull it into the OPIL and PAML specs as well.
Note that I am by no means an expert in description logics (DLs), just someone who has banged up against them several times, and been forced to think about them harder than he would have liked!
What concerns me a little is that a lot of DL constructs, including those in OWL, have existential force. Of course the cardinality restrictions are the most obvious of those. So... is there any chance that such statements could get you into trouble when combined with your closure rule (negation by failure)? The most obvious case would involve incomplete knowledge. If you have the closure inference rule, a cardinality restriction of >= 1 on a property, but don't know the value of the property, then you could end up with an inconsistent KB. Returning to your example, what would you do if you know that every Plasmid has a sequence (cardinality == 1) but as you say, you don't know it (ergo cardinality == 0)?
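The worry can be made concrete with a toy sketch (names like `hasSequence` are hypothetical, and this is not SBOL library behavior): under negation by failure, the cardinality of a property is exactly the number of asserted values, so a required property with no known value yields an inconsistency.

```python
# Toy illustration of the concern: under negation by failure, a property's
# cardinality is exactly the number of asserted values, so a required
# (min-cardinality 1) property with no known value closes to 0 and the
# knowledge base becomes inconsistent. Names are hypothetical, not SBOL terms.

def consistent_under_closure(triples, subject, prop, min_card):
    observed = sum(1 for s, p, _ in triples if s == subject and p == prop)
    return observed >= min_card

# A Plasmid whose sequence is required but unknown:
kb = {("plasmid1", "rdfType", "Plasmid")}
print(consistent_under_closure(kb, "plasmid1", "hasSequence", 1))  # False
```

Under the open-world reading, the same knowledge base is merely incomplete; it is the closure rule that forces the contradiction.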
The use of the `rdfType` statement as a distinction between what is and is not to be considered complete is an interesting tactic. I'm torn about it. On the one hand, it's a simple rule, and does not require additional machinery. On the other hand, if you rub SBOL up against other tools, you might end up accidentally violating this rule (e.g., if you have a tool that does DL reasoning to infer entity types). So for that reason you might prefer to have a more explicit signaling of what must be complete -- an explicitly meta statement that would live in your document. The hazy analogy is that this would be a modality rather than a property. You could introduce `sbol:typeOf` that would be syntactic sugar for the combination, and then if you bang an SBOL database against Pellet or something like that, you wouldn't have to worry about introducing inconsistencies (unless you have expressions of existential force that could introduce new `TopLevel` entities).
Another issue -- this is probably just a nit -- do you have a definition of what it means for one entity to be the child of another? I haven't read the whole document by any means, but in OPIL you have pictures that use the UML conventions that distinguish between containment and reference. This is not a logical notion, so you might need to expand on how containment is to be interpreted.
Some notes on your enumeration:
Thank you for making me feel like an absolute pedant!
Working my way through here:
> Returning to your example, what would you do if you know that every Plasmid has a sequence (cardinality == 1) but as you say, you don't know it (ergo cardinality == 0)?
Then that would be an invalid SBOL document. We have many cases like this, though they're generally restricted to things that you are certain to be able to know. Actually knowing a sequence, for example, is always optional.
> The use of the rdfType statement ...
Yes, the reason that I think this makes sense is that the SBOL libraries use the presence of an rdfType as the indication that an object should be created. If RDF tools cause problems with this, that would be an interesting thing to find.
> Do you have a definition of what it means for one entity to be the child of another?
Yes; it's in Section 3.2, where we talk about the UML diagram conventions.
> "Any disjoint set of \sbol{TopLevel} objects from different SBOL documents MAY be composed to form a new SBOL document. The result is not guaranteed to be valid, however, since the composition may expose problems due to the relationships between \sbol{TopLevel} objects from different documents."
I think I'll keep it this way, since the prior bullet defines any extracted TopLevel as a valid SBOL document. The place where you can end up with problem is, for example, if a Component points to something that it thinks is a Sequence, but when you actually get the object, it turns out to be another Component instead. There's also a lot of more subtle issues that can show up too.
> "If two \sbol{TopLevel} objects in different SBOL documents have the same identity and all other properties are equal as well, then they MAY be treated as identical and freely merged."
Fixed; added the child objects too.
> Given the inference rules, aren't these two documents necessarily conflicting?
Yes, exactly. This is just spelling this out explicitly to be obvious to people who aren't as familiar with implications. In the past, people have attempted to incautiously merge objects in ways that have not necessarily gone well.
> The use of the rdfType statement ...
>
> Yes, the reason that I think this makes sense is that the SBOL libraries use the presence of an rdfType as the indication that an object should be created. If RDF tools cause problems with this, that would be an interesting thing to find.
Well, for example, a picture describing the Pellet ontology reasoner (one of the most popular) states that (among other things) it performs:
> Realization, which finds the most specific classes that an individual belongs to; or in other words, computes the direct types for each of the individuals. Realization can only be performed after classification since direct types are defined with respect to a class hierarchy. ....
So this kind of inference is definitely within the purview of OWL reasoning engines. In general, I think it is dangerous to rely on the presence or absence of statements that may be part of the consequential closure of a document. Hence my suggestion that you add a special notation, rather than relying on rdfType. You mean something more than just that an object is of a particular type.
Alternatively, you could define an SBOL document in stronger ways that exclude its application to, say, arbitrary databases so that additional formulas in the consequential closure may be derivable, but cannot appear in the document (or it will become ill-formed). That may be necessary anyway, since you are defining the consequential closure of an SBOL document in ways that would be inconsistent with treating that document as an OWL document when taking its consequential closure.
[Hence again my extreme distaste for the use of OWL as a language for describing data.]
I think it may be less of a threat than you fear, since the propositions we'd be worried about in SBOL generally cannot be inferred.
Well, like I said, type predications can be because DL's main thing is classification...
I think it might be able to reason about `sbol:type` fields, but likely not draw new conclusions about the `rdfType`, because any valid SBOL document will already have all of its items set as disjoint leaf classes with respect to `rdfType`.
I was more worried about the case where one might have restrictions on property values that could cause a DL engine to be able to infer things about the type of a property value, and accidentally insert type assertions. That's what worried me about using the rdfType assertions to know whether a document is an object's "home."
I think I'm going to close this as the primary issue having been resolved with the specification pull and its attendant adjustments per discussion in this thread. If there are concerns you feel aren't yet resolved, can you please open a new issue that focuses on those?
@rpgoldman points out a potential conceptual issue
We've approached this in the same way that SBOL has, and I believe that the approach is well-formed and sound, but it's also not necessarily explained correctly.