Closed: jakebeal closed this issue 3 years ago
I'm afraid I don't know how SBOL has approached this.
I don't believe that SBOL is a good parallel, anyway. A data repository for information about designs is a relatively good example of a case where the open world makes sense: you want the information that's available and you don't necessarily want to get nothing about a design (or other entity) just because the person supplying that information doesn't have all the information.
On the other hand, when you are specifying a protocol or, even more, an API, you have quite strong preferences about what has to be there, and since this will lead to execution, the trade-off is different -- you're probably better off with no protocol than a protocol that could do something weird and potentially damage your lab equipment.
Let me suggest that we may want to think about this not in terms of strictly open / strictly closed, but in terms of some intermediate (and readily checkable) levels of closure.
Here is how I am thinking about this, from most open to least open, in terms of OPIL (PAML will be a bit more complex, but follows similar principles):

1. Entirely open world: given an `ExecutionRequest`, it's valid to freely mix in some other triples specifying another fragment of the `ExecutionRequest`, plus some shreds of the `ProtocolInterface` that it makes use of. Highly undesirable, for the reasons that you specify.
2. If I have an `ExecutionRequest`, then I assume that I have all of the information about that `ExecutionRequest`, plus all of its associated `ParameterValue`, `SampleSet`, and `Measurement` objects, their children, etc. I might not have the `ProtocolInterface` to which it refers, and I know nothing about whether other `ProtocolInterface` or `ExecutionRequest` objects exist. This lets me reason safely about the `ExecutionRequest`, but it's possible the `ProtocolInterface` will change on me.
3. I have the `ProtocolInterface` as well, and all of its children. Now the `ProtocolInterface` is guaranteed to be known, but it always has to be carried around redundantly, and we can still have a conflict when the lab sees that the `ProtocolInterface` that it has been sent isn't identical to the copy that it is holding internally.
4. Entirely closed world: if I don't see an `ExecutionRequest` or `ProtocolInterface`, then it doesn't exist. This seems overly strict, and I'm not sure of the actual value over level 3.

Currently, OPIL assumes level 2 closure, which is what SBOL uses as well. SBOL previously assumed level 3 closure, but weakened to level 2 once the things that people were talking about started to become large. Since OPIL is pretty small, I think we could readily do either level 2 or level 3, but I think the benefits of level 3 can also be achieved by other more lightweight methods like version UIDs. Level 3 is also likely to pose problems when moving from OPIL to PAML, because it prohibits dynamic binding in an execution environment, which is a major benefit that we are looking for.
This makes sense to me, except what does it mean to say "I assume I have got"? That is why we need a notion of "document" or other notion of "context." Closure must be relative to some context.
What is the protocol specification relative to which this assumption is to be made?
With such a notion, we would be able to make claims about completeness that could be evaluated as either true or false.
Note that this could be a relatively flexible notion of context -- the context could be defined simply as "a set of triples," or some notion that is equivalent to that.
Given that, we could capture the notion of closure explicitly, without relying on informal terminology. E.g., if we had a set of assertions, we could potentially even compute the closure, adding `maxCardinality` assertions where we previously had only `minCardinality`.
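As a toy sketch of that computation (the assertion encoding here is illustrative, not OWL or OPIL syntax), one can count the asserted values of a property per subject and emit matching min/max cardinality assertions:

```python
from collections import Counter

# Toy sketch of "computing the closure": for each subject, count the asserted
# values of a property and add matching min/max cardinality assertions, so a
# class-level minCardinality constraint is tightened to an exact count.
# The assertion encoding below is illustrative, not OWL or OPIL syntax.

def compute_cardinality_closure(triples, prop):
    counts = Counter(s for s, p, _ in triples if p == prop)
    closure = set()
    for subject, n in counts.items():
        closure.add((subject, "minCardinality/" + prop, n))
        closure.add((subject, "maxCardinality/" + prop, n))
    return closure

triples = {("proto", "foo", "a"), ("proto", "foo", "b")}
print(sorted(compute_cardinality_closure(triples, "foo")))
# [('proto', 'maxCardinality/foo', 2), ('proto', 'minCardinality/foo', 2)]
```

A two-valued `foo` on `proto` closes to min = max = 2, which still satisfies a class constraint such as min-cardinality 1 / max-cardinality 3 while pinning down the instance.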
We might be able to define a computable model/inferential closure for a set of OPIL assertions, too, which would impose a simple definition of consistency (a context would be labeled as inconsistent if no closure could be computed).
Note that such a notion of context might have to exclude statements. E.g., if a "cell line" was defined in terms of a "parent" relation, with a minimum cardinality constraint on the parent relation, such a definition would have to be excluded from the context, although mention of the cell line in the context could still be permitted, because together with reasonable additional conditions (the parent relation is antireflexive, e.g.) it would force any model to be infinite in size.
Actually, it occurs to me that we might be able to permit open world facts in the context, if we had a notion of cardinality that extended OWL's notion. But that would get complicated quickly, so probably not a good idea.
I'm not sure I'm following you on how you're thinking to compute the closure.
The notion of context that I know how to implement readily is the one based on TopLevel + children. In this approach, any context (file, serialization, object in memory, etc.) that contains a TopLevel definition is guaranteed to contain all information about that TopLevel and its children. Any other object that it references may or may not be accessible.
OK, so there are two issues I believe:
We can define a context in a way that is unambiguous as "A container holding RDF triples (or equivalent data structures) describing a TopLevel." Note that while this is unambiguous enough, it is also flexible enough to cover all of the possibilities in your examples above. This probably would require a little more elaboration to say what is excluded from a document -- presumably if you reference an entity in, e.g., OPIL, you don't want to define your document as including all the facts you know about that entity. Such a context can be either well-formed or ill-formed, depending on whether it meets OPIL's constraints.

I don't believe the notion of "TopLevel + children" satisfies the definition of context, because it doesn't say what that thing is. Some set of statements about a TopLevel + children could be incomplete, and could be added to in order to get something that is complete. It's only that "something" that can be said to be complete or not. OWL is a dialect of logic, so it defines a notion of consistency, but not a notion of document, or completeness or incompleteness of description. As I pointed out earlier, it is designed to accommodate sets of statements all of whose models are infinite, and so couldn't be complete in your sense. I feel like there is something about the notion of "document" that triggers resistance on your part, but I don't know what it is.
OWL is a conventional, monotonic logic. This means that it cannot accommodate closure constraints unaided. Any closure constraint is necessarily non-monotonic. Such non-monotonic constraints include negation by failure, and cardinality inferences like "the only property values are the ones that are in my database". There are at least two ways one could deal with this:
1. Stay within OWL, asserting the tightened cardinalities directly. Suppose a class constrains a property `foo` with min-cardinality 1 and max-cardinality 3. If you have 2 values for `foo` for an instance `proto`, you just assert them as before, but now you also assert `max-cardinality proto = min-cardinality proto = 2`, which satisfies the OWL class constraints and tightens them in a way that reflects the closure you want.
2. Go beyond OWL, defining relationships such as `opil:minCardinality` and `opil:maxCardinality` that entail the corresponding OWL relationships, but that are also the relationships to be considered in the document closure computation. Otherwise, a defined closure computation could have undesirable conclusions if you were to add other cardinality assertions to a document that you want to have conventional force (e.g., every strain has at least one DNA sequence).

I like the way that you are developing this.
On point 1, I resist the word "document" primarily because I've seen it lead to a lot of confusion in the past: a lot of people then start thinking we're talking about serialization formats and assuming very strong closure, up at level 4 ("entirely closed world"). If we called it something like "model" or "description", I'd be happy with that, or any other better suggestion.
I believe that "well-formed or ill-formed" fits well with the concept of "validation rules" that has been used with SBOL. Notably, SBOL formally distinguishes four classes of rules:
This approach for reasoning has been pretty successful for us in practice, whether or not it can be formalized in OWL. Thus, I think I'm leaning toward the beyond-OWL inference option, since I believe that's what we're on track to implement in the library.
@jakebeal I can see why you don't like the word "document." I don't really like "model," since it crashes into related concepts (like the model of the OWL statements). (It's models all the way down!) I guess we can't use "Protocol Specification," because OPIL covers both the protocol specification and its use. So we should use one term for the protocol specification, and one for a request (which will be relative to a protocol specification). If the protocol specification (or whatever we call it) is a meaningful unit, then it should have its own designator (probably an IRI), rather than using the URI of the `ProtocolInterface` entity. Ideally this would not be necessary, but it's better safe than sorry, and I note that OWL has a notion of "ontology" that is separate from any entity contained therein. That would potentially allow for extending the definition of well-formedness from a single `ProtocolInterface` to a set thereof that is grouped together into whatever this ontology-like OPIL entity is. So now we need three names. Stop me before I ontologize again!
The validation rules notion makes sense to me, too. I think the notion can be formalized, but only by adding a notion of a "document" (whatever name we come up with) to the OWL background.
OWL has rules that tell us what is and isn't entailed by a set of statements, but beyond syntactic limitations and inconsistency, it can't capture what one is and is not permitted to say. You want to specify both what one must say (e.g., for completeness) and what one must not say (things that might be legitimate, but that would damage the plans you have for processing OPIL over and above just gazing upon it and marveling).
As we're finalizing SBOL 3.0.1, I've added a section describing the closure assumptions and their implications for document structure. @rpgoldman, can you please take a look and see if you see anything that needs to be adjusted in the language? Once it's finalized and released, I'll pull it into the OPIL and PAML specs as well.
Note that I am by no means an expert in description logics (DLs), just someone who has banged up against them several times, and been forced to think about them harder than he would have liked!
What concerns me a little is that a lot of DL constructs, including those in OWL, have existential force. Of course the cardinality restrictions are the most obvious of those. So... is there any chance that such statements could get you into trouble when combined with your closure rule (negation by failure)? The most obvious case would involve incomplete knowledge. If you have the closure inference rule, a cardinality restriction of >= 1 on a property, but don't know the value of the property, then you could end up with an inconsistent KB. Returning to your example, what would you do if you know that every Plasmid has a sequence (cardinality == 1) but as you say, you don't know it (ergo cardinality == 0)?
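The worry can be made concrete with a toy sketch (names like `hasSequence` are hypothetical, and this is not SBOL library behavior): under negation by failure, the cardinality of a property is exactly the number of asserted values, so a required property with no known value yields an inconsistency.

```python
# Toy illustration of the concern: under negation by failure, a property's
# cardinality is exactly the number of asserted values, so a required
# (min-cardinality 1) property with no known value closes to 0 and the
# knowledge base becomes inconsistent. Names are hypothetical, not SBOL terms.

def consistent_under_closure(triples, subject, prop, min_card):
    observed = sum(1 for s, p, _ in triples if s == subject and p == prop)
    return observed >= min_card

# A Plasmid whose sequence is required but unknown:
kb = {("plasmid1", "rdfType", "Plasmid")}
print(consistent_under_closure(kb, "plasmid1", "hasSequence", 1))  # False
```

Under the open-world reading, the same knowledge base is merely incomplete; it is the closure rule that forces the contradiction.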
The use of the `rdfType` statement as a distinction between what is and is not to be considered complete is an interesting tactic. I'm torn about it. On the one hand, it's a simple rule, and does not require additional machinery. On the other hand, if you rub SBOL up against other tools, you might end up accidentally violating this rule (e.g., if you have a tool that does DL reasoning to infer entity types). So for that reason you might prefer to have a more explicit signaling of what must be complete -- an explicitly meta statement that would live in your document. The hazy analogy is that this would be a modality rather than a property. You could introduce `sbol:typeOf` that would be syntactic sugar for the combination, and then if you bang an SBOL database against Pellet or something like that, you wouldn't have to worry about introducing inconsistencies (unless you have expressions of existential force that could introduce new `TopLevel` entities).
Another issue -- this is probably just a nit -- do you have a definition of what it means for one entity to be the child of another? I haven't read the whole document by any means, but in OPIL you have pictures that use the UML conventions that distinguish between containment and reference. This is not a logical notion, so you might need to expand on how containment is to be interpreted.
Some notes on your enumeration:
Thank you for making me feel like an absolute pedant!
Working my way through here:
> Returning to your example, what would you do if you know that every Plasmid has a sequence (cardinality == 1) but as you say, you don't know it (ergo cardinality == 0)?
Then that would be an invalid SBOL document. We have many cases like this, though they're generally restricted to things that you are certain to be able to know. Actually knowing a sequence, for example, is always optional.
> The use of the rdfType statement ...
Yes, the reason that I think this makes sense is that the SBOL libraries use the presence of an rdfType as the indication that an object should be created. If RDF tools cause problems with this, that would be an interesting thing to find.
> Do you have a definition of what it means for one entity to be the child of another?
Yes; it's in Section 3.2, where we talk about the UML diagram conventions.
> "Any disjoint set of \sbol{TopLevel} objects from different SBOL documents MAY be composed to form a new SBOL document. The result is not guaranteed to be valid, however, since the composition may expose problems due to the relationships between \sbol{TopLevel} objects from different documents."
I think I'll keep it this way, since the prior bullet defines any extracted TopLevel as a valid SBOL document. The place where you can end up with problem is, for example, if a Component points to something that it thinks is a Sequence, but when you actually get the object, it turns out to be another Component instead. There's also a lot of more subtle issues that can show up too.
> "If two \sbol{TopLevel} objects in different SBOL documents have the same identity and all other properties are equal as well, then they MAY be treated as identical and freely merged."
Fixed; added the child objects too.
> Given the inference rules, aren't these two documents necessarily conflicting?
Yes, exactly. This is just spelling this out explicitly to be obvious to people who aren't as familiar with implications. In the past, people have attempted to incautiously merge objects in ways that have not necessarily gone well.
> The use of the rdfType statement ...
>
> Yes, the reason that I think this makes sense is that the SBOL libraries use the presence of an rdfType as the indication that an object should be created. If RDF tools cause problems with this, that would be an interesting thing to find.
Well, for example, a picture describing the Pellet ontology reasoner (one of the most popular) states that (among other things) it performs:
> Realization, which finds the most specific classes that an individual belongs to; or in other words, computes the direct types for each of the individuals. Realization can only be performed after classification since direct types are defined with respect to a class hierarchy. ....
So this kind of inference is definitely within the purview of OWL reasoning engines. In general, I think it is dangerous to rely on the presence or absence of statements that may be part of the consequential closure of a document. Hence my suggestion that you add a special notation, rather than relying on rdfType. You mean something more than just that an object is of a particular type.
Alternatively, you could define an SBOL document in stronger ways that exclude its application to, say, arbitrary databases so that additional formulas in the consequential closure may be derivable, but cannot appear in the document (or it will become ill-formed). That may be necessary anyway, since you are defining the consequential closure of an SBOL document in ways that would be inconsistent with treating that document as an OWL document when taking its consequential closure.
[Hence again my extreme distaste for the use of OWL as a language for describing data.]
I think it may be less of a threat than you fear, since the propositions we'd be worried about in SBOL generally cannot be inferred.
Well, like I said, type predications can be because DL's main thing is classification...
I think it might be able to reason about `sbol:type` fields, but likely not draw new conclusions about the `rdfType`, because any valid SBOL document will already have all of its items set as disjoint leaf classes with respect to `rdfType`.
I was more worried about the case where one might have restrictions on property values that could cause a DL engine to be able to infer things about the type of a property value, and accidentally insert type assertions. That's what worried me about using the rdfType assertions to know whether a document is an object's "home."
I think I'm going to close this as the primary issue having been resolved with the specification pull and its attendant adjustments per discussion in this thread. If there are concerns you feel aren't yet resolved, can you please open a new issue that focuses on those?
@rpgoldman points out a potential conceptual issue
We've approached this in the same way that SBOL has, and I believe that the approach is well-formed and sound, but it's also not necessarily explained correctly.