What do we want from our `UID`s?

JacquesCarette / Drasil

Generate all the things (focusing on research software)

https://jacquescarette.github.io/Drasil

BSD 2-Clause "Simplified" License

137 stars 26 forks source link

What do we want from our `UID`s? #2911

Open balacij opened 2 years ago

balacij commented 2 years ago

This kind of hack strongly tells me that we need to revamp our UID implementation to something more structured. I know MMT uses URIs for doing something similar. I don't like that either, but it is (slightly) better than name concatenation.

I'm okay with keeping the hack for now, as we can easily find it by looking for +++.

_Originally posted by @JacquesCarette in https://github.com/JacquesCarette/Drasil/pull/2904#discussion_r760502476_

Please note that this discussion starts from @JacquesCarette's comment above and on that comment thread, and then continues to this issue.

This ticket is also related to #2788. Additionally, @Ant13731 and I also previously discussed that we lose information with the plain String-concatenation for UIDs in https://github.com/JacquesCarette/Drasil/issues/2775#issuecomment-895466586 .

JacquesCarette commented 2 years ago

UID = "Unique IDentifier". It is a way to refer to a piece of 'data', and gives an 'identity' to data.

Using strings for this was a horrible hack that worked while we were figuring things out. At least now, we kind of have an abstract type for it, so that the outside world shouldn't know our internal representation. Which shouldn't be a string.

In the web world, URIs are a 'solution' to provide addresses for resources. Related are URNs.

Although URIs and URNs are encoded as strings, they possess a grammar so that their parts can be recovered (by an easy parse). The problem with that is such a scheme means that the names of 'inner' items either need to be restricted (to avoid 'code' characters) or these need to be escaped. This is useful for external transmission - but since UIDs are completely internal, we are free to design out own Abstract Data Type.

The point is that our 'resources' (knowledge, information, in large part) do happen to live in contexts. If these contexts have names, then we can indeed use name spaces as a means to have short local referents (our current UIDs) that are still unique because they sit in their own name space.

Part of the problem is that we have derived information, for which we don't have a good scheme that generates names for them. For example, we could have many things have the same 'name', but if they had different 'types', and those types were part of our own UID/URI/URN scheme, then they would not be the same anymore. This is what the appending hack tries to fake.

I'm a bit confused by our existing ConceptDomains; I'm having a hard time understanding which "level" we exactly want to call a conceptual domain for something. Do you think a new UID scheme could/should address this?

I don't think we want to mix these two things. The ConceptDomains are a set of higher level ideas that describe the 'domain' that a current concept belongs to. Like 'acceleration' being part of physics - but also solid body mechanics. There are many concepts that belong to a multitude of domains. It is tempting to try and build a hierarchy here, or some other scheme, where our names also indicate the 'domains', but I don't think that's a good idea. I see these as two separate 'graphs' that give relationships between all of our data.

balacij commented 2 years ago

In the web world, URIs are a 'solution' to provide addresses for resources. Related are URNs. Although URIs and URNs are encoded as strings, they possess a grammar so that their parts can be recovered (by an easy parse). The problem with that is such a scheme means that the names of 'inner' items either need to be restricted (to avoid 'code' characters) or these need to be escaped. This is useful for external transmission - but since UIDs are completely internal, we are free to design out own Abstract Data Type.

Of course, I like this idea a lot!

My first thought is that the ADT should denote the formation and possibly "change history" of the knowledge/chunks they are given to. What do you think?

Part of the problem is that we have derived information, for which we don't have a good scheme that generates names for them. For example, we could have many things have the same 'name', but if they had different 'types', and those types were part of our own UID/URI/URN scheme, then they would not be the same anymore. This is what the appending hack tries to fake.

Yes, exactly.

Do you think we should also be considering refinements here? Specifically, are the ways we can refine chunks generically the same as the ways we can derive information? I imagine this bucket of generic ways would include discarding data/fields, appending new fields, and altering fields with some new information.

While I imagine there would be multiple ways can derive information, I'm a bit confused by how we should have information "originally" enter a knowledge base. Do we want to treat all information as being derived from "empty" knowledge (e.g., a unit constructor)?

Re: ConceptDomains: Thank you, that clears up my misunderstanding very well. I completely understand why they shouldn't be mixed now. Especially with your discussion of contexts, I can see why you say that we should go against the hierarchy scheme -- it is very likely to cause problems.

JacquesCarette commented 2 years ago

My first thought is that the ADT should denote the formation and possibly "change history" of the knowledge/chunks they are given to.

If by "formation", you mean the context in which they were formed, then yes. We should proceed very carefully - we don't want to stuff too much information in the UIDs themselves. We have to remember their main purpose: as a unique identifier. Any functionality beyond that should go elsewhere.

Do you think we should also be considering refinements here? Specifically, are the ways we can refine chunks generically the same as the ways we can derive information? I imagine this bucket of generic ways would include discarding data/fields, appending new fields, and altering fields with some new information.

This feels a little too "operational" to me. We should look at our meaningful semantic operations, and not pay too much attention as to how they are realized in the current code.

While I imagine there would be multiple ways can derive information, I'm a bit confused by how we should have information "originally" enter a knowledge base. Do we want to treat all information as being derived from "empty" knowledge (e.g., a unit constructor)?

Excellent remark. Some information we want to treat as 'axioms', i.e. pre-existing before us. They are just pure statements about our 'world model'. Other information will indeed have a traceable origin, i.e. be constructed from axioms, by some kind of mechanism. We do, eventually, be able to track that completely.

But there is a lot of information that we will treat as axioms. Not that that information isn't highly organized and contain all sorts of relations to itself, just that its origin is 'given to us'.

balacij commented 2 years ago

If by "formation", you mean the context in which they were formed, then yes. We should proceed very carefully - we don't want to stuff too much information in the UIDs themselves. We have to remember their main purpose: as a unique identifier. Any functionality beyond that should go elsewhere.

My original thought was that we could create a unique identifier that tracks the construction of some data, but now I'm also starting to think of scenarios where this falls apart (e.g., refining some data in some way tracks the change through the UID, but if we have multiple refinements that can be performed in any order and end up with the same data, then the UID is not unique, so it fails!). So, it looks like this idea won't work well.

I see what you mean, we need to focus on it being a unique identifier, thank you!

This feels a little too "operational" to me. We should look at our meaningful semantic operations, and not pay too much attention as to how they are realized in the current code.

Yes, I see what you mean, and, I agree.

Regarding our semantic operations, which ones are meaningful? I imagine we'll be mainly focused on the constructors for chunks because these are the areas where we assign UIDs to chunks. The chunks we systematically construct from the contents of our ChunkDBs seem like they would be meaningful because they are all derived from a knowledge-base but attuned to some configurations. For example, we systematically create many of the chunk types outside drasil-lang and drasil-theory (e.g., drasil-gools chunks should almost all be derived [with exception for ODEs] from chunks that exist in drasil-lang and -theory, and drasil-builds Makefiles are generated w.r.t particular softifacts we are building).

Regarding the chunks that we place in our ChunkDBs but then also de-nest their inner chunks into the other various maps inside our ChunkDBs, I believe we are missing a few operations (else we have UID collisions when merging our various ChunkDB maps into a single one). For example, DataDefinitions inherit their UID from their related QDefinitions, QDefinitions inherit their UID from their related QuantityDicts, and QuantityDicts usually carry their UIDs from within an internally held IdeaDict but the UID is usually manually written for a QuantityDict at the moment. I imagine that we would want to intervene here, altering the UID of both the DataDefinitions and QDefinitions as they might contain information (e.g., terms) that conflict with the information from the underlying QuantityDict/IdeaDict.

Excellent remark. Some information we want to treat as 'axioms', i.e. pre-existing before us. They are just pure statements about our 'world model'. Other information will indeed have a traceable origin, i.e. be constructed from axioms, by some kind of mechanism. We do, eventually, be able to track that completely.

But there is a lot of information that we will treat as axioms. Not that that information isn't highly organized and contain all sorts of relations to itself, just that its origin is 'given to us'.

Thank you, that sounds perfect. Treating pure statements about the world as axioms makes the most sense to me.

JacquesCarette commented 2 years ago

Note: I've disliked our use of 'chunk' for > 3 years now. When we started ~7 years ago, it made sense, since our thinking was along the lines of "literate programming on steroids" (where we borrowed the term), but has changed so much that it now sounds funny.

Our 'chunks' are bits of knowledge. Unique bits of knowledge should have a unique "handle" to refer to them. The tricky part: some of our knowledge is built "incrementally", so some big thing (like a QDefinition) is the end goal, but it turns out that we have some of its constituent parts around too. And, of course, we have lenses that let us project out pieces from (say) a QDefinition.

That's really were we got ourselves into trouble. Not understanding the bits of knowledge that were "unique wholes" and those that are mere "parts of that whole". The collection of stuff (QuantityDict, IdeaDict, etc, etc) just makes that worse.

I still don't know how to deal with that. But that's the real problem: identifying the "knowledge atoms" while still being able to see the actual particles that it's made out of, without giving them too high a status. Does that make sense?

balacij commented 2 years ago

Note: I've disliked our use of 'chunk' for > 3 years now. When we started ~7 years ago, it made sense, since our thinking was along the lines of "literate programming on steroids" (where we borrowed the term), but has changed so much that it now sounds funny.

I think we've brought this up a few times in the past. I'm okay with using different terminology to refer to them, or renaming them entirely. I've also naturally tended towards saying any of "knowledge", "fragments", and "knowledge fragments" when referring to them.

I'm a bit confused by "status" here, but, other than that, I see the problem you're getting at. However, I'm not quite sure how we can proceed. It sounds like we need to re-evaluate our chunk hierarchy somehow to figure out which ones should have their own UID, but I'm not quite sure by what criteria. Since we have the lenses that let us project out pieces, do we really need all the incremental chunks used to build it up (the ones that it shares the same UID with)? I might be confused, but it almost sounds like we need an alternative method of building chunks entirely?

JacquesCarette commented 2 years ago

While I'm quite keen on "knowledge fragment", I'm also quite aware that "knowledge" has been mis-interpreted (and it's vague). So I'm still trying to figure out good terminology. I do like "fragment".

In your second paragraph, you've zoomed in on the problem exactly:

we aren't quite sure what the right network of fragments we want,
we're not quite sure how to build them up,
we're not quite sure how we end up actually using them.

Part of the idea behind the current scheme is that we can build "knowledge fragments" incrementally. Good question: is this useful? I'm pretty sure this is used quite a bit, but also 'not', in the sense that a lot of the examples build their fragments all at once.

Another part of the idea is that we can have knowledge fragments that are 'huge' but used in a setting that only need a fraction of that information, without having to worry. The lenses let us do that on the programming level; but also the UIDs let us know where that the "downward projection" of the information came from. That traceability is extremely important.

I am quite convinced that the 'downward projection' traceability is crucial. I am willing to believe that incremental build-up is not a necessary feature.

So your last question indeed hits it exactly right. It feels like the next 'move' would be to look at what "incremental building" actually looks like inside the Drasil code base right now (i.e. at its point of use, not the tools that let us do it). I think that's where we're going to learn the most.

balacij commented 2 years ago

While I'm quite keen on "knowledge fragment", I'm also quite aware that "knowledge" has been mis-interpreted (and it's vague). So I'm still trying to figure out good terminology. I do like "fragment".

Nice! I can see how just "knowledge" is confusing and misleading -- it might lead the reader to think of something larger than the specific parts we're focusing on when forming a fragment type.

I believe that building up knowledge fragments incrementally is a generally good and useful idea. We make good use of it in reuse too (though sometimes we do deep copies with changed UIDs).

However, the implicit relations we've formed and rely on (by sharing UIDs across the types we've composed to form larger types) have caused some problems in the past.

As of right now, our chunk/fragment types are built as product types of other fragments and "primitives" (e.g., Haskell primitives, and the special encodings we build, like Expr or Sentences). Generally, we only have the "smallest" chunks/fragments (and a few select "unique" fragments) carry a raw UID. The fragments that are compositions of other fragments will generally inherit the UID of one of them.

Taking an example, if we have A as a product of B and C where both B and C are both fragments, which of B and C should A inherit it's UID from?

If it's assumed that B and C have the same UID, then this shouldn't matter. However, it seems unlikely that B and C would share the same UID. If they both had the same type, then they should be the same data and A would just be a type alias. If they're typed differently, then they both expose different functionality and properties, which means that the UID shared between all 3 types would then denote some piece of data that has a dynamic type, which is very confusing, but can also work really well! If all 3 share the same UID, then you can only resolve the type of a UID by inspecting all maps in the ChunkDB and collecting all elements until we can form one single list of chunks that, together, give us everything we currently know about that UID. I believe this is approximately an entity-component system, and this is vaguely similar to how we use ChunkDBs when searching for the smaller parts of a fragment.
If B and C don't have the same UID, then we're likely bundling these two things together for some specific relation between the two. Thus, we are creating new information that is distinctly different from either B and C. In which case, the question seems silly because A should almost definitely have it's own UID, as it denotes a completely different (new) and independent piece of data.

Taking another example, if we have E and F, where E is a fragment and F is a primitive, should we be allowed to form D as a product of E and F? The reason why I ask is because I think primitives are a special case of fragments in that they're just UID-less. However, we've discussed this in the past and I think you made it clear that primitives should actually have UIDs; in which case, the primitive should be wrapped in a basic "primitive" carrier fragment (e.g., Primitive (a) = UID \cross a), and thus the question is answered the same as the above example.

Ultimately, I think this boils down to a very important design question: Should UIDs denote a single statically typed knowledge fragment? This isn't to say that a UID should have the Haskell-level type of the fragment exposed as a type parameter of UID (at the very least, "yet", but that could very well be a possibility later too). As of right now, I think we're somewhere between an additive-only entity-component-like system and a strong static type system for our fragments.

If we do want UIDs to denote a single statically typed fragment (as opposed to a list of fragments that collectively give it functionality and "meaning"), then I believe that all fragments should form their own unique UID (and, hence, never inherit the UID of another fragment).

balacij commented 2 years ago

Now, assuming we want UIDs to denote a single statically typed fragment, what should the UIDs look like?

I think the exact version of a UID that we want in Drasil is traditionally called a Rigid Designator, whereby the name/identifier refers to something precisely across all "worlds". This is as we discussed in the last meeting.

We as humans might use proper nouns to refer to rigid designators (such as "Heat", or "McMaster University"). Unfortunately, these terms can still be ambiguous!

Since we're using a computer to encode real world information, I think we can use the internal data that makes up an encoding, and the type name/designation of the encoding to form it's designator. This way, the designator is precisely the data itself and references become completely unambiguous. Additionally, we form an equality relation between all chunks.

Aside: If I've understood your comment regarding "downward projection" correctly, then I think this addresses your concern regarding downward projection because all data would be exposed through the UID (the UID would directly expose the internals, and the internals of the internals, and so on).

So, how can we practically form Rigid Designators for Drasil?

Practically, we might choose to encode this data using a JSON-encoded String, and wrap it in another JSON object which includes it's type designator. This might lead to data encodings that are ridiculously large, which we can mitigate by: (1) add caching via the lenses caching magic that Dr. Carette previously found a blog post on, and (2) hash the Strings to reduce memory footprint (I imagine that hash collisions wouldn't start occurring until we start hitting the millions or billions of fragments). Finally, this results in a "raw" uid which, I imagine, is only for debugging, and a "practical" UID (a hash).

EDIT: https://byorgey.wordpress.com/2021/09/17/automatically-updated-cached-views-with-lens/

This is the interesting blog post that Dr. Carette previously came across.

balacij commented 2 years ago

Finally, with those 2 above comments, and a few meetings ago where we briefly discussed how chunk building works, it somewhat seems like having a DSL for creating Drasil chunks/fragment types would be nice here.

JacquesCarette commented 2 years ago

To confirm: we want UIDs to denote a single fragment. Furthermore, we probably do want something that's part of the UID to carry the information of what is actually available from that fragment. In a sense, an encoding of what's valid to project out.

Another way to think of it: a UID should be the 'name' of exactly one thing. That thing might be an abstract concept, but it's still a thing.

Before things go sideways: some 'things' naturally are related (think of subtyping in programming). We don't want to use UIDs to carry that information too. If two things are related, then we'll create an explicit relation between them. Just because that relation might be 'the identity' (because it's an embedding) doesn't mean we want to erase the fact that there's a difference.

On to products: if A is a thing, and so is B then AxB is a different thing. It should have its own UID. It doesn't even matter if it turns out they are both the same -- 2 of a thing is different than 1 of a thing.

Note that if A is a theory and B is also a theory, then there are other operations, like 'amalgamated union', which will definitely create a new theory, but whose pieces will consist of parts of A and B mixes in non-trivial ways. But that's because it's an operation that's more like a union than a product.

JacquesCarette commented 2 years ago

Note that the above does not address certain things "head on":

incremental building
projecting
static types

Here, I'm going to explore that. So if part of a UID is a static type, then we can neither build incrementally nor project! This is because

the parts would be missing some of the items that the type says should be there, so we couldn't build that fragment,
the projections would similarly erase information that the type says should be there.

Having said that, it's not hopeless. This is because if we make the type part of the UID itself, then even though the raw 'name' that we use might stay the same, when we project if we also remove those forgotten items from the type, then the UID, as a whole, changes. So the pair (name, type) remains unique! Of course, this is where 'subtyping' really wants to kick in. The relation will be just too obvious, and we'll be sorely tempted to bake it in to the system. I think we should resist that temptation.

I'm still not sure if this all works. Naively, it seems like it will.

But I'm still concerned, because we have still not figure out what a 'knowledge fragment' really is!

balacij commented 2 years ago

Thank you, this is very helpful, and thank you for the corrections!

Regarding fully understanding what a knowledge fragment is, I get a vague sense that Drasil is on its way to becoming some sort of dependently typed specification/programming language. In which case, the UIDs with a static type as you discussed above would be "okay", but I'm not very confident in this.

JacquesCarette commented 2 years ago

I'd rather not go 'dependently typed' any earlier than necessary. I'm fine with having static checks that are like fake, low-power dependent types.

Let's design UIDs without that.

JacquesCarette commented 2 years ago

I started thinking about what is UID (link is to the philosophical approach I am taking now). That led me thinking about the requirements for URI/URNs. That bore fruit: RFC 1630 has some comments about the need for a universal syntax for URI/URNs, and Tim Berners-Lee's original proposal also has some nice motivation.

balacij commented 2 years ago

Some follow-up notes from the Friday meeting, specifically w.r.t UIDs:

UIDs are ~~used~~ placed everywhere, but they are used very seldomly.
- We use UIDs for testing sameness. E.g., Eq X usually relies on the UID.
- But when do we generally need to know when 2 chunks are equal?
- TODO (@balacij): Where do we actually check when 2 chunks are equal?
- We use UIDs to pass Chunks around by reference.
- Originally, this was done because of “references” because we want to refer to things that don’t exist yet. E.g., if we have a rigid designator (UID) for something before it exists, we clearly know what we wanted it, and as long as we have it before we actually need to look at it, we’re okay (e.g., thunks/lazy evaluation). In other languages, this might be similar to Promises.
- A bad reason, historically used, was a misplaced desire for efficiency.
- We need to observe what interface we need from it! Let’s assume that UID is abstract, what is the interface/features we wish it provided? This will tell us how to implement it, or that it’s possibly infeasible… TODO (@balacij)
- We’ll likely at least say that we need equality.
- We’ll likely want to treat it as a rigid designator.
URIs - There’s a good StackOverflow post about this (I would ignore the “accepted” answer, and scroll down just a bit). In particular, there's a link to a w3 post about UR*s. URIs (Uniform Resource Identifiers) are the largest encompassing kind of key->value mapping.
- URLs - A specific kind of URI where the Identifier is a Locator (Uniform Resource Locator). This describes the path one must take to get to something. It shows little information about “what kinds of values it expects”, but must start with a “protocol” designation that the path takes, which sometimes indicates what kinds of values can be expected.
- URNs - A specific kind of URI where the Identifier is a Name (Uniform Resource Name). It must start with a “urn”, and then there is a formalized type list which is used to create a uniform “name” for a resource.
- URCs - A specific kind of URI where the Identifier is the “Citation”/”Characteristics” (Uniform Resource Citations/Characteristics). These are intended to provide metadata about information. Unfortunately, it seems these were never implemented/formalized fully as URCs, and these seem to be the most interesting to us. “TimBL” (on w3) writes that these are related to knowledge representation as well.
The author behind URCs (John Kunze) seemingly originally called (and continues to) them “ERCs” (Electronic Resource Citations), and, at first glance, has a very active history related to identifiers. There’s a bit of history of this on wikipedia. There’s an original paper by Kunze that’s on the pub repo now, and there’s also another paper Kunze links to which is related to this (I haven’t fully gone through it yet). TODO @balacij

balacij commented 2 years ago

In https://github.com/JacquesCarette/Drasil/issues/2853#issuecomment-1034427454, @JacquesCarette describes how we transcribe chunks/knowledge fragments into Haskell. Along the way, he wrote an exhaustive list of the different kinds of knowledge we transcribe. If I'm understanding correctly, these would also be the different kinds of chunks. If so, this would likely affect how Chunks are encoded in #2873 (likely for the best, because we would be one large step closer to building a more idiomatic encoding of chunks, rather than how I left them [e.g., things that provide a uid and references to other things]). If we are able to get a higher-level encoding/understanding of our various chunks, would any of that new understanding trickle back here?

JacquesCarette commented 2 years ago

chunk = chunk of knowledge = knowledge fragment.

But that's too high-level to be useful. So we have 'different types of chunks', where we have some (meta-)knowledge of what they contain. Those are the parts that really matter the most. The 'basic knowledge' pieces are (non-exhaustively):

name
short name
space
UID
idea
named idea
uncertainty
symbol
reference
ref address
Definition
notes
citation
concept domain
abbreviation
constraints
reasonable value
unit symbol
units
defining expression
contents
derivation

Our chunks are then like creating a table in a database that has a specific collection of columns from that list. The 'type' associated to something like UncertainChunk is like the schema for that one table. It's implemented as a "nested record" (for convenience), but could have been done as a flattened table too. Insisting on using nested records is a mistake, basically because Haskell's record system won't let us do that. That's ok, our lenses take care of that.

Some collections of base knowledge recurs, and are important: that's why they get promoted into chunks. We could have used arbitrary collections of (key, value) pairs to do this too -- but we wouldn't get much help from Haskell in figuring out which code is doing something sensible, and which is just silly.