samm82 commented 1 year ago

TLDR: This issue basically boils down to the following questions - the first one being the main question and the other two to clarify my understanding of domains in Drasil:

1. Is there a real semantic difference between IdeaDict and CI? It seems like merging the two would collapse down many other chunks as well, which is good if desired (simpler) but bad if not (we could lose important semantic distinctions).
2. At what level should domains be attached to a chunk? It is noted here that domains shouldn't be added at the Idea level, but that is what CI does.
3. What conditions cause a chunk to require a domain(s)?

During #3086, I realized that IdeaDict and CI are quite similar: the only difference is that CI is guaranteed to have an abbreviation and also has a domain(s).

https://github.com/JacquesCarette/Drasil/blob/09365d7d1f5dc9b54e84a8d64cb601184456ea0c/code/drasil-lang/lib/Language/Drasil/Chunk/NamedIdea.hs#L70-L75 https://github.com/JacquesCarette/Drasil/blob/09365d7d1f5dc9b54e84a8d64cb601184456ea0c/code/drasil-lang/lib/Language/Drasil/Chunk/CommonIdea.hs#L21-L28
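For readers who don't want to follow the links, the difference can be pictured roughly as below. This is a simplified sketch, not the linked source: the field names, the plain `String` types, and the two conversion helpers are assumptions made for illustration.

```haskell
-- Simplified stand-ins for the two chunks; NOT the definitions in the linked files.
type UID = String  -- assumption: plain strings instead of Drasil's own UID type

-- An idea that may or may not come with an abbreviation.
data IdeaDict = IdeaDict
  { ideaUid  :: UID
  , ideaTerm :: String
  , ideaAbbr :: Maybe String
  } deriving (Show, Eq)

-- A "common idea": the abbreviation is guaranteed, and domains are attached.
data CI = CI
  { ciUid     :: UID
  , ciTerm    :: String
  , ciAbbr    :: String
  , ciDomains :: [UID]
  } deriving (Show, Eq)

-- Hypothetical helper: every CI can be viewed as an IdeaDict by dropping its domains.
ciToIdea :: CI -> IdeaDict
ciToIdea c = IdeaDict (ciUid c) (ciTerm c) (Just (ciAbbr c))

-- Hypothetical helper: going the other way only works if an abbreviation exists.
ideaToCI :: [UID] -> IdeaDict -> Maybe CI
ideaToCI doms i = fmap (\a -> CI (ideaUid i) (ideaTerm i) a doms) (ideaAbbr i)
```

Seen this way, the two types differ only in whether the abbreviation is optional and whether domains are stored, which is what prompts the questions above.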

Abbreviations

What is the point of differentiating between an idea with and without an abbreviation? In my mind, it would be easier to use one chunk that might have an abbreviation (Maybe String), but I know we are trying to move away from Maybes when possible. However, I don't know enough about lenses to know if there's another way around this. Is keeping IdeaDict and CI separate intentional? Should they be merged? It seems like there is some work being done on this, from this discussion and the issue about using Lenses instead of Maybes.

Domains

One quirk of CI is that it takes a domain(s), even though it's been pointed out that domains shouldn't be added at the Idea level. However, ConceptChunk, a chunk that takes an IdeaDict, also takes a domain(s), so the domain(s) are almost implicitly associated with the IdeaDict. Should the domains be added up a level by whatever chunk is using the CI instead of being in the CI itself? What conditions cause a chunk to require a domain(s)?

Interestingly, IdeaDict is also present in QuantityDict, which is present in ConstrainedChunk and UnitaryChunk, where the former is also present in UncertainChunk. None of these types are given domains, although they have similar chunks with domains (as shown below):

| Chunk | Similar Chunk | Has Domain from... |
|---|---|---|
| `UncertainChunk` | `UncertQ` | `ConstrConcept` |
| `ConstrainedChunk` | `ConstrConcept` | `DefinedQuantityDict` |
| `UnitaryChunk` | `UnitalChunk` | `DefinedQuantityDict` |
| `QuantityDict` | `DefinedQuantityDict` | `ConceptChunk` |

So all of these chunks get a domain from ConceptChunk, which is just a CommonConcept with a domain(s) and definitely (as opposed to Maybe) has an abbreviation, where CommonConcept is just a CI with a definition (Sentence).

JacquesCarette commented 1 year ago

These are excellent questions, and a really nice self-contained overview (and analysis) of the context behind the questions. @balacij and @peter-michalski , take note.

Let me try to give quick answers to the 3 questions:

1. Unfortunately, there isn't enough difference. One can use either for encoding something that has an abbreviation and an empty list of domains (though that would be a misuse of CI). See below for a longer/better answer.
2. This is a kind of deep 'knowledge encoding' question. When we create a new chunk, how much information do we have about it? How much information is required before we can create a chunk for it? We have not come up with a way to do that (yet).
3. When we'd like to be able to classify where the information belongs.

To go deeper: the 'fundamental knowledge' is not really at the Chunk level, it is what is inside chunks:

name
abbreviation
domains
term
symbol
space / type
constraints
reasonable value
unit
uncertainty
definition
notes
defining expression and so on.

Obvious question: are these really the 'fundamental knowledge'? We don't have a good answer to that. But it has been sufficient for us up to now.

So where do Chunks come in? Well, if you consider the above as atoms, then chunks are more like molecules, i.e. collections of atoms. Like molecules, some can arise, and some cannot. So there's "order" in how things assemble. The molecules that interest us are the ones that end up getting defined. This process was quite ad hoc: when a bunch of facts about a thing we were interested in occurred together in practice a bunch of times, we named that combination.

The classes that arise from that allow you to see two kinds of things:

particular atoms that make sense on their own
particular sub-molecules that make sense on their own

The underlying theory we should be using is that of Formal Concept Analysis (FCA). The attributes here would be "has information X in it", with X from the list above. Our Chunks are then the nodes of the lattice that occur in practice. Our classes help us navigate the lattice.
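As a rough illustration of what that would look like, here is a minimal brute-force sketch of FCA over a tiny context. The attribute assignments are made up for the example and are not an audit of what Drasil's chunks actually contain; all function names are hypothetical.

```haskell
import Data.List (intersect, nub, sort, subsequences)

type Obj  = String  -- a chunk type
type Attr = String  -- "has information X in it"

-- Illustrative context only (an assumption, not an audit of Drasil).
context :: [(Obj, [Attr])]
context =
  [ ("IdeaDict",     ["term"])
  , ("CI",           ["term", "abbreviation", "domains"])
  , ("QuantityDict", ["term", "symbol", "space"])
  ]

-- Every attribute mentioned anywhere in the context.
allAttrs :: [Attr]
allAttrs = sort (nub (concatMap snd context))

-- Intent: the attributes shared by all the given objects.
intent :: [Obj] -> [Attr]
intent os = foldr intersect allAttrs [as | (o, as) <- context, o `elem` os]

-- Extent: the objects that carry every one of the given attributes.
extent :: [Attr] -> [Obj]
extent as = [o | (o, as') <- context, all (`elem` as') as]

-- Formal concepts: (extent, intent) pairs closed under both maps.
-- Brute force over all object subsets; fine for a toy context only.
concepts :: [([Obj], [Attr])]
concepts = nub [ (extent i, i)
               | os <- subsequences (map fst context)
               , let i = sort (intent os) ]
```

In this toy context, the concept with intent `["term"]` has all three chunks in its extent, which is the lattice's way of saying that "has a term" makes sense on its own.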

Note that there are other analytical techniques (including those listed on that wikipedia page) that might make sense for us to use. FCA just makes sense to me.

An understanding of FCA also makes it clear that using Maybe is a hack: a proper concept should have an exact list of attributes that it embodies.

To bring it full-circle: there is all sorts of knowledge that exists that is well-defined, but doesn't possess an abbreviation. So we can't make abbreviation mandatory, as that would undermine our whole system. But when abbreviations exist, they should be used. From a pure programming point-of-view, that screams for Maybe, doesn't it? We've learned that this is not a good solution. A better solution seems to lie at the "knowledge retrieval" stage, where we can have functions that retrieve abbreviations if they exist, and our code should deal with the fact that abbreviations are not always present.

This whole discussion should probably end up somewhere more permanent than in an issue, at least when it gels.

What we need to do:

settle on an analysis technique for concepts,
list all the attributes we have
derive the concepts we need (by using co-occurrence in our actual knowledge database)
give names to the concepts we've thus extracted
create data-structures for those concepts
create accessors for all that information.

In practice, although the above steps should be done ab initio, I'm quite confident that a lot of what we currently have will stay as is, or with minor modifications.

smiths commented 1 year ago

Great discussion! Toward the end of reading the post from @JacquesCarette I started thinking that we really should design this again from the ground up, without worrying about backward compatibility. I then read the last sentence from @JacquesCarette and saw (after verifying that ab initio means from the beginning :smile:) that we are thinking the same.

In our re-design, I really like the idea of starting from the fundamental knowledge that is inside the chunks. We should also try to brainstorm knowledge that we think will be relevant in the future. We won't be able to make a perfect prediction, but I'll start a list of brainstormed thoughts below. In my list, I won't worry about whether the knowledge will end up inside a chunk, or possibly be tracked in a different way.

the local symbol used to represent a quantity. I think we currently "bake" the symbol into the quantity, making it difficult to change, but symbols aren't universal; they can change. In some cases, symbols are changed to avoid clashes between conventions when different domains are mixed (for instance, sigma is used for standard deviation, stress, and the Stefan-Boltzmann constant). In other cases, symbols are changed because of author/community preferences.
unit system. We implicitly (I believe) assume SI for everything, but we will also want to be able to use imperial units.
rationale information. For example for constraints, we may want to include a rationale for the constraint. Our "detailed derivations" currently provide rationale information for how we combine theories and assumptions to come up with a new theory.
refinement traceability information. Many theories will depend on other theories for their justification (rationale).
theory pre-conditions. Conditions that will need to be true to invoke a theory. That is, you can only use a theory if you can satisfy the pre-conditions. The pre-conditions will be assumptions.
theory post-conditions. The conditions that have to be true once a theory has been invoked.

samm82 commented 1 year ago

> To bring it full-circle: there is all sorts of knowledge that exists that is well-defined, but doesn't possess an abbreviation. So we can't make abbreviation mandatory, as that would undermine our whole system. But when abbreviations exist, they should be used. From a pure programming point-of-view, that screams for Maybe, doesn't it? We've learned that this is not a good solution. A better solution seems to lie at the "knowledge retrieval" stage, where we can have functions that retrieve abbreviations if they exist, and our code should deal with the fact that abbreviations are not always present.

To me, this still screams for Maybe. Let's say we have two concepts: "ordinary differential equation" and "formula". Only "ordinary differential equation" has an abbreviation ("ODE"). Then we can use the abbreviations Just "ODE" and Nothing. We can then define functions for different use cases: for example, a function for introducing a concept would output "ordinary differential equation (ODE)" and "formula", and a function for referring to a concept later on would output "ODE" and "formula". Perhaps at some point this could be boiled down to one function based on where in the document this function is called once our document traversal is more developed. Again, I may just be limited in my understanding of Lenses and there may be a cleaner way to do this.
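A minimal sketch of what I mean, assuming a stripped-down `Idea` type (hypothetical, not an existing Drasil chunk) and two hypothetical functions `introduce` and `refer`:

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical stripped-down chunk: a term plus an optional abbreviation.
data Idea = Idea { term :: String, abbr :: Maybe String }

ode, formula :: Idea
ode     = Idea "ordinary differential equation" (Just "ODE")
formula = Idea "formula" Nothing

-- First mention: the full term, followed by the abbreviation if there is one.
introduce :: Idea -> String
introduce (Idea t ma) = maybe t (\a -> t ++ " (" ++ a ++ ")") ma

-- Later mentions: the abbreviation if there is one, otherwise the term.
refer :: Idea -> String
refer (Idea t ma) = fromMaybe t ma
```

Here `introduce ode` gives "ordinary differential equation (ODE)" and `refer ode` gives "ODE", while both functions give "formula" for `formula`.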

JacquesCarette commented 1 year ago

There are two places where Maybe can be used:

in the data representation,
in what the data accessors return.

You're absolutely correct that "this" (example: accessing abbreviations) screams for Maybe. What we've done is use a data representation that encodes that. What I'm arguing is that I think we should let the accessors do that instead, i.e. have lenses that return a Maybe. So we'd have HasX classy-lenses and MayHaveX classy-lenses. We could have instances of MayHaveX for all sorts of things where we already know there is no X but where asking the question isn't silly. We do need to be careful to not implement MayHaveX where the question should not be asked.

From the point of view of our usage, lenses are just polymorphic getters. We want to be able to "get X" from some representation without caring how X is embedded in the data we've been handed, as long as we're promised that X is in there somewhere.
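To make the HasX / MayHaveX split concrete, here is a rough sketch in which plain getter classes stand in for classy lenses, so that it needs no lens library. The class names, the two data types, and `shortName` are all hypothetical, not existing Drasil code.

```haskell
import Data.Maybe (fromMaybe)

-- "HasX": the information is promised to be in there somewhere.
class HasTerm c where
  termOf :: c -> String

-- "MayHaveX": asking is sensible, but the answer may be Nothing.
class MayHaveAbbr c where
  abbrOf :: c -> Maybe String

-- Neither representation stores a Maybe.
data Plain       = Plain String               -- term only
data Abbreviated = Abbreviated String String  -- term and abbreviation

instance HasTerm Plain       where termOf (Plain t)         = t
instance HasTerm Abbreviated where termOf (Abbreviated t _) = t

-- The Maybe appears only in what the accessor returns.
instance MayHaveAbbr Plain       where abbrOf _                 = Nothing
instance MayHaveAbbr Abbreviated where abbrOf (Abbreviated _ a) = Just a

-- Client code does not care how (or whether) the abbreviation is stored.
shortName :: (HasTerm c, MayHaveAbbr c) => c -> String
shortName c = fromMaybe (termOf c) (abbrOf c)
```

The point of the sketch is that `Plain` knows statically that it has no abbreviation, yet `shortName` still works on it; a type for which the question makes no sense would simply get no `MayHaveAbbr` instance.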

smiths commented 1 year ago

@samm82 has your question been answered? If so, can we close this issue?

samm82 commented 1 year ago

> We could have instances of MayHaveX for all sorts of things where we already know there is no X but where asking the question isn't silly.

To me, it seems like having an abbreviation should be one of these MayHaveX classy-lenses. I don't really see the advantage of having a type that enforces the existence of an abbreviation, although a deeper investigation might need to be conducted. If we decide that attaching a domain at the Idea level is a code smell, then I propose we merge the two chunks. Otherwise, I think that keeping them separate makes sense: both would have a Maybe String for an abbreviation, and CI would also contain a list of domains (we could even make this more explicit by having a CI be an IdeaDict and a list of domains).
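The "keep them separate" option could look something like the following sketch, where the field names are hypothetical and the types are simplified to plain strings:

```haskell
-- Hypothetical simplification: an idea with an optional abbreviation.
data IdeaDict = IdeaDict
  { ideaTerm :: String
  , ideaAbbr :: Maybe String
  } deriving (Show, Eq)

-- A CI is then explicitly an IdeaDict plus a list of domains.
data CI = CI
  { ciIdea    :: IdeaDict
  , ciDomains :: [String]
  } deriving (Show, Eq)

-- Abbreviation access is delegated to the wrapped IdeaDict.
ciAbbr :: CI -> Maybe String
ciAbbr = ideaAbbr . ciIdea
```

With this shape, the only thing CI adds is the domains, so the question of whether CI should exist reduces to whether domains belong at this level.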

I'm noticing now that if we decide that storing abbreviation information only makes sense in the context of Maybes, then having a distinction between NamedChunks and IdeaDicts may not be useful, since an IdeaDict is just a NamedChunk with a maybe abbreviation, and we (should) know whether or not a quantity will have an abbreviation when we create it.

JacquesCarette commented 1 year ago

Reconstructing our thinking from ~6-7 (!!!) years ago, we noticed that many important 'concepts' (where I use the term informally) had a tell-tale sign that they were more important than others: they came with an abbreviation. This was, of course, purely an observation on the sample that we had. Though it does still seem to hold. Where we seemed to have made an error was to enshrine this in our data representation.

Taking a step back, it does seem odd to enforce the existence of an abbreviation. An abbreviation really is something that may exist.

We really do need to go back to the blackboard (perhaps even literally!) and revisit all our chunks (their contents, their name, their intent, their constructors). An in-person design meeting is likely needed.

JacquesCarette commented 1 year ago

About closing this issue: there's a lot of valuable information in this issue, which should either be in the wiki (best) or under a discussion topic, so it doesn't just disappear. It could be that it's already been transferred, in which case we can probably close this in favour of continuing the discussion elsewhere, but I'd rather not close until I'm sure the material won't "disappear from view".

samm82 commented 1 year ago

This content has been migrated to the new Chunk Observations wiki page and will be cleaned up and organized (and added to throughout my investigation with any information that is more general than just a few chunks).

JacquesCarette / Drasil

What is the distinction between `IdeaDict` and `CI`? #3196
