Where does information belong?

JacquesCarette commented 4 years ago

This issue has been central to Drasil for a long time, but now there is an issue (#2123) that is bringing parts of the problem back to the fore.

In part, SystemInformation is a big hack. We never defined what a 'System' is, so it's very hard to know if some information belongs there or elsewhere.

For example: authorship. A single example can contain multiple authors. One person might have written the (original) code, another the SRS, and yet a third might have written the description of the 'system' as a whole. We need to give proper attribution (this is indeed very important to science), so we need to define the entities within Drasil that are things whose authorship is possible.

muhammadaliog3 commented 4 years ago

I will be dividing this comment into the following sections:

Code Investigation -- this time I explore systemInformation backwards, hopefully to get a different perspective. You don't have to read it to understand the conceptual sections; it is more of a place to store my analysis.
Conceptual Problems
Conceptual (and high-level code) solutions

Code Investigation

_sysinfodb -------- This is a map that contains almost all of a Drasil example: all symbols (except inputs and outputs), all concepts, all units, the traceability map, the reference map, all of the data definitions in a single map, the instance/theory/general definition maps, the conceptInstance map, the section map, and the labelled content map (which holds figures, usually graphs).
_usedInfodb ------- I could not figure out what this is. What we do know is that it is a ChunkDB with at most 2 fields filled out (at least in the examples): a termMap and/or a ConceptMap.
_refbyMap ----------- self-explanatory
_constraints, _constants, _inputs, _outputs, _datadefs -------- self-explanatory; they contain all the instances
_defSequence -------------- this is a list of the following
```
-- Parametric form; _defSequence instantiates it as [Block QDefinition]
data Block a = Coupled a a [a] | Parallel a [a]
```
It is empty in some examples (such as Projectile) and filled in other examples (such as SSP). However, it is not used within drasil-code or drasil-docLang. It could, therefore, be removed without issue, or it could be something that is not implemented yet.
_concepts ------------- is always empty in all the examples and is not used within drasil-code or drasil-docLang
_definitions ------------- is an empty list in some examples and, in other examples, a list of QDefinitions from various places (theory, instance, and general definition models, data definitions, etc.). In CodeSpec.hs they are used to get derived inputs in case there are no data definitions, and then ultimately to create an execution order of code definitions.
_quants --------------- all symbols that are not input and not output symbols
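
To make the `_defSequence` entry above more concrete, here is a minimal sketch of what a consumer of `[Block a]` could look like. `Block` is restated in parametric form (an assumption based on the `_defSequence :: [Block QDefinition]` field), `String` stands in for `QDefinition`, and `blockMembers`, `sequenceMembers`, and `exampleSequence` are hypothetical names that do not exist in Drasil.

```haskell
-- Sketch only: `String` stands in for Drasil's QDefinition so that this
-- file loads on its own. The helper names below are hypothetical.
data Block a
  = Coupled a a [a]   -- two or more definitions grouped as "coupled"
  | Parallel a [a]    -- one or more definitions grouped as "parallel"
  deriving (Show)

-- | Every definition mentioned in a single block, in order.
blockMembers :: Block a -> [a]
blockMembers (Coupled x y zs) = x : y : zs
blockMembers (Parallel x ys)  = x : ys

-- | Flatten a whole definition sequence, preserving block order.
sequenceMembers :: [Block a] -> [a]
sequenceMembers = concatMap blockMembers

-- | Placeholder data; the names carry no meaning.
exampleSequence :: [Block String]
exampleSequence = [Coupled "defA" "defB" ["defC"], Parallel "defD" []]
```

In GHCi, `sequenceMembers exampleSequence` evaluates to `["defA","defB","defC","defD"]`. If nothing in drasil-code or drasil-docLang ever needs a traversal like this, that supports the observation that the field is currently unused.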

_authors -------------- list of persons

data Person = Person { _given :: String
                 , _surname :: String
                 , _middle :: [String]
                 , _convention :: Conv
                 } deriving (Eq)

It could potentially contain more, such as telephone, email, institution

_sys ---------- self-explanatory, but it is a commonIdeaWithDict, hence it includes an abbreviation and a short and full title.
_kind ----------- self-explanatory, but it is any chunk with an idea, for example srs. There could be an improvement here to allow more than just one kind of artifact, which will most definitely happen in the future. Therefore it should be changed to a "[c]", where "c" is any chunk with an idea.

Potential new additions: _purpose (it should be a list of sentences), _configfiles (should be a list of files, it should probably be in some datatype such as type File = ConfigFile String )

Key findings: All the types seem to be fine, except that authors should contain more than just author names (also their emails, phones, and institutions), and kinds should not be restricted to just one value. Although _defSequence and _concepts are conceptually useful, there seems to be no practical use for them in any of the artifacts.

Conceptual Problems

I think what got this issue started was that system information included so many different pieces of information, (each of different types and for different artifacts), it looked disorganized.

Requirements: Solution should make system information fields ordered, make system information fields grouped and make system information field types consistent.

Hence I will start by investigating some possible discriminators or clusters within system information:

Personal Information vs Non Personal information
Artifact generation choices (such as choices, kind ) vs Non-Artifact generation choices
Hard Science (symbols, derivations, equations, Models) vs Soft Science (concepts, example name, purpose)
Problem Space (Input, Output, purpose, Constraints) vs Solution Space (quantities, models, symbols)
Qualitative vs Quantitative

After we pick the appropriate groups, we need to pick the appropriate way to divide the groups. Here are some possibilities:

Make a different record for each group; that is, split system information into 'n' records, where 'n' is the number of groups.
Make a record of records; i.e., make each of the groups a record within the system information, so system information will contain 'n' fields, where 'n' is the number of groups. Each field will ITSELF be a record.
Leave system information as is, but split the groups with code COMMENTS.
Leave most of system information the same, but make the fields of contention (such as authors, purpose, kinds and name, i.e. the fields that started this issue) a DATABASE within system information.

Some nice names I thought of: problemSpecifications/solutionSpecifications, personalInformation/systemInformation, SystemScience/(I couldn't think of the opposite), systemChoices/systemSpecifications.

Conceptual (and high-level code) solutions

I don't think we should have more than 3 groups, as that could complicate things. I also don't think we should have a GOOL choices record; rather, we should have it incorporated into "drasil choices".

I think that Hard Science vs (Soft Science + Choices) would be the best split, along with a record of records, meaning we keep the same structure of one big record of "systemInformation" that contains all the raw chunk 'information'.

Hard Science:

_quants :: [e]
_definitions :: [QDefinition] --FIXME: will be removed upon migration to use of [DataDefinition] below
_datadefs :: [DataDefinition]
_inputs :: [h]
_outputs :: [i]
_constraints :: [j] --TODO: Add SymbolMap OR enough info to gen SymbolMap
_constants :: [QDefinition]
_sysinfodb :: ChunkDB
_usedinfodb :: ChunkDB

Soft Science:

_sys :: a
_kind :: b
_authors :: [c]
_purpose :: d
_concepts :: [f]
_defSequence :: [Block QDefinition]
gool choices
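
As a rough illustration of the Hard Science / Soft Science split as a record of records, here is a small sketch. All types are simplified placeholders (the real fields are polymorphic or use chunk types), only a few of the fields are shown, and `HardScience`, `SoftScience`, `_hard`, `_soft`, and `exampleSI` are hypothetical names chosen for this comment.

```haskell
-- Placeholder types so that the sketch loads on its own; in Drasil these
-- would be chunk types or type parameters of SystemInformation.
type Quantity       = String
type QDefinition    = String
type DataDefinition = String
type Person         = String

-- Information that is directly usable for computation.
data HardScience = HardScience
  { _quants    :: [Quantity]
  , _datadefs  :: [DataDefinition]
  , _inputs    :: [Quantity]
  , _outputs   :: [Quantity]
  , _constants :: [QDefinition]
  } deriving (Show)

-- Information that is mostly aimed at human readers.
data SoftScience = SoftScience
  { _sys     :: String
  , _kind    :: String
  , _authors :: [Person]
  , _purpose :: [String]
  } deriving (Show)

-- One big record whose fields are themselves records.
data SystemInformation = SI
  { _hard :: HardScience
  , _soft :: SoftScience
  } deriving (Show)

-- Placeholder value, not taken from any Drasil example.
exampleSI :: SystemInformation
exampleSI = SI
  { _hard = HardScience ["q1"] ["dd1"] ["in1"] ["out1"] ["c1"]
  , _soft = SoftScience "ExampleSystem" "srs" ["First Author"] ["A purpose."]
  }
```

With this layout a caller reaches the authors through two accessors, e.g. `_authors (_soft exampleSI)`, so every regrouping of fields changes call sites unless access is hidden behind lenses or helper functions.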

JacquesCarette commented 4 years ago

Ok, commenting on each part. [Excellent investigation BTW]

Code Investigation

sysinfodb is indeed supposed to be a "database of all information about the system"
usedInfodb is supposed to be a "database of all the information that will be used", a subset of the above. It would be good to figure out where it is actually used
If I remember well, defSequence is supposed to define "sequences of definitions", either coupled or that can be done 'in parallel'. It's probably needed for something that was only partly implemented. I would look at PR #1664 and issue #287 for the origins of this.
I'm pretty sure that concepts, definitions and quants are all supposed to be used, but apparently the backend grabs its information in other ways!
authors: we don't want too much personal information here. It is true that it should probably be a list of "author identifiers" (like DOIs) rather than just names, but it's probably not urgent to fix that.
sys: I don't actually know why this exists separately from
kind: I think this is a bad name!

There are some problems I already see from the above analysis

we don't know for sure what the intent of each component is supposed to be. [The name and the use are not necessarily good hints either]
we still don't know what a "system description" is
clearly some things are under-implemented, while other things have suffered from bitrot

Conceptual Problems

It's deeper than just not knowing what each piece of information is supposed to mean, there is also some uncertainty as to what this is supposed to represent!

Your discriminators are excellent. They, de facto, partly answer the question: what are the kinds of information that we've found useful to have as part of the description of a system? What the categories say, as subtext:

people are involved in creating systems, and appropriate credit is important
systems involve making choices; but there are different kinds of choices, and these should not all be lumped together
some information is directly accessible to computer processing (what you call Hard Science) while other parts are more oriented towards humans (aka Soft Science, and more of ontological and pedagogical use)
that it is important to split Problem from Solution
qualitative vs quantitative, to me, is a repeat of the Hard/Soft split

I agree that splitting system information into a record-of-records is probably the way to go. The danger is that we refactor this over and over. If we use lenses properly, it's not such a big deal. So it's probably ok if we don't quite get this right the first time.

Nevertheless, we should still think a little harder about "what is a system description". What are the ingredients that make that up?
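
To illustrate why lenses soften the cost of repeated refactoring, here is a self-contained sketch. The lenses are written by hand in van Laarhoven style so that no library is needed; Drasil itself uses a lens library, so `view` and `over` below are local stand-ins. `People`, `_people`, `authorNames`, and `exampleSI` are hypothetical.

```haskell
{-# LANGUAGE RankNTypes #-}
import Data.Functor.Const (Const (..))
import Data.Functor.Identity (Identity (..))

-- A van Laarhoven lens, defined locally instead of importing a lens library.
type Lens s a = forall f. Functor f => (a -> f a) -> s -> f s

-- Read the focused value.
view :: Lens s a -> s -> a
view l = getConst . l Const

-- Modify the focused value.
over :: Lens s a -> (a -> a) -> s -> s
over l f = runIdentity . l (Identity . f)

-- Hypothetical nested layout: authors live inside a sub-record.
newtype People = People { _authorNames :: [String] } deriving (Show)

data SystemInformation = SI
  { _sysName :: String
  , _people  :: People
  } deriving (Show)

people :: Lens SystemInformation People
people f si = (\p -> si { _people = p }) <$> f (_people si)

authorNames :: Lens People [String]
authorNames f p = (\ns -> p { _authorNames = ns }) <$> f (_authorNames p)

-- The only lens that callers use. If the records are regrouped later,
-- this definition changes and the call sites stay the same.
authors :: Lens SystemInformation [String]
authors = people . authorNames

exampleSI :: SystemInformation
exampleSI = SI "ExampleSystem" (People ["First Author"])
```

Here `view authors exampleSI` evaluates to `["First Author"]`, and `over authors ("Second Author" :) exampleSI` prepends a name without the caller knowing where authors are stored.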

Conceptual & Solutions

Here is my thinking about what is in "system information"

1. background knowledge pertinent to the problem
2. a definition of the problem
3. constraints that describe a "good" solution
4. structure of the solution
5. choices made in the solution
6. people involved in the creation of 1-5.

I put 1-5 in that order because, I think, they depend on each other in that order. 6 is different, in that it is more meta and applies to all.
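
One possible reading of "6 is more meta and applies to all" is an attribution wrapper around each of the other five ingredients. The sketch below is only that reading, not a proposal for actual Drasil types: `Attributed`, `SystemDescription`, `allCreators`, and every field name are hypothetical, and plain strings stand in for real content.

```haskell
import Data.List (nub)

newtype Author = Author String deriving (Eq, Show)

-- Ingredient 6 as a wrapper: any part can carry its own attribution.
data Attributed a = Attributed
  { creators :: [Author]
  , content  :: a
  } deriving (Show)

-- Ingredients 1-5, in the dependency order given above.
data SystemDescription = SystemDescription
  { background  :: Attributed [String]  -- 1. background knowledge
  , problem     :: Attributed String    -- 2. definition of the problem
  , constraints :: Attributed [String]  -- 3. what a "good" solution satisfies
  , structure   :: Attributed [String]  -- 4. structure of the solution
  , choices     :: Attributed [String]  -- 5. choices made in the solution
  } deriving (Show)

-- Everyone who contributed to any part, without duplicates.
allCreators :: SystemDescription -> [Author]
allCreators sd = nub $
     creators (background sd)
  ++ creators (problem sd)
  ++ creators (constraints sd)
  ++ creators (structure sd)
  ++ creators (choices sd)

-- Placeholder value: two authors split across the five parts.
exampleDescription :: SystemDescription
exampleDescription = SystemDescription
  { background  = Attributed [Author "A"] ["some theory"]
  , problem     = Attributed [Author "B"] "a problem statement"
  , constraints = Attributed [Author "B"] ["a constraint"]
  , structure   = Attributed [Author "A"] ["a module"]
  , choices     = Attributed [Author "A"] ["a design choice"]
  }
```

This keeps per-part attribution (who wrote the problem statement versus who wrote the code structure) while still allowing a single author list to be derived for the system as a whole.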

I definitely doubt that's the end of the story. But it's a (re)start.

smiths commented 4 years ago

The discussion above from @muhammadaliog3 and @JacquesCarette is very helpful. I definitely like the approach of "reverse engineering" what the Drasil code says, but being wary because we know the ad hoc way some of the code was developed.

My instinct is that for step 5 above (choices made in the solution), we should divide the choices into at least two categories: choices related to the requirements (physical model) and choices related to the design (software structure, data structures, algorithms). I believe that most of our decisions are currently related to design, but when we get further with the notion of a family of programs, we will also have physics related variabilities. For instance, in GlassBR, we currently assume that LSF (load share factor) is 1.0, because we have only one pane of glass. We could remove this assumption and have another member of the family available.

@JacquesCarette during one of our group meetings last summer you brought up the concept of refinement. I think you had a different name for it though. The idea is that we have one model and then we make decisions and then we have a new model. I don't know the proper terminology, but I remember feeling that what you were presenting could give us a structure within which we could place our different ideas.

muhammadaliog3 commented 4 years ago

I think what is going on is that some things are obtained by using Map.elems on Maps from CDBs. In other words, sometimes you just want all of the information rather than specific information; that is why we probably don't need to store all the definitions/concepts...

Here is a good example of how all of the general definitions are used by just using the general definition map.

  | t `elem` keys (s ^. gendefTable)          = makeRef2S $ gendefLookup      t (s ^. gendefTable)
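
The two access patterns (everything versus one specific entry) can be sketched with `Data.Map` from the `containers` package. `GenDefn`, `GenDefnMap`, and the sample table below are simplified stand-ins for the corresponding ChunkDB pieces, not their real definitions.

```haskell
import qualified Data.Map as Map

type UID = String

-- Simplified stand-in for a general definition chunk.
newtype GenDefn = GenDefn { gdName :: String } deriving (Eq, Show)

-- Simplified stand-in for the general definition table of a ChunkDB.
type GenDefnMap = Map.Map UID GenDefn

-- Placeholder table with two entries.
sampleTable :: GenDefnMap
sampleTable = Map.fromList
  [ ("gd1", GenDefn "first")
  , ("gd2", GenDefn "second")
  ]

-- "Just give me everything": no separate list field is needed.
allGenDefns :: GenDefnMap -> [GenDefn]
allGenDefns = Map.elems

-- "Give me one specific entry": lookup by UID.
lookupGenDefn :: UID -> GenDefnMap -> Maybe GenDefn
lookupGenDefn = Map.lookup
```

`Map.elems` returns the values in ascending key order, so `map gdName (allGenDefns sampleTable)` evaluates to `["first","second"]`. A separate list of definitions in SystemInformation only adds something if it carries information the map does not, such as a different ordering or a deliberate subset.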

Possible fields we could include in system information. NOTE: I know that the drasil philosophy is to only add things when they are needed, but I think that if we put some extra useful information in system information it would encourage other artifacts to reuse it in some way. This can hopefully spur some ideas.

IM/TM/GD/DD
Goals
Requirements
Purpose
Citations

Finally answering, what describes or defines a system (@smiths could help with this)

A small description, such as a name and purpose.
Problem statement, that defines constraints and input chunks with the proper units/types.
All of the relevant background information, such as sample input files, config files, and work from other people that are mentioned in the citations
Solution/output, which includes the desired chunks, their units/types, and constraints
Appropriate quantities, instance/theory models, etc.: hard science that provides a way between the input and output
Appropriate justification for using the methods defined in “hard sciences”, such as theory models, derivations, concepts, assumptions
Specifies kinds of artifacts to present the solution and choices made in REPRESENTATION of the artifacts (choices made with regards to actual scientific content should belong in the justification section)
All the people involved
A possible split that I did not include has to do with separating the hard sciences. This split was separating defining symbols, defining equations, and solving equations. This is because these are often interwoven together so keeping them in multiple places could create confusion.
A possible section I did not include was “a system should be able to reference itself”, rather this should be included in the “presentation of the artifacts”.

Even more concretely a system defines:

Purpose+Name
Input
Output
Config files, Citations of relevant research work
Symbols, Quantities, constraints (if there are any), constants, units, datadefn table, instance models, theory models
Concepts, definitions, terms, concept instances, general definitions, assumptions
Kind of artifact, choices, traceability matrix and referencing to provide
Authors

Current system information storage strategy:

Store everything in a database (sysinfodb), and add individual fields to system information if they are needed. We should keep this strategy.
So that unused information is not included in the artifacts, a usedinfodb is created (this, however, is only used for the table of acronyms). We should keep this strategy.
The choices are kept in a separate record from the “systeminformation”. We should NOT keep this strategy.
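
If usedinfodb is meant to be a subset of sysinfodb, one option is to derive it instead of building it by hand. The sketch below assumes that the set of UIDs an artifact actually refers to is available; that assumption, along with `TermMap`, `usedSubset`, and the sample data, is hypothetical and does not describe how Drasil currently fills usedinfodb.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type UID = String

-- Simplified stand-in for a term map: UID to full term.
type TermMap = Map.Map UID String

-- Keep only the entries whose UID is actually referenced.
usedSubset :: Set.Set UID -> TermMap -> TermMap
usedSubset used db = Map.restrictKeys db used

-- Placeholder "full" database.
sysTerms :: TermMap
sysTerms = Map.fromList
  [ ("srs", "Software Requirements Specification")
  , ("dd",  "Data Definition")
  , ("tm",  "Theory Model")
  ]

-- Placeholder "used" database, e.g. for a table of acronyms.
usedTerms :: TermMap
usedTerms = usedSubset (Set.fromList ["srs", "tm"]) sysTerms
```

Here `Map.keys usedTerms` evaluates to `["srs","tm"]`. Deriving the subset this way would guarantee that usedinfodb never contains anything that is missing from sysinfodb.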

NEW SYSTEMINFORMATION:

I am leaving out the types, but for new fields in system information I will mention the types
Each section can be subdivided or combined at the request of @smiths or @jacquescarette
Theory models and derivations could go in systemExplanation, but I kept them in systemScience; if you want, we could talk more about this
All of these sections will be implemented as a record of records within system information

systemProblem

Title (this is not a new field it is the _sys field in the old SI), purpose (), inputs, outputs, constraints, config files (list of strings), citations

systemScience

Symbols, Quantities, constants, units, datadefn table, instance models, theory models
As of right now only the constants, quantities and data definitions are in SI hence only they will be in this section.

systemExplanation

Concepts, definitions, terms, concept instances, general definitions, assumptions, defSequence
As of right now only the concepts and definitions are in SI hence only they will be in this section

systemArtifacts

A record for GOOL choices, Traceability matrix, Document kind (this should be a list of concepts, currently I believe the only option we have is SRS)

systemAuthors

Authors
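
Put together, the five sections could look roughly like the sketch below. Every type is a placeholder (`Chunk` and `Sentence` are plain strings here), the GOOL choices record and the traceability matrix are left out because their types are not spelled out above, and only the fields listed as currently present in SI appear in systemScience and systemExplanation. The `ConfigFile` newtype is one way to spell the "list of strings" for config files and is an assumption.

```haskell
-- Placeholder types; in Drasil these would be chunk types.
type Sentence = String
type Chunk    = String

newtype ConfigFile = ConfigFile FilePath deriving (Eq, Show)

data SystemProblem = SystemProblem
  { _title       :: Chunk        -- the old _sys field
  , _purpose     :: [Sentence]
  , _inputs      :: [Chunk]
  , _outputs     :: [Chunk]
  , _constraints :: [Chunk]
  , _configFiles :: [ConfigFile]
  , _citations   :: [Chunk]
  } deriving (Show)

data SystemScience = SystemScience
  { _constants :: [Chunk]
  , _quants    :: [Chunk]
  , _datadefs  :: [Chunk]
  } deriving (Show)

data SystemExplanation = SystemExplanation
  { _concepts    :: [Chunk]
  , _definitions :: [Chunk]
  } deriving (Show)

-- GOOL choices and the traceability matrix are omitted from this sketch.
newtype SystemArtifacts = SystemArtifacts
  { _kinds :: [Chunk] } deriving (Show)

newtype SystemAuthors = SystemAuthors
  { _authors :: [String] } deriving (Show)

data SystemInformation = SI
  { _problem     :: SystemProblem
  , _science     :: SystemScience
  , _explanation :: SystemExplanation
  , _artifacts   :: SystemArtifacts
  , _authorship  :: SystemAuthors
  } deriving (Show)

-- Which document kinds should be generated for this system.
documentKinds :: SystemInformation -> [Chunk]
documentKinds = _kinds . _artifacts

-- Placeholder value, not taken from any Drasil example.
exampleSI :: SystemInformation
exampleSI = SI
  { _problem     = SystemProblem "ExampleSystem" ["A purpose."] ["in1"] ["out1"]
                     [] [ConfigFile "example.cfg"] []
  , _science     = SystemScience ["c1"] ["q1"] ["dd1"]
  , _explanation = SystemExplanation [] []
  , _artifacts   = SystemArtifacts ["srs"]
  , _authorship  = SystemAuthors ["First Author"]
  }
```

Making `_kinds` a list reflects the earlier point that more than one kind of artifact should be allowed.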

JacquesCarette commented 4 years ago

@smiths the kinds of refinement I was thinking of are "theory specialization", when you take a generic theory and instantiate some of its parameters to something more specific [and often then simplify the resulting model.]
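
A toy version of "instantiate some parameters, then simplify" is sketched below. The expression type, `specialize`, `simplify`, and the `genericModel` expression are all made up for illustration; in particular, `LSF * load` is not GlassBR's real model and only borrows the parameter name from the load share factor example mentioned earlier.

```haskell
import qualified Data.Map as Map

-- A deliberately tiny expression language (not Drasil's Expr).
data Expr
  = Var String
  | Lit Double
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Step 1: instantiate some of the parameters with concrete values.
specialize :: Map.Map String Double -> Expr -> Expr
specialize env = go
  where
    go (Var v)   = maybe (Var v) Lit (Map.lookup v env)
    go (Lit x)   = Lit x
    go (Add a b) = Add (go a) (go b)
    go (Mul a b) = Mul (go a) (go b)

-- Step 2: simplify the resulting model (only a few identities are handled).
simplify :: Expr -> Expr
simplify (Add a b) = case (simplify a, simplify b) of
  (Lit 0, e)     -> e
  (e, Lit 0)     -> e
  (Lit x, Lit y) -> Lit (x + y)
  (a', b')       -> Add a' b'
simplify (Mul a b) = case (simplify a, simplify b) of
  (Lit 1, e)     -> e
  (e, Lit 1)     -> e
  (Lit x, Lit y) -> Lit (x * y)
  (a', b')       -> Mul a' b'
simplify e = e

-- Hypothetical generic "theory" with a free parameter LSF.
genericModel :: Expr
genericModel = Mul (Var "LSF") (Var "load")

-- The family member obtained by assuming LSF = 1.0.
singlePaneModel :: Expr
singlePaneModel = simplify (specialize (Map.fromList [("LSF", 1.0)]) genericModel)
```

Here `singlePaneModel` evaluates to `Var "load"`: the assumption has been absorbed into a simpler model. Recording the substitution map itself as data would make the assumption traceable, and a different map would give another member of the family.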

@muhammadaliog3 Nice stab on "what describes or defines a system". @oluowoj this is the kind of speculating from observed data that we'd like you to be doing.

What you've provided is what I would call a really good first "brainstorming" of our current ingredients that describe/define a system. And you also gave a good first stab at categorizing that information. You are correct that all the information you list is part of a system description -- but we need to know the origin of each piece of information, what it means, whether it is human-specified or derived from some other basic information, etc. I need to try to circle back to this some time this week.

JacquesCarette commented 4 years ago

@smiths I think we should have a meeting about this. Might even make sense to have an all-hands?

smiths commented 4 years ago

Yes, a meeting is a good idea, but I won't be available tomorrow or next week. Next week is a vacation week, and tomorrow is the day where I have to get all my work done so that I can go on vacation. :-) I'll try to keep up on e-mail, but I won't be able to do a meeting until the week of August 4th.

balacij commented 2 years ago

By any chance, did the meeting (mentioned above) occur? If so, are notes public anywhere?

smiths commented 2 years ago

@balacij - No, I do not believe that a meeting took place.

JacquesCarette commented 2 years ago

It didn't. This is still an issue that is quite open.

balacij commented 2 years ago

Thank you, @smiths! :smile:

Hopefully we can continue this discussion soon. After reading Dr. Carette's changes to the Information Encoding wiki page, I think I understand a lot more thanks to having it formalized and sitting in front of me. I've also had some thoughts related to this recently (albeit in a roundabout way -- analyzing our package structure). I just need to formalize them, and then I'll try to post my own thoughts too.

balacij commented 2 years ago

Ah, thank you as well, @JacquesCarette (I posted just 14s after you, but didn't see your comment until after I posted)! The latest changes to Information Encoding have been quite eye-opening.

smiths commented 2 years ago

Yes, we should discuss this issue again soon.

JacquesCarette / Drasil

Where does information belong? #2195