Open balacij opened 2 years ago
I should also mention that the above code is just a prototype, I think we can refine it a bit more. In particular, I think we should be using newtype
s for the ChunkDBs, and creating errors (either as an Either X String, or an actual plain error
) where appropriate.
I think the technical ideas on the database side are sound. I'm going to want to work on the names of the functions provided by this, but that's not too hard. I am fine with the use of TypeRep
and Typeable
.
The thing that I am most curious about is how this will change (or enable change) in user code, as well as in data-drasil
. My feeling is that if we adopt a monadic style (specifically: using a State
monad with a ChunkDB
state) for 'registering', the resulting code will likely be fine. This 'registration' would be what replaces the explicit gathering into lists that we currently have. This is an important "second half" of doing this!
I think the technical ideas on the database side are sound. I'm going to want to work on the names of the functions provided by this, but that's not too hard. I am fine with the use of TypeRep and Typeable.
This sounds great!
The thing that I am most curious about is how this will change (or enable change) in user code, as well as in data-drasil. My feeling is that if we adopt a monadic style (specifically: using a State monad with a ChunkDB state) for 'registering', the resulting code will likely be fine. This 'registration' would be what replaces the explicit gathering into lists that we currently have. This is an important "second half" of doing this!
I like the idea of using State
for 'registering'. I will need to spend a bit of time on this.
I also briefly considered how this would impact explicit need for gathering chunks into lists and then 'registering' them, but I was left a bit confused.
My first thought on how this "should be done" is: we wouldn't have any "collection" of chunks from within the modules they're defined. Instead, in the main executable program, we would perform some sort of reflection on known modules from the lib
folders and packages, and gather all ones in their scopes.
In Java, I would likely either use Java's Instrumentation API (Agents) to find the related modules at the "class loading time" (I'm not too sure what to call this; pre-runtime-loading?), or I would use Java's reflection lib (in the early stage of my programs runtime) to search for '@' Annotated definitions with a specific annotation class, and annotations that ensures the definition is kept for runtime usage & analysis (@Retention(RetentionPolicy.RUNTIME)
, and @Target(ElementType.TYPE)
).
I suppose the equivalent for the Instrumentation API could be a GHC plugin. Haskell has it's own reflection (and template haskell), which, I think would have (at least) the same functionality as Java's.
Do you have any thoughts on how this second half should be done?
With the first method, we might end up with less "boilerplate annotation" code placed in each "data" module.
We really don't want to do reflection in Haskell. A lot of this stuff in Java is really meta-linguistic (i.e. outside the actual language). I want to stay within Haskell as much as possible, and within well-understood Haskell too.
So we need to think about the process of registration. Doing inspection on an SRS to find used stuff is good. But we also need to "declare" which things from the standard databases (i.e. data-drasil
) we're going to use. I don't want that part to be done via deep black magic.
We really don't want to do reflection in Haskell. A lot of this stuff in Java is really meta-linguistic (i.e. outside the actual language). I want to stay within Haskell as much as possible, and within well-understood Haskell too. So we need to think about the process of registration. Doing inspection on an SRS to find used stuff is good. But we also need to "declare" which things from the standard databases (i.e. data-drasil) we're going to use. I don't want that part to be done via deep black magic.
Ah, ok. With your example, this makes a lot more sense to me now, thank you. I was thinking of a completely different direction!
Doing inspection on an SRS to find used stuff is good.
This sounds good to me.
But we also need to "declare" which things from the standard databases (i.e. data-drasil) we're going to use. I don't want that part to be done via deep black magic.
I wonder if we could be placing ChunkDB
s inside ChunkDB
s to indicate an imported "bundle of knowledge". Though, since UIDs for various chunks would be local to the ChunkDB
they're placed in, we would need a new mechanism for UID generation/assignment. Manually written strings wouldn't quite cut it anymore because those strings would be local to the specific ChunkDB
s they're intended to be placed in. I'm not quite sure how to resolve this yet (vaguely, it seems like we need to give ChunkDB
context to chunk definitions when creating them). The sound of placing ChunkDB
s within ChunkDB
s sounds like a neat way of importing knowledge bundles however!
EDIT:
With ChunkDB
s placed inside ChunkDB
s, we might only need to collect chunks into their databases once, and then when referencing chunks from other ChunkDB
s, we would import the whole ChunkDB
they originated from (in the system's ChunkDB
), and then the chunk (from whatever chunk that relies on it). However, I would hope that references wouldn't require us to also manually note from which ChunkDB
the chunk originates from. Through 'tree shaking' (haha), we can also get rid of all unused chunks from the imported ChunkDB
s so that they don't impact SRS documents and code too much.
ASIDE: Would we also want to create an idea of "foreground" vs "background" knowledge within a system? This could likely be used by printers to know whether a piece of knowledge is relevant enough to be printed in it's printing context.
I think this ticket is also loosely relevant to #2195, specifically in understanding the role of SystemInformation
.
I don't necessarily want to put databases inside databases, but I think that creating a "current database for this example" from a collection of other databases, that's definitely a good idea. Which is roughly what you arrived at in your EDIT.
As far as UID
s are concerned, we can learn from URIs there, and use "qualified" names as a means to indicate provenance.
On foreground vs background: yes, but. What I mean is that I really want creators of an artifact to have to think about what they want in their artifact, then make an active choice. Now, when some sets of choice emerge as common, we can definitely create short-hands that automate it. [In many ways, a lot of doc-lang arose from such a process.]
And yes, that other ticket is relevant. The point is that we need to better understand the development process that we want for using Drasil. Because we've been so deep in prototype mode, we were mostly just doing things. Now is a good time to do more design. This is in part what the white paper is a prelude for.
The big question that underlies #2195 is just that: what's the end-to-end development process of a Drasil example? I mean, the process that we envision people should be using, when Drasil is able to support it.
I don't necessarily want to put databases inside databases, but I think that creating a "current database for this example" from a collection of other databases, that's definitely a good idea. Which is roughly what you arrived at in your EDIT. As far as UIDs are concerned, we can learn from URIs there, and use "qualified" names as a means to indicate provenance.
The main thing that I like about putting databases inside databases is that we get unique UIDs for cheap/free. However, it's quite complex in practice (duplicate and cyclical dependencies, & renaming UIDs from 'import's to be adjusted to it's new parent database), I haven't been able to come up with a design yet, and we'd likely need to be as careful as, if not more, general programming languages (the ones with support for modules & packages) in handling 'imports'. However, I might be thinking about this differently. Out of curiosity, do you think this is the main problem with databases in databases, or do you see a greater problem (or just other problems, I might just not be understanding how to design this at the moment)?
I think using qualified UIDs will mostly simplify the development. However, if we have conflicting UIDs for whatever reason, we might even see some Lens 'setter' usage to import specific modules with different UIDs to not cause breakage. Though, this might cause breakage in references from other imported chunks which depend on a re-UIDd chunk. I'm sure this will very rarely, if ever, occur, but, to mitigate it, it might be good for us to first build out the "concept domain" system you mentioned (to both analyze where breakage would occur, and fix it).
On foreground vs background: yes, but. What I mean is that I really want creators of an artifact to have to think about what they want in their artifact, then make an active choice. Now, when some sets of choice emerge as common, we can definitely create short-hands that automate it. [In many ways, a lot of doc-lang arose from such a process.]
Yes, I agree. I initially thought of coupling this kind of information with CHUNK
s, but I quickly realized that it would be too specific to a specific definition of background
and foreground
(which might be different in each printer!). Thank you.
And yes, that other ticket is relevant. The point is that we need to better understand the development process that we want for using Drasil. Because we've been so deep in prototype mode, we were mostly just doing things. Now is a good time to do more design. This is in part what the white paper is a prelude for. The big question that underlies #2195 is just that: what's the end-to-end development process of a Drasil example? I mean, the process that we envision people should be using, when Drasil is able to support it.
I see. I'll need to think about this for a while. Thank you :smile:
I've just now realized that I completely neglected to mention why I looked into this. I've been looking into this because I need to be able to place arbitrarily parameterized types inside ChunkDB
s and, potentially, SystemInformation
s. I will need to keep prototyping to see the extents of what will be possible, and how. However, I am quite happy with this progress for now. I still need to figure out how the State will work, but the rest mostly seems to make sense at least.
The thing that I am most curious about is how this will change (or enable change) in user code, as well as in data-drasil. My feeling is that if we adopt a monadic style (specifically: using a State monad with a ChunkDB state) for 'registering', the resulting code will likely be fine. This 'registration' would be what replaces the explicit gathering into lists that we currently have. This is an important "second half" of doing this!
I don't think I fully understand how/why we wouldn't need to gather into explicit lists anymore. Do you happen to have any examples of this?
register :: (HasUID c, TypeRep c) => c -> ChunkDB -> ((), ChunkDB)
can be seen as
do db <- get
let db' = add c db
put db'
aka
modify (add c)
So we could gather pieces of ChunkDB
instead. And wouldn't have to worry about types (i.e. homogeneous lists).
Ah, ok. Thank you.
Would we then have 1 large [Chunk]
from which we use to insert all the chunks, or would we have 1 area of code that just has modify
s called one after the other? Sorry, I'm still a bit confused as to how we can avoid the "explicit gathering".
There will be explicit gathering in drasil-data
. There can be implicit gathering from SRS specifications.
Also, we don't want to call modify
directly, we want to have our own functions (that may be thin wrappers).
Do you have a design in mind for the functions? If so, would it be easiest to do this in my first pass at this ticket?
As of right now, for a first pass at this ticket, I'm assuming we're going to have a bunch of Chunk
lists to replace the existing explicit lists.
Aside: I am working on this on the newChunkDB
branch.
No, I don't have such a design. I don't think it would be easiest to do it at the same time, as it's going to have serious repercussions for the examples.
Thank you. In that case, I will come back to it after a first pass.
Aside: While working on this, I realized that ChunkDB -> ChunkDB
transformers/printers sound interesting. We seem to have many combinators for things that perform a small change and give an updated variant of their input (e.g., Sentences, NPs, eventually Theories, etc). This type of transformer/printer is, approximately (or at least has a generally similar idea as), an example of the "database-level" version of the "composable printers" I was discussing in (1) of "Slightly deeper, but still fairly shallow, observations, and discussion" of #2883. I imagine we could, at least, systematically derive and add more chunks here.
union
is a fun example of a ChunkDB -> ChunkDB
. Import other ChunkDBs!
... though, we would need a unionWith
under-the-hood to uphold UID uniqueness :smile:
@JacquesCarette Do you know if there's any easy ways to debug <<loop>>
errors?
The existing code on newChunkDB
has some of the case studies giving me <
Well, <<loop>>
typically occurs when you have a definition in terms of itself, in ways that require information that is not available yet.
Thank you. I think I will need to break this up a bit so that I can tackle the problems a bit more gradually. Something's going awry, and I'm struggling to find just where exactly it is.
Approximately, the next few steps are:
nw
usage near ChunkDB creation -- it seems like it should be done elsewhere, I'm not quite sure why we're pushing out things there.The latest working version of this (and the one which I think we should ultimately continue this based on) is posted on a fresh repo I made specifically for testing out typing the expression language.
Thanks. Is there an 'action item' for me at this point?
I'm not too sure there's much actionable for anyone on this ticket at the moment. The Haskell prototype worked well when I was using it. This ticket is currently blocked by, in some capacity, resolving #2911.
UID collisions would occur if we were to merge the maps today. I started changing some UIDs (too early!) in #2904, but we then moved it back into discussion in #2911, which is where we're currently blocked. So, I think #2911 is the one we need to give our attention to the most.
Before code gets lost, I posted the latest copy of the "merged ChunkDBs" to an external repository: https://github.com/balacij/ProtoChunkDB
There is a branch where I started the conversion of our ChunkDBs into the new ChunkDBs, but the progress is largely halted due to issues with UID "conflicts" (continued in #2911).
One future work idea: "chunk holes." Think about "type-hole programming" but with the data entered into chunks. These are chunks that we might always prefer to build "later on" in the development of a case study (e.g., I think we did this with ODEInfo for ODEs before @cd155 came along?) or for creating placeholders during the development of things, without creating temporary/mock data.
Background
Presently,
ChunkDB
s are built as an enumeration of "chunk" maps; https://github.com/JacquesCarette/Drasil/blob/10f9f9ed0e332c7a31875f6e035db96549c45945/code/drasil-database/lib/Database/Drasil/ChunkDB.hs#L165-L183Problem
i. We are unable to add (almost) arbitrary data into the ChunkDBs for various different systems. ii. We manually track chunks in lists (by a single type, for later further processing) in
SystemInformation
when we already have them registered in the ChunkDB. https://github.com/JacquesCarette/Drasil/blob/10f9f9ed0e332c7a31875f6e035db96549c45945/code/drasil-database/lib/Database/Drasil/SystemInformation.hs#L45-L53 iii. Adding arbitrary data to a ChunkDB is impossible without editing coredrasil-database
files, and difficult when editing coredrasil-database
files because it affects compilation of other examples. iv. Working with similar maps requires redundant manual recreation (swhs and nopcm has a bit of this). v. We have occurrences of conflicting UIDs because they are registered in different maps.Potential Solution : Merging Maps into a single one
I think we can solve these 5 or so problems by merging our maps into a single Map, and using more of the core
Data.Map
functionality.We will need to hide type information to create a homogeneous Chunk type, and then merge our maps into a single Chunk Map.
The solution would, approximately, be as follows: (1) Assume we have the following UID type, and typeclasses:
(2) We build a
Chunk
type intended to carry arbitrary data in our finalChunkDB
:Here, we build a commonality between all chunks registered in Drasil (e.g., they ALL must have a UID [a design decision], they all must be Typeable [Thanks to GHC 8.2+, this is the new 'normal'], and they all must be
DrasilDumpable
[not really required, but I think this would help in debugging -- this could also be something that we mass-dump into a single JSON file for our stable examples to show all raw knowledge we use to build up softifacts]). Additionally, we discard the real type information because the Typeable allows us to retrieve it later, and it will be helpful in creating a homogeneous map type.(3) Thanks to capturing all of our chunks in a single Chunk type, we can construct a single homogeneous
ChunkDB
type:(4) We build a few functions to help with registering chunks into the ChunkDB:
(5) When we want to pull arbitrary lists of chunks from the db by their
TypeRep
s, we can via casting:NOTE: There is one problem here (which I go into small discussion into in my prototype), but it's not too important here. Essentially, the
if typeOf c == tr then cast c else Nothing
seems to be too strict of a requirement, and we might want to lessen to just acast c
(in which case, we can get rid of the then redundantTypeRep
parameter), but I'm not entirely certain yet (I will have to do more research on this).Of course, whenever we want to do this, it will be rather inefficient, so we can create a cache: (6) A Map of ChunkDBs by their TypeReps:
This still might not be the right solution for us because it constructs the
ChunksByTypeRepMap
after having fully constructed a singleChunkDB
, but we can easily create an alternative solution which registers chunks in both maps when registering them.Does it solve all noted problems?
i. We are unable to add (almost) arbitrary data into the ChunkDBs for various different systems. Yes, so as long all all relevant data is typeable, has a UID, and (optionally) can be dumped to a String (
DrasilDumpable
).ii. We manually track chunks in lists (by a single type, for later further processing) in
SystemInformation
when we already have them in nicely placed maps. Yes, we would have additional functionality for grabbing all chunks in a chunkdb by a particular TypeRep. However, we lose out on the ability of polymorphic types with type constraint restrictions, but it seems that we stick to a small group of types anyway, so this ability may not be needed.iii. Adding allowed datatypes to a ChunkDB is impossible without editing core
drasil-database
files, and difficult when editing coredrasil-database
files because it affects compilation of other examples. Yes, we will be solving both problems here because the data registerable is nearly arbitrary. Through this,ChunkDB
should become mostly stable and unchanging.iv. Working with similar maps requires redundant manual recreation (swhs and nopcm has a bit of this). Yes, since
ChunkDB
is now a single type, we will be able to useData.Map
's abilities to allow for easier merging of ChunkDBs and such.v. We have occurrences of conflicting UIDs because they are registered in different maps. Yes. By merging in a single map, we ensure uniqueness is upheld in UIDs.
In general, I think this would decrease LOC while allowing us to generally get more out of our ChunkDBs.
Notable issues
I can only think of 2 issues, but I'm unsure of how impactful they are.
Data.Typeable
. It appears that this library is largely unstable pre-GHC-8.2, and previously was designed differently. Additionally, since we are discarding type information and later re-gaining it, we will have some amount of runtime impact, but I'm unsure of how much.makeLenses
which hides the difficulty associated with creating TypeReps and casting chunks/lists of chunks.I'm posting this ticket earlier rather than later because I think it's a good proof of concept, but it still has problems. I don't think the problems are too 'breaking', but it would be best to discuss the base before I spend more time on this design.
Prototype
I've made, and, lightly, tested this in my mini-development repo: https://github.com/balacij/ExprTests/blob/f7321d55611e7721a339188c79b66a3c6258ee8c/src/Lang/ChunkDB.hs
If you'd like to open up this code locally, I've currently setup
stack run
to run through a few small tests with this ChunkDB prototype.