JacquesCarette / Drasil

Generate all the things (focusing on research software)
https://jacquescarette.github.io/Drasil
BSD 2-Clause "Simplified" License
136 stars 25 forks source link

Merging chunk maps into a single map via `Data.Typeable` #2873

Open balacij opened 2 years ago

balacij commented 2 years ago

Background

Presently, ChunkDBs are built as an enumeration of "chunk" maps; https://github.com/JacquesCarette/Drasil/blob/10f9f9ed0e332c7a31875f6e035db96549c45945/code/drasil-database/lib/Database/Drasil/ChunkDB.hs#L165-L183

Problem

i. We are unable to add (almost) arbitrary data into the ChunkDBs for various different systems. ii. We manually track chunks in lists (by a single type, for later further processing) in SystemInformation when we already have them registered in the ChunkDB. https://github.com/JacquesCarette/Drasil/blob/10f9f9ed0e332c7a31875f6e035db96549c45945/code/drasil-database/lib/Database/Drasil/SystemInformation.hs#L45-L53 iii. Adding arbitrary data to a ChunkDB is impossible without editing core drasil-database files, and difficult when editing core drasil-database files because it affects compilation of other examples. iv. Working with similar maps requires redundant manual recreation (swhs and nopcm has a bit of this). v. We have occurrences of conflicting UIDs because they are registered in different maps.

Potential Solution : Merging Maps into a single one

I think we can solve these 5 or so problems by merging our maps into a single Map, and using more of the core Data.Map functionality.

We will need to hide type information to create a homogeneous Chunk type, and then merge our maps into a single Chunk Map.

The solution would, approximately, be as follows: (1) Assume we have the following UID type, and typeclasses:

type UID = String

-- | An interface to get the UID of a chunk.
class HasUID a where
    uid :: a -> UID

-- | Dump data to Strings for debugging-purposes.
class DrasilDumpable a where
    dump :: a -> String

(2) We build a Chunk type intended to carry arbitrary data in our final ChunkDB:

import Data.Typeable -- this is an important design decision (as opposed to Type.Reflection) for simplicity
...
data Chunk where  -- Personally, I find GADTs to be more readable, but this can be done without them via ExistentialQuantification
    CHUNK :: (HasUID t, DrasilDumpable t, Typeable t) => t -> Chunk

instance HasUID         Chunk where uid  (CHUNK t) = uid t
instance DrasilDumpable Chunk where dump (CHUNK t) = dump t
...

Here, we build a commonality between all chunks registered in Drasil (e.g., they ALL must have a UID [a design decision], they all must be Typeable [Thanks to GHC 8.2+, this is the new 'normal'], and they all must be DrasilDumpable [not really required, but I think this would help in debugging -- this could also be something that we mass-dump into a single JSON file for our stable examples to show all raw knowledge we use to build up softifacts]). Additionally, we discard the real type information because the Typeable allows us to retrieve it later, and it will be helpful in creating a homogeneous map type.

(3) Thanks to capturing all of our chunks in a single Chunk type, we can construct a single homogeneous ChunkDB type:

type ChunkDB = M.Map UID Chunk

(4) We build a few functions to help with registering chunks into the ChunkDB:

-- Registering a list of like-typed chunks
chunksToChunkDB :: (HasUID t, Typeable t, DrasilDumpable t) => [t] -> ChunkDB
chunksToChunkDB = M.fromList . map (\x -> (uid x, CHUNK x))

-- Registering
registerChunk :: (HasUID t, Typeable t, DrasilDumpable t) => t -> ChunkDB -> ChunkDB
registerChunk c = M.insert (uid c) (CHUNK c)

-- Retrieval (I prefer using `Maybe` for prototyping instead of forced errors, but this would likely be a strict value or an error in practice)
retrieveChunk :: Typeable a => ChunkDB -> UID -> Maybe a
retrieveChunk cdb u = do
    (CHUNK r) <- M.lookup u cdb
    cast r

(5) When we want to pull arbitrary lists of chunks from the db by their TypeReps, we can via casting:

retrieveChunksByType :: (HasUID t, DrasilDumpable t, Typeable t) => ChunkDB -> TypeRep -> [t]
retrieveChunksByType cdb tr = relevantChunks
    where
        allChunks = M.toList cdb
        relevantChunks = mapMaybe (\(_, CHUNK c) -> if typeOf c == tr then cast c else Nothing) allChunks

NOTE: There is one problem here (which I go into small discussion into in my prototype), but it's not too important here. Essentially, the if typeOf c == tr then cast c else Nothing seems to be too strict of a requirement, and we might want to lessen to just a cast c (in which case, we can get rid of the then redundant TypeRep parameter), but I'm not entirely certain yet (I will have to do more research on this).

Of course, whenever we want to do this, it will be rather inefficient, so we can create a cache: (6) A Map of ChunkDBs by their TypeReps:

type ChunksByTypeRep = M.Map TypeRep ChunkDB

dumpChunkDBToTypeRepMap :: ChunkDB -> ChunksByTypeRep
dumpChunkDBToTypeRepMap cdb = allRegistered
    where
        -- gather a list of registered types
        knownChunkTypes :: [TypeRep]
        knownChunkTypes = nub . map (typeOf . snd) $ M.toList cdb

        -- gather types with a list of all registered chunks that are of that type
        chunksWithTypes :: [(TypeRep, [Chunk])]
        chunksWithTypes = map (\x -> (,) x $ retrieveChunksByTypeInChunkBox cdb x) knownChunkTypes

        -- reconcile everything found, into a single ChunksByTypeRep map
        allRegistered :: ChunksByTypeRep
        allRegistered = M.fromList $ map (second chunksToChunkDB) chunksWithTypes

This still might not be the right solution for us because it constructs the ChunksByTypeRepMap after having fully constructed a single ChunkDB, but we can easily create an alternative solution which registers chunks in both maps when registering them.

type ChunkDB' = (ChunkDB, ChunksByTypeRep)

-- There is one small problem here w.r.t uniqueness of UIDs and ensuring that the TypeRepMap is an accurate `view` of the `ChunkDB`
-- To fix it, we would just need to add an extra lookup to the initial chunkDB, removing it from the TypeRepMap manually before re-adding it to both maps (not a large problem, but I noticed it while typing up this ticket)
registerChunk' :: (HasUID t, Typeable t, DrasilDumpable t) => t -> ChunkDB' -> ChunkDB'
registerChunk' c (cdb, trcdb) = (cdb', trcdb')
    where
        u      = uid c
        chk    = CHUNK c
        cdb'   = M.insert u chk cdb
        trcdb' = M.alter
                    (maybe
                        (Just $ M.singleton u chk)
                        (Just . M.insert u chk))
                    (typeOf c)
                    trcdb

retrieveChunk' :: Typeable a => ChunkDB' -> UID -> Maybe a
retrieveChunk' (cdb, _) u = do
    (CHUNK r) <- M.lookup u cdb
    cast r

retrieveChunksByType' :: Typeable a => ChunkDB' -> TypeRep -> [a]
retrieveChunksByType' (_, trcdb) tr = maybe [] (mapMaybe ((\(CHUNK c) -> cast c) . snd) . M.toList) (M.lookup tr trcdb)

Does it solve all noted problems?

i. We are unable to add (almost) arbitrary data into the ChunkDBs for various different systems. Yes, so as long all all relevant data is typeable, has a UID, and (optionally) can be dumped to a String (DrasilDumpable).

ii. We manually track chunks in lists (by a single type, for later further processing) in SystemInformation when we already have them in nicely placed maps. Yes, we would have additional functionality for grabbing all chunks in a chunkdb by a particular TypeRep. However, we lose out on the ability of polymorphic types with type constraint restrictions, but it seems that we stick to a small group of types anyway, so this ability may not be needed.

iii. Adding allowed datatypes to a ChunkDB is impossible without editing core drasil-database files, and difficult when editing core drasil-database files because it affects compilation of other examples. Yes, we will be solving both problems here because the data registerable is nearly arbitrary. Through this, ChunkDB should become mostly stable and unchanging.

iv. Working with similar maps requires redundant manual recreation (swhs and nopcm has a bit of this). Yes, since ChunkDB is now a single type, we will be able to use Data.Map's abilities to allow for easier merging of ChunkDBs and such.

v. We have occurrences of conflicting UIDs because they are registered in different maps. Yes. By merging in a single map, we ensure uniqueness is upheld in UIDs.

In general, I think this would decrease LOC while allowing us to generally get more out of our ChunkDBs.

Notable issues

I can only think of 2 issues, but I'm unsure of how impactful they are.

  1. We're using Data.Typeable. It appears that this library is largely unstable pre-GHC-8.2, and previously was designed differently. Additionally, since we are discarding type information and later re-gaining it, we will have some amount of runtime impact, but I'm unsure of how much.
  2. Retrieving chunks en masse by a single TypeRep is tricky, but I think we can sufficiently manage it. In (5), I briefly mentioned this. However, with a bit of hard work, we can mitigate the overhead -- we should be able to create functions similar to makeLenses which hides the difficulty associated with creating TypeReps and casting chunks/lists of chunks.

I'm posting this ticket earlier rather than later because I think it's a good proof of concept, but it still has problems. I don't think the problems are too 'breaking', but it would be best to discuss the base before I spend more time on this design.

Prototype

I've made, and, lightly, tested this in my mini-development repo: https://github.com/balacij/ExprTests/blob/f7321d55611e7721a339188c79b66a3c6258ee8c/src/Lang/ChunkDB.hs

If you'd like to open up this code locally, I've currently setup stack run to run through a few small tests with this ChunkDB prototype.

balacij commented 2 years ago

I should also mention that the above code is just a prototype, I think we can refine it a bit more. In particular, I think we should be using newtypes for the ChunkDBs, and creating errors (either as an Either X String, or an actual plain error) where appropriate.

JacquesCarette commented 2 years ago

I think the technical ideas on the database side are sound. I'm going to want to work on the names of the functions provided by this, but that's not too hard. I am fine with the use of TypeRep and Typeable.

The thing that I am most curious about is how this will change (or enable change) in user code, as well as in data-drasil. My feeling is that if we adopt a monadic style (specifically: using a State monad with a ChunkDB state) for 'registering', the resulting code will likely be fine. This 'registration' would be what replaces the explicit gathering into lists that we currently have. This is an important "second half" of doing this!

balacij commented 2 years ago

I think the technical ideas on the database side are sound. I'm going to want to work on the names of the functions provided by this, but that's not too hard. I am fine with the use of TypeRep and Typeable.

This sounds great!

The thing that I am most curious about is how this will change (or enable change) in user code, as well as in data-drasil. My feeling is that if we adopt a monadic style (specifically: using a State monad with a ChunkDB state) for 'registering', the resulting code will likely be fine. This 'registration' would be what replaces the explicit gathering into lists that we currently have. This is an important "second half" of doing this!

I like the idea of using State for 'registering'. I will need to spend a bit of time on this.

I also briefly considered how this would impact explicit need for gathering chunks into lists and then 'registering' them, but I was left a bit confused.

My first thought on how this "should be done" is: we wouldn't have any "collection" of chunks from within the modules they're defined. Instead, in the main executable program, we would perform some sort of reflection on known modules from the lib folders and packages, and gather all ones in their scopes.

In Java, I would likely either use Java's Instrumentation API (Agents) to find the related modules at the "class loading time" (I'm not too sure what to call this; pre-runtime-loading?), or I would use Java's reflection lib (in the early stage of my programs runtime) to search for '@' Annotated definitions with a specific annotation class, and annotations that ensures the definition is kept for runtime usage & analysis (@Retention(RetentionPolicy.RUNTIME), and @Target(ElementType.TYPE)).

I suppose the equivalent for the Instrumentation API could be a GHC plugin. Haskell has it's own reflection (and template haskell), which, I think would have (at least) the same functionality as Java's.

Do you have any thoughts on how this second half should be done?

With the first method, we might end up with less "boilerplate annotation" code placed in each "data" module.

JacquesCarette commented 2 years ago

We really don't want to do reflection in Haskell. A lot of this stuff in Java is really meta-linguistic (i.e. outside the actual language). I want to stay within Haskell as much as possible, and within well-understood Haskell too.

So we need to think about the process of registration. Doing inspection on an SRS to find used stuff is good. But we also need to "declare" which things from the standard databases (i.e. data-drasil) we're going to use. I don't want that part to be done via deep black magic.

balacij commented 2 years ago

We really don't want to do reflection in Haskell. A lot of this stuff in Java is really meta-linguistic (i.e. outside the actual language). I want to stay within Haskell as much as possible, and within well-understood Haskell too. So we need to think about the process of registration. Doing inspection on an SRS to find used stuff is good. But we also need to "declare" which things from the standard databases (i.e. data-drasil) we're going to use. I don't want that part to be done via deep black magic.

Ah, ok. With your example, this makes a lot more sense to me now, thank you. I was thinking of a completely different direction!

Doing inspection on an SRS to find used stuff is good.

This sounds good to me.

But we also need to "declare" which things from the standard databases (i.e. data-drasil) we're going to use. I don't want that part to be done via deep black magic.

I wonder if we could be placing ChunkDBs inside ChunkDBs to indicate an imported "bundle of knowledge". Though, since UIDs for various chunks would be local to the ChunkDB they're placed in, we would need a new mechanism for UID generation/assignment. Manually written strings wouldn't quite cut it anymore because those strings would be local to the specific ChunkDBs they're intended to be placed in. I'm not quite sure how to resolve this yet (vaguely, it seems like we need to give ChunkDB context to chunk definitions when creating them). The sound of placing ChunkDBs within ChunkDBs sounds like a neat way of importing knowledge bundles however!

EDIT: With ChunkDBs placed inside ChunkDBs, we might only need to collect chunks into their databases once, and then when referencing chunks from other ChunkDBs, we would import the whole ChunkDB they originated from (in the system's ChunkDB), and then the chunk (from whatever chunk that relies on it). However, I would hope that references wouldn't require us to also manually note from which ChunkDB the chunk originates from. Through 'tree shaking' (haha), we can also get rid of all unused chunks from the imported ChunkDBs so that they don't impact SRS documents and code too much.

ASIDE: Would we also want to create an idea of "foreground" vs "background" knowledge within a system? This could likely be used by printers to know whether a piece of knowledge is relevant enough to be printed in it's printing context.

balacij commented 2 years ago

I think this ticket is also loosely relevant to #2195, specifically in understanding the role of SystemInformation.

JacquesCarette commented 2 years ago

I don't necessarily want to put databases inside databases, but I think that creating a "current database for this example" from a collection of other databases, that's definitely a good idea. Which is roughly what you arrived at in your EDIT.

As far as UIDs are concerned, we can learn from URIs there, and use "qualified" names as a means to indicate provenance.

On foreground vs background: yes, but. What I mean is that I really want creators of an artifact to have to think about what they want in their artifact, then make an active choice. Now, when some sets of choice emerge as common, we can definitely create short-hands that automate it. [In many ways, a lot of doc-lang arose from such a process.]

And yes, that other ticket is relevant. The point is that we need to better understand the development process that we want for using Drasil. Because we've been so deep in prototype mode, we were mostly just doing things. Now is a good time to do more design. This is in part what the white paper is a prelude for.

The big question that underlies #2195 is just that: what's the end-to-end development process of a Drasil example? I mean, the process that we envision people should be using, when Drasil is able to support it.

balacij commented 2 years ago

I don't necessarily want to put databases inside databases, but I think that creating a "current database for this example" from a collection of other databases, that's definitely a good idea. Which is roughly what you arrived at in your EDIT. As far as UIDs are concerned, we can learn from URIs there, and use "qualified" names as a means to indicate provenance.

The main thing that I like about putting databases inside databases is that we get unique UIDs for cheap/free. However, it's quite complex in practice (duplicate and cyclical dependencies, & renaming UIDs from 'import's to be adjusted to it's new parent database), I haven't been able to come up with a design yet, and we'd likely need to be as careful as, if not more, general programming languages (the ones with support for modules & packages) in handling 'imports'. However, I might be thinking about this differently. Out of curiosity, do you think this is the main problem with databases in databases, or do you see a greater problem (or just other problems, I might just not be understanding how to design this at the moment)?

I think using qualified UIDs will mostly simplify the development. However, if we have conflicting UIDs for whatever reason, we might even see some Lens 'setter' usage to import specific modules with different UIDs to not cause breakage. Though, this might cause breakage in references from other imported chunks which depend on a re-UIDd chunk. I'm sure this will very rarely, if ever, occur, but, to mitigate it, it might be good for us to first build out the "concept domain" system you mentioned (to both analyze where breakage would occur, and fix it).

On foreground vs background: yes, but. What I mean is that I really want creators of an artifact to have to think about what they want in their artifact, then make an active choice. Now, when some sets of choice emerge as common, we can definitely create short-hands that automate it. [In many ways, a lot of doc-lang arose from such a process.]

Yes, I agree. I initially thought of coupling this kind of information with CHUNKs, but I quickly realized that it would be too specific to a specific definition of background and foreground (which might be different in each printer!). Thank you.

And yes, that other ticket is relevant. The point is that we need to better understand the development process that we want for using Drasil. Because we've been so deep in prototype mode, we were mostly just doing things. Now is a good time to do more design. This is in part what the white paper is a prelude for. The big question that underlies #2195 is just that: what's the end-to-end development process of a Drasil example? I mean, the process that we envision people should be using, when Drasil is able to support it.

I see. I'll need to think about this for a while. Thank you :smile:

balacij commented 2 years ago

I've just now realized that I completely neglected to mention why I looked into this. I've been looking into this because I need to be able to place arbitrarily parameterized types inside ChunkDBs and, potentially, SystemInformations. I will need to keep prototyping to see the extents of what will be possible, and how. However, I am quite happy with this progress for now. I still need to figure out how the State will work, but the rest mostly seems to make sense at least.

balacij commented 2 years ago

The thing that I am most curious about is how this will change (or enable change) in user code, as well as in data-drasil. My feeling is that if we adopt a monadic style (specifically: using a State monad with a ChunkDB state) for 'registering', the resulting code will likely be fine. This 'registration' would be what replaces the explicit gathering into lists that we currently have. This is an important "second half" of doing this!

I don't think I fully understand how/why we wouldn't need to gather into explicit lists anymore. Do you happen to have any examples of this?

JacquesCarette commented 2 years ago
register :: (HasUID c, TypeRep c) => c -> ChunkDB -> ((), ChunkDB)

can be seen as

do db <- get
     let db' = add c db
     put db'

aka

modify (add c)

So we could gather pieces of ChunkDB instead. And wouldn't have to worry about types (i.e. homogeneous lists).

balacij commented 2 years ago

Ah, ok. Thank you.

Would we then have 1 large [Chunk] from which we use to insert all the chunks, or would we have 1 area of code that just has modifys called one after the other? Sorry, I'm still a bit confused as to how we can avoid the "explicit gathering".

JacquesCarette commented 2 years ago

There will be explicit gathering in drasil-data. There can be implicit gathering from SRS specifications.

Also, we don't want to call modify directly, we want to have our own functions (that may be thin wrappers).

balacij commented 2 years ago

Do you have a design in mind for the functions? If so, would it be easiest to do this in my first pass at this ticket?

As of right now, for a first pass at this ticket, I'm assuming we're going to have a bunch of Chunk lists to replace the existing explicit lists.

Aside: I am working on this on the newChunkDB branch.

JacquesCarette commented 2 years ago

No, I don't have such a design. I don't think it would be easiest to do it at the same time, as it's going to have serious repercussions for the examples.

balacij commented 2 years ago

Thank you. In that case, I will come back to it after a first pass.

Aside: While working on this, I realized that ChunkDB -> ChunkDB transformers/printers sound interesting. We seem to have many combinators for things that perform a small change and give an updated variant of their input (e.g., Sentences, NPs, eventually Theories, etc). This type of transformer/printer is, approximately (or at least has a generally similar idea as), an example of the "database-level" version of the "composable printers" I was discussing in (1) of "Slightly deeper, but still fairly shallow, observations, and discussion" of #2883. I imagine we could, at least, systematically derive and add more chunks here.

balacij commented 2 years ago

union is a fun example of a ChunkDB -> ChunkDB. Import other ChunkDBs!

... though, we would need a unionWith under-the-hood to uphold UID uniqueness :smile:

balacij commented 2 years ago

@JacquesCarette Do you know if there's any easy ways to debug <<loop>> errors?

The existing code on newChunkDB has some of the case studies giving me <> errors.

JacquesCarette commented 2 years ago

Well, <<loop>> typically occurs when you have a definition in terms of itself, in ways that require information that is not available yet.

balacij commented 2 years ago

Thank you. I think I will need to break this up a bit so that I can tackle the problems a bit more gradually. Something's going awry, and I'm struggling to find just where exactly it is.

Approximately, the next few steps are:

  1. Replace all Chunk UID inheritances via "^. uid" and replace them with altered versions.
  2. Understand what's going on with the high amount of nw usage near ChunkDB creation -- it seems like it should be done elsewhere, I'm not quite sure why we're pushing out things there.
  3. Understand why we have manually set, hardcoded, and empty UIDs
  4. Understand Trace vs Reference (vs Reference [e.g., Document Reference]), the comments are a bit difficult for me to read, but it seems like this system is built as a chunk where it's own UID is also used as the target object referenced UID. This would mean that if (1) goes well, this would still be a persistent problem. It would need to have a "target" UID stored as well.
balacij commented 2 years ago

The latest working version of this (and the one which I think we should ultimately continue this based on) is posted on a fresh repo I made specifically for testing out typing the expression language.

JacquesCarette commented 2 years ago

Thanks. Is there an 'action item' for me at this point?

balacij commented 2 years ago

I'm not too sure there's much actionable for anyone on this ticket at the moment. The Haskell prototype worked well when I was using it. This ticket is currently blocked by, in some capacity, resolving #2911.

UID collisions would occur if we were to merge the maps today. I started changing some UIDs (too early!) in #2904, but we then moved it back into discussion in #2911, which is where we're currently blocked. So, I think #2911 is the one we need to give our attention to the most.

balacij commented 1 year ago

Before code gets lost, I posted the latest copy of the "merged ChunkDBs" to an external repository: https://github.com/balacij/ProtoChunkDB

There is a branch where I started the conversion of our ChunkDBs into the new ChunkDBs, but the progress is largely halted due to issues with UID "conflicts" (continued in #2911).

balacij commented 1 year ago

One future work idea: "chunk holes." Think about "type-hole programming" but with the data entered into chunks. These are chunks that we might always prefer to build "later on" in the development of a case study (e.g., I think we did this with ODEInfo for ODEs before @cd155 came along?) or for creating placeholders during the development of things, without creating temporary/mock data.