Khalian / Modulo12

Sql Query Engine for Midi files
Apache License 2.0
53 stars 3 forks source link

Setup a bufferpool/cache #5

Open Khalian opened 3 years ago

Khalian commented 3 years ago

Needs some design work too.

It basically needs to satisfy these invariants

  1. A song on disk must have at most one copy in memory (this is due to an invariant that Modulo12 is a Read Only query language.

  2. A songs under a directory must be in the cache post a query's execution. On the first time it runs, the cache is created.

  3. The songs under sub directory must be addressible on their own

To illustrate points two and three, consider this hierarchy

Songs (Directory)
    - Led Zep (Directory)
    -     Stairway to heaven, Rain Song, Over the hills and far away
    - Greta Van Fleet
    -     Highway tune, Edge of Darkness, When the curtain falls

Cache(Songs) -> All the above songs Cache(Led Zep) -> { Stairway to heaven, Rain Song, Over the hills and far away } Cache(Greta Van Fleet) -> { Highway tune, Edge of Darkness, When the curtain falls }

  1. Cache must be invalidated on size and eviction on recency (least recently used, or least frequently used?). May consider TTL. Ideally these params should be configurable in some kind of spec file.
Khalian commented 3 years ago

Some initial thoughts

We need a tree data structure here, since we have to deal with hierarchies, consider this ADT

sealed Trait DirectoryTree
object DirectoryNode {
     case class SongFile(song: Song) extends DirectoryTree
     case class DirectoryNode(folderPath: String, filesUnder: List[Directory]) extends DirectoryTree
}

This will work for the basic cases. Imagine we have this folder hierarchy

Songs (Directory)
    - Led Zep (Directory)
    -     Stairway to heaven, Rain Song, Over the hills and far away
    - Greta Van Fleet
    -     Highway tune, Edge of Darkness, When the curtain falls

Well the root node would be the songs as a diretory node, and then files under would be led zep, and greta van fleet directory node. for led zep, it would have 3 song files, and the same for greta van fleet.

There are multiple things to be very careful about

  1. If we have a directory above Songs, we will have to re parent the entire tree.
  2. If we query a completely different directory that is not in the parent/child hierarchy of songs (before root), then we essentially end up with two such trees. We are looking at a forest.
  3. Files can be added AFTER we build the cache, so we can have a cache coherence issue. There are ways around it (like do a file system walk after some time T threshold despite having a cache, and load the songs that are new). Regardless this is a trade off that we need to think about.
  4. It would also help to keep it in a map (basically a linked hash map). So we can have O(1) look ups for songs in case our query engine needs it.