Write up design ideas and specs

fingolfin commented 9 years ago

We discussed various plans for this package in Trondheim. I have notes on them, and still should clean them up and put them somewhere (probably in the Wiki of this GitHub project).

I am filing this issue, to make sure I don't forget.

markuspf commented 9 years ago

Could you by any chance get our notes up somehow? I'd like to push this package and get a usable version by the end of the next meeting.

fingolfin commented 9 years ago

Sure, and sorry for the delay :-(. Anyway, I wanted to clean this up, but it is not clear when I can dig myself out of teaching and other obligations enough. I thought I would find time for it today, but it again did not work out. Ah well. :-(.

So I'll just put my raw, unedited, horrible notes here. This also contains some other TODO notes, I'll just dump it all here.

* add benchmarks for various implementation of e.g. prio queues

* use autodoc, at least for the doc skeletons

* add an ordered dictionary, these are very useful for e.g. orbit computations.
  see also
    http://morepypy.blogspot.de/2015/01/faster-more-memory-efficient-and-more.html
  (in fact, I begun work on implementing that)

Deque
    pop_{back,front}    
    push_{back,front}
    {back,front}

Heap

   but what about "IncrementPriority" -> not for PQ, only Heap
   but how to implement in reality, so that it is both usable and efficient?
   what are use cases?
      -> perhaps Dijkstra's shortest path algo

    for this purpose, have a "MutableHeap", with ideas based on
    http://theboostcpplibraries.com/boost.heap:
    for these, Add returns a "handle" which can be used to later
    on refer to that particular object, and to modify its key

HashTable
    Add(ht, key, value),  \[\]\:\=
        put value in under this key, regardless of whether it is here
    Update(ht, key, value)
         -> update value, but throw error if key is not actually present
    AddUnlessAlreadyThere;
        this would be the third, and throw an error if the key is already prsent.
       we can't think of a good name, though. but perhaps nobod needs it,
       so let's leave it out for now

    LookupWithDefault( ht, key, defaultValue )

    Lookup(ht, key),  \[\] 
        should be an error for consistency for arrays:
            arr[5] gives an error if arr[5] is not bound
        thus
            hash["key"]  should return an error if "key" is not in the hash.

        if you really want it to return fail, you can do this via
            LookupWithDefault( dict, fail )

    count / find???  
     maybe 

    hasKey / contains  -> \in    
        but

  FUTURE: Syntax extension for GAP to create HASH tables easily, e.g.
    {  key => value, ... }
  or  
    {  key : value, ... }
  or whatever...

Stack
    push / pop  vs.  Add / Remove
    IsEmpty
    top / peek
    Size

Queue
    push / pop  vs.  Add / Remove
    IsEmpty
    front / peek   vs.   back / top
    Size

    AddFirst / RemoveFirst / First   ->   turn First into an operation
    AddLast / RemoveLast / Last are aliases for:
    Add / Remove / ????

    add Last / LastOp to GAP  !!

LinkedList ????
  maybe a deque implementation could be based on that...
  or on something more advance even...

OrderedSet
UnorderedSet
    TODO
    Q: Do we really need that, in addition to the Maps? They could in theory
       be slightly more efficient

OrderedMap / Dict / Dictionary (TAKEN) / AssociativeArray / Map (ambiguous?)
UnorderedMap / Dict
    Q: do we really need both ordered and unordered? In PyPy, they recently
    decided to always use an OrderedMap, because it ended up being always
    faster for them anyway.
    http://morepypy.blogspot.de/2015/01/faster-more-memory-efficient-and-more.html

x := UnorderedMap();
x := UnorderedMap(IsString);        # hint about key
x := UnorderedMap("key");        # hint about key

  might return an AVLTree, or a HashTable, or...

PriorityQueue
   might return a binary heap, or something else

   use \< to compare elements, or a custom isLess function
        alternatively, use a SortBy-like approach, with a function
        that takes an entry, and maps it to another, and then compares THESE
        with <

   Add  
   Remove  -> get element with highest priority acc

   Peek / PeekMax / or just use "Maximum"  (then Maximum needs to become an operation

   but what about "IncrementPriority" -> not for PQ, only Heap

----
iteration interfaces for all of these...

AsList
AsSet

------

Methods:

- Add / Remove for everything?

- "Peek" for everything????

- IsEmpty  for everything?

- Size?

- Capacity?

- Clear / Reset  to clear arbitrary of our contains

- Resize / Reserve  -- to reserve storage 

-----

obsolete data structure implementations in the GAP kernel we should get rid of:

- lierep.gi / .gd:  IsSearchTable
- NewDictionary

=======

provide a

  WeakKeyDict

look at

http://docs.julialang.org/en/release-0.4/stdlib/collections/

One last remark: I wanted to turn First into an operation so we could use it, but that PR is stalled and will not go in for now. So we may need a new plan. Or perhaps just get in a smaller change, which modifies the First global function to dispatch to an operation in the single arg case?

stevelinton commented 7 years ago

@fingolfin mentioned yesterday the idea of an hashmap which also preserves the order in which elements were added. As I understood it this essentially a PLIST to which new elements were added at the end, combined with a hashtable storing indices into the PLIST.

A few thoughts about this:

The way to present this at GAP level is probably as a PLIST with Add but no assignment and a super-fast Position method (and consequently also a \in method). That very neatly meets the needs of orbit algorithms.
If you don't plan to delete much then this approach may always be correct. In the hash table you store the index of the key-value pair (if you have values) and some bits of the hash value of the key. Depending on the size of the hash table, you could get all of that into 32 or 64 bits per entry for all but the most enormous tables. With linear probing and Robin Hood hashing, for instance, you will basically need just one cache line from the hash table plus one from the PLIST plus whatever you have to do to compare entries for almost all lookups.
Making one of these would replace the common idiom of sorting a PLIST when you have finished making it and before you start doing a lot of Position or \in tests on it. Provided you can find a hash function, this is strictly better.

fingolfin commented 7 years ago

@stevelinton The idea I described comes from here https://morepypy.blogspot.co.uk/2015/01/faster-more-memory-efficient-and-more.html

I also created a separate issue to keep track of ideas and plans for the hash map/set/table stuff, you may want to move your comments there, so we can find them again more easily later on: https://github.com/gap-packages/datastructures/issues/30

gap-packages / datastructures

Write up design ideas and specs #10