datalib / StatsCounter

Python's missing statistical Swiss Army knife
GNU General Public License v2.0

Overriding dict to sort tuples (when using tuples as keys) #5

Open rodricios opened 9 years ago

rodricios commented 9 years ago

The reason is simple. When counting the occurrences of some event, where the event can be a collection of things (think: what's the probability of "a" and "b", or "b" and "a", occurring together?), we want an occurrence of either ("a", "b") or ("b", "a") to count as the same event.

So instead of:

{ 
    ("a", "b"): 1, 
    ("b", "a"): 1
}

We have:

{
    ("a", "b"): 2
}

Here's one possible implementation:

class TransformedDict(StatsCounter):
    """A dictionary that applies an arbitrary key-altering
       function before accessing the keys
       http://stackoverflow.com/questions/3387691/python-how-to-perfectly-override-a-dict
       http://stackoverflow.com/questions/2060972/subclassing-python-dictionary-to-override-setitem
       """

    def __init__(self, *args, **kwargs):
        # dict is already initialized at this point; route the initial
        # contents through our update() so keys get transformed
        self.update(*args, **kwargs)

    def __getitem__(self, key):
        return super(TransformedDict, self).__getitem__(self.__keytransform__(key))

    def __setitem__(self, key, value):
        super(TransformedDict, self).__setitem__(self.__keytransform__(key), value)

    def __delitem__(self, key):
        super(TransformedDict, self).__delitem__(self.__keytransform__(key))

    def __keytransform__(self, key):
        return key

    def update(self, iterable=None, **kwds):
        '''
        Overriding update (adapted from http://code.activestate.com/recipes/576611/).
        One issue is that, for the two-element example below, we sort 4 times
        when we should only sort twice. See "BUG HERE" below.
        '''        
        if iterable is not None:
            if hasattr(iterable, 'items'):  # mapping
                if self:
                    self_get = self.get
                    for elem, count in iterable.items():
                        self[elem] = self_get(elem, 0) + count
                else:
                    dict.update(self, iterable) # fast path when counter is empty
            else:
                self_get = self.get
                for elem in iterable:
                    # BUG HERE
                    # Sort once here...
                    key = self.__keytransform__(elem)
                    # ...but the assignment goes through __setitem__,
                    # which sorts the already-sorted key a second time
                    self[key] = self_get(key, 0) + 1
        if kwds:
            self.update(kwds)

class OKDict(TransformedDict):
    """Ordered Key Dict"""
    def __keytransform__(self, key):
        # canonicalize tuple keys by sorting their elements
        if isinstance(key, tuple):
            return tuple(sorted(key))
        return key

>>> OKDict([('a','b'), ('b', 'a')])
OKDict({('a', 'b'): 2})
rodricios commented 9 years ago

cc: @eugene-eeo

eugene-eeo commented 9 years ago

Using sets may be a better idea.

rodricios commented 9 years ago

Sets aren't hashable, so they can't be dict keys. Besides, sets do not preserve multiple occurrences of an event. Think of trying to keep track of pairs of coin flips:

{ 
    ('heads', 'heads'): 1,
    ('tails', 'tails'): 1
}
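To make both points concrete, a quick REPL sketch (not from the thread):

>>> d = {}
>>> d[{'heads', 'tails'}] = 1
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'set'
>>> frozenset(('heads', 'heads'))  # hashable, but multiplicity is lost
frozenset({'heads'})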
eugene-eeo commented 9 years ago

Oh alright. I was about to suggest a frozenset, but meh. I guess it wouldn't be an issue though. Set comparison may be more expensive than, or just as expensive as, tuple comparison, the latter of which is O(n). My only worry is that it will degrade performance. Maybe we should make it another class? Like ChainDict?

rodricios commented 9 years ago

Hmm, I'm checking it now.

rodricios commented 9 years ago

Ok, so I checked out ChainDict, and it doesn't quite address what we're going for, but it does have some interesting ideas. Beyond that, I'm thinking we should make a class that combines elements of both defaultdict and Counter (or StatsCounter).
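For instance, a rough sketch of that composition using only the stdlib: a defaultdict whose default factory is Counter already gives self-initializing frequency distributions:

>>> from collections import defaultdict, Counter
>>> dists = defaultdict(Counter)
>>> dists['the']['tower'] += 1
>>> dists['the']
Counter({'tower': 1})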

The end goal is that, with this class, we can essentially create so-called "Markov Chains" (chains of conditional probability distributions) on the fly (i.e. by simply accessing the n-th nested key).

The best example I can think of is using it to create a "sentence generator" or "autocomplete":

>>> dists['the']
{
    'tower': 20, 
    'city': 10,
    ...
} 

Just to clarify, the above code is equivalent to the following statement: "tower" occurs 20 times after "the", "city" occurs 10 times after "the", etc.

The above code example can also be seen as a "2-state Markov Chain" (AKA a list of conditional frequency distributions).

Here's a 3-state Markov Chain:

>>> dists['the']['tower']
{
    'was': 8, 
    'fell': 4, 
    ...
}

Rule of thumb for crossing over from "n-state" Markov Chain (MC) jargon to an actual Python implementation: an n-state MC corresponds to a dict nested n - 1 levels deep.

So the above two examples are instances of nested dicts (chains of frequency distributions, Markov Chains) that have already been "trained" (as in, they have been populated with keys and values down to the n-th nesting).

"Training", ideally, should be easy to accomplish. I'll write more about this in a bit.

eugene-eeo commented 9 years ago

Okay, so I've read up a little about Markov chains, and not to nitpick, but aren't the values supposed to be probability distributions instead of frequencies? Nested states can be expressed nicely in directed graphs:

state1 -(0.5)-> state2 -(0.25)-> state3

I personally prefer graphs because:

  1. They tend to lend themselves to very elegant solutions, and very nice queries, e.g. "give me the most probable path". The engine will just have to hop across all the most probable (maximum) nodes.
  2. They can point to each other indefinitely and intuitively, though the same can also be argued about nesting, where you can nest indefinitely but it feels "hacky".
  3. I have worked with them before.

Ultimately it is a question of how heavy you want your library to be. If it is not desirable to include it in the StatsCounter "core" we can perhaps have a separate repository for this kind of thing.

eugene-eeo commented 9 years ago

Update:

class Chain(dict):
    def __init__(self, value, freqs=()):
        self.value = value          # payload carried by this node
        dict.__init__(self, freqs)  # outgoing edges: node -> probability

    def __hash__(self):
        # dicts aren't hashable by default; hash nodes by identity so
        # a Chain can be used as a key (edge) of another Chain
        return hash(id(self))

    def to(self, *nodes):
        # probability of walking the given path from this node
        acc = 1
        prev = self
        for item in nodes:
            prob = prev[item]
            acc *= prob
            prev = item
        return acc

Kind of elegant:

>>> g = Chain(1)
>>> h = Chain(2)
>>> g[h] = 0.5
>>> h[g] = 0.1
>>> g.to(h, g)
0.05

The select method I was speaking about:

def select(self, func, maximum=100):
    # Greedily walk the chain, letting func (e.g. max) pick among the
    # outgoing edge probabilities at each hop; stop at a dead end or
    # after `maximum` hops (which also guards against cycles).
    memo = [self]
    this = self
    while maximum:
        dist = {this[k]: k for k in this}  # probability -> node
        if not dist:
            break
        best = dist[func(dist)]
        memo.append(best)
        this = best
        maximum -= 1
    return memo
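For example (illustrative names, with select attached to Chain for the demo), walking an acyclic chain with max:

>>> Chain.select = select
>>> g, h, i = Chain(1), Chain(2), Chain(3)
>>> g[h] = 0.5
>>> g[i] = 0.2
>>> h[i] = 0.9
>>> [node.value for node in g.select(max)]
[1, 2, 3]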
rodricios commented 9 years ago

Okay so I've read up a little about Markov chains, and not to be nitpicking but aren't the values supposed to be probability distributions instead of frequencies

You are correct :) (frequency distributions are unnormalized probability distributions - I consider them to be practically equivalent)

Nested states can be expressed nicely in directed graphs:

   state1 -(0.5)-> state2 -(0.25)-> state3

Yes, I agree that they can be expressed nicely as graphs, but this is mostly true in a visual sense (as diagrams, not necessarily as data structures).

They tend to lend themselves to very elegant solutions, and very nice queries, e.g. "give me the most probable path". The engine will just have to hop across all the most probable (maximum) nodes.

"Most probable path" operation are a series of argmax calls, which StatsCounter allows for already.

Can point to each other indefinitely and intuitively- though the same also be argued about nesting where you can nest indefinitely but it feels "hacky".

Anything non-vanilla is essentially a hack, at first. Tables are, for whatever reason, the de facto representation of probability distributions (see pg. 26 of Koller's book). Implementing a graph representation of tables is a step away from an elegant solution IMO (also refer to Koller's book; she basically tries to present a graph representation of distributions, and well, it's a notoriously difficult book to read :/).

eugene-eeo commented 9 years ago

Agreed. Should we include the select method? It only makes sense to do so.

rodricios commented 9 years ago

Hmm, I believe so, but I would like us to implement select (or something similar) as part of the ordinary access-by-key usage.

For example, from the user's perspective, there should be no obvious difference between a double-nested prob. dist. (3-state MC) and a single-nested prob. dist. (2-state MC).

By that I mean this: if you trained a 3-state MC, you should still be able to use it as a 2-state MC:

>>> # 3-state
>>> dists['the']['tower']
{
    'was': 8, 
    'fell': 4, 
    ...
}
>>> 
>>> # 2-state
>>> dists['the']
{
    'tower': 20, 
    'city': 10,
    ...
} 

Now, the above requirement (I'm calling it a requirement to make this feature official; do you agree with that distinction?) can obviously be addressed by overriding __repr__ and __getitem__.
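As a rough sketch of the idea (the collapse helper is illustrative, not the final interface), a 2-state view can be recovered from a 3-state MC by summing each sub-distribution's counts:

from collections import Counter

def collapse(dist):
    # view a nested freq. dist. one level shallower by summing each
    # sub-distribution's counts into a single total
    flat = Counter()
    for key, value in dist.items():
        if isinstance(value, dict):
            flat[key] = sum(collapse(value).values())
        else:
            flat[key] = value
    return flat

>>> trigrams = {'tower': {'was': 8, 'fell': 4, 'stood': 8},
...             'city': {'was': 6, 'grew': 4}}
>>> collapse(trigrams)
Counter({'tower': 20, 'city': 10})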

The second feature I would like to include is the ability to transform a distribution using some querying language, or preferably a lambda expression.

I mean this:

>>> dists
{
    'of': 0.20, 
    'the': 0.50, 
    'that': 0.10, 
    'from': 0.20
}
>>> # Returns a new (binary?) probability distribution
>>> dists.transform(lambda word, prob: word.startswith('t'))
{
    True: 0.60, 
    False: 0.40
}

In essence, we can use this tool to construct more general probability distributions. The above allows us to answer questions like:

"Given Book X, or a freq./prob. distribution of the words occurring in Book X, what is the probability that a word picked at random from Book X will start with 't'?"

Here's a basic implementation of this:

# If there's a more appropriate word to describe what "transform" is doing,
# let me know (i.e. a term used in the industry)
def transform(dist, key):
    newdist = StatsCounter()

    # re-key every entry through `key`; entries that map to the same
    # new key have their values (counts/probabilities) summed
    for k, v in dist.items():
        newdist[key(k, v)] += v

    return newdist

>>> dist = StatsCounter({
    'of': 0.20, 
    'the': 0.50, 
    'that': 0.10, 
    'from': 0.20
})
>>> transform(dist, lambda word, prob: word.startswith('t'))
StatsCounter({True: 0.6, False: 0.4})
eugene-eeo commented 9 years ago

OK, seems good and lightweight to me.