rodricios opened this issue 9 years ago
cc: @eugene-eeo
Using sets may be a better idea.
Sets do not appear to be hashable. Besides, sets do not maintain multiple occurrences of an event. Think of trying to keep track of pairs of coin flips:
{
    ('heads', 'heads'): 1,
    ('tails', 'tails'): 1
}
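To make the distinction concrete (a minimal sketch of my own, not code from the library): tuples are hashable and preserve order and repeats, while a plain set is neither a valid key nor able to record multiple occurrences:

>>> from collections import Counter
>>> flips = Counter()
>>> flips[('heads', 'heads')] += 1   # tuples are hashable and keep both occurrences
>>> flips[('tails', 'tails')] += 1
>>> flips[{'heads'}] += 1
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'set'

Note also that {'heads', 'heads'} collapses to {'heads'}, so even a hashable frozenset would lose the repeated outcome.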
Oh alright. I was about to suggest a frozenset, but meh. I guess it wouldn't be an issue, though. Set comparison may be more expensive than, or just as expensive as, tuple comparison, the latter of which is O(n). My only worry is that it will slow things down and degrade performance. Maybe we should make it another class? Like ChainDict?
Hmm, I'm checking it now.
Ok, so I checked out ChainDict, and it doesn't necessarily address what we're going for, but it does have some interesting ideas. Besides that, I'm thinking we should make our own class composed of elements from both defaultdict and Counter (or StatsCounter).
The end goal is that, with this class, we can essentially create so-called "Markov Chains" (chains of conditional probability distributions) on the fly (i.e. by simply accessing the n-th nested key).
The best example I can think of is using it to create a "sentence generator" or "autocomplete":
>>> dists['the']
{
    'tower': 20,
    'city': 10,
    ...
}
Just to clarify, the above code is equivalent to the following statement: "tower" occurs 20 times after "the", "city" occurs 10 times after "the", etc.
The above code example can also be seen as a "2-state Markov Chain" (AKA a list of conditional frequency distributions).
Here's a 3-state Markov Chain:
>>> dists['the']['tower']
{
    'was': 8,
    'fell': 4,
    ...
}
Rule of thumb for crossing over from "n-state" Markov Chain (MC) jargon to an actual Python implementation:
So the above two examples are instances of nested dicts (chains of frequency distributions, i.e. Markov Chains) that have already been "trained" (as in, they have been keyed and valued down to the n-th nesting).
"Training", ideally, should be easy to accomplish. I'll write more about this in a bit.
Okay so I've read up a little about Markov chains, and not to be nitpicking but aren't the values supposed to be probability distributions instead of frequencies? Nested states can be expressed nicely in directed graphs:
state1 -(0.5)-> state2 -(0.25)-> state3
I personally prefer graphs because:
Ultimately it is a question of how heavy you want your library to be. If it is not desirable to include it in the StatsCounter "core" we can perhaps have a separate repository for this kind of thing.
Update:
class Chain(dict):
    def __init__(self, value, freqs=()):
        # `value` is the payload carried by this node; the dict itself maps
        # successor nodes to transition probabilities.
        self.value = value
        dict.__init__(self, freqs)

    def __hash__(self):
        # Hash by identity so Chain instances can be used as dict keys,
        # even though dicts are normally unhashable.
        return hash(id(self))

    def to(self, *nodes):
        # Probability of following the given path of nodes from this one:
        # the product of the transition probabilities along the way.
        acc = 1
        prev = self
        for item in nodes:
            prob = prev[item]
            acc *= prob
            prev = item
        return acc
Kind of elegant:
>>> g = Chain(1)
>>> h = Chain(2)
>>> g[h] = 0.5
>>> h[g] = 0.1
>>> g.to(h, g)
0.05
The select method I was speaking about:
def select(self, func, maximum=100):
    # Walk the chain, letting `func` pick a transition probability out of
    # each node's outgoing distribution, and collect the visited nodes.
    memo = [self]
    this = self
    while maximum:
        # Invert the mapping: probability -> successor node.
        dist = {this[k]: k for k in this}
        if not dist:
            break
        # `func` (e.g. max) chooses a probability; follow that edge.
        best = dist[func(dist)]
        memo.append(best)
        this = best
        maximum -= 1
    return memo
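For example, assuming select is attached to Chain as a method (my own sketch, not tested against the actual code), walking the most probable successors with max might look like:

>>> a, b, c = Chain('a'), Chain('b'), Chain('c')
>>> a[b] = 0.7
>>> a[c] = 0.3
>>> b[c] = 1.0
>>> [node.value for node in a.select(max)]
['a', 'b', 'c']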
Okay so I've read up a little about Markov chains, and not to be nitpicking but aren't the values supposed to be probability distributions instead of frequencies
You are correct :) (frequency distributions are unnormalized probability distributions - I consider them to be practically equivalent)
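For anyone following along, the relationship is just a division by the total; a hypothetical normalize helper (not an existing StatsCounter method) would be:

def normalize(freqs):
    # Turn a frequency distribution into a probability distribution
    # by dividing each count by the grand total.
    total = sum(freqs.values())
    return {k: v / total for k, v in freqs.items()}

>>> normalize({'tower': 20, 'city': 10, 'river': 10})
{'tower': 0.5, 'city': 0.25, 'river': 0.25}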
Nested states can be expressed nicely in directed graphs:
state1 -(0.5)-> state2 -(0.25)-> state3
Yes, I agree that they can be expressed nicely as graphs, but that is mostly true in a visual, diagrammatic sense.
They tend to lend themselves to very elegant solutions, and very nice queries, e.g. "give me the most probable path". The engine will just have to hop across all the most probable (maximum) nodes.
"Most probable path" operation are a series of argmax
calls, which StatsCounter
allows for already.
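To make that concrete (a sketch of my own using a plain dict and max, rather than any particular StatsCounter method):

def most_probable_path(chain, start, steps):
    # Hop to the most probable successor at each step: one argmax call
    # per hop over that node's outgoing distribution.
    path = [start]
    current = start
    for _ in range(steps):
        successors = chain.get(current)
        if not successors:
            break
        current = max(successors, key=successors.get)  # the argmax
        path.append(current)
    return path

>>> chain = {'the': {'tower': 20, 'city': 10}, 'tower': {'was': 8, 'fell': 4}}
>>> most_probable_path(chain, 'the', 2)
['the', 'tower', 'was']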
Can point to each other indefinitely and intuitively - though the same can also be argued about nesting: you can nest indefinitely, but it feels "hacky".
Anything non-vanilla is essentially a hack, at first. Tables are, for whatever reason, the de facto representation of probability distributions (look at pg. 26 of Kohler's book). Implementing a graph representation of tables is a step away from an elegant solution IMO (also refer to Kohler's book; she basically tries to present a graph representation of distributions, and, well, it's a notoriously difficult book to read :/).
Agreed. Should we include the select method? It only makes sense to do so.
Hmm, I believe so, but I would like us to implement select (or something similar) as part of the ordinary accessing-by-key usage.
For example, from the user's perspective, there should be no obvious difference between a double-nested prob. dist. (3-state MC) and a single-nested prob. dist. (2-state MC).
By that I mean this, if you trained a 3-state MC, you should still be able to use it as a 2-state MC:
>>> # 3-state
>>> dists['the']['tower']
{
    'was': 8,
    'fell': 4,
    ...
}
>>>
>>> # 2-state
>>> dists['the']
{
    'tower': 20,
    'city': 10,
    ...
}
Now, the above requirement (I'm calling it a requirement to make this feature official; do you agree with that distinction?) can obviously be addressed by overriding __repr__ and __getitem__.
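A rough sketch of what that could look like (my own illustration; DistNode is a hypothetical name, and summing the nested counts is an assumption on my part): the repr collapses one level of nesting into totals, while item access still returns the nested distribution.

class DistNode(dict):
    def total(self):
        # Sum of all counts beneath this node (leaves are plain numbers).
        return sum(v.total() if isinstance(v, DistNode) else v
                   for v in self.values())

    def __repr__(self):
        # Show each child with its aggregate count, hiding deeper nesting,
        # so a 3-state chain "reads" like a 2-state one.
        return repr({k: (v.total() if isinstance(v, DistNode) else v)
                     for k, v in self.items()})

>>> dists = DistNode({'the': DistNode({'tower': DistNode({'was': 8, 'fell': 4}),
...                                    'city': DistNode({'was': 6, 'grew': 4})})})
>>> dists['the']            # 2-state view: aggregated counts
{'tower': 12, 'city': 10}
>>> dists['the']['tower']   # 3-state view: the nested distribution
{'was': 8, 'fell': 4}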
The second feature I would like to include is the ability to transform a distribution using some querying language or, preferably, a lambda expression.
I mean this:
>>> dists
{
    'of': 0.20,
    'the': 0.50,
    'that': 0.10,
    'from': 0.20
}
>>> # Returns a new (binary?) probability distribution
>>> dists.transform(lambda word, prob: word.startswith('t'))
{
    True: 0.60,
    False: 0.40
}
In essence, we can use this tool to construct more general probability distributions. The above allows us to answer questions like:
"Given Book X, or a freq./prob. distribution of words occurring in Book X, what is the probability that a word picked at random from Book X will start with 't'?"
Here's a basic implementation of this:
# If there's a more appropriate word to describe what "transform" is doing,
# let me know (i.e. a term used in the industry)
def transform(dist, key):
    newdist = StatsCounter()
    for k, v in dist.items():
        # Group each outcome under key(k, v) and accumulate its probability mass.
        newdist[key(k, v)] += v
    return newdist
>>> dist = StatsCounter({
...     'of': 0.20,
...     'the': 0.50,
...     'that': 0.10,
...     'from': 0.20
... })
>>> transform(dist, lambda word, prob: word.startswith('t'))
StatsCounter({True: 0.6, False: 0.4})
OK, seems good and lightweight to me.
The reason is simple. When counting the occurrence of some event, where the event can be a collection of things (think: "what's the probability of "a" and "b", or "b" and "a", occurring?"), we want to be able to record either ("a", "b") or ("b", "a") as the same event having occurred once. So instead of:
We have:
Here's one possible implementation:
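A minimal sketch of the idea (my own illustration, not necessarily the implementation that was proposed here): key a Counter by frozensets so that both orderings collapse into a single event.

>>> from collections import Counter
>>> events = Counter()
>>> events[frozenset(('a', 'b'))] += 1
>>> events[frozenset(('b', 'a'))] += 1   # hashes to the same key as above
>>> events[frozenset(('a', 'b'))]
2

The caveat raised at the top of the thread still applies: frozenset(('heads', 'heads')) collapses to frozenset({'heads'}), so repeated outcomes within a single event are lost.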