I'm going to present the idea for our event detector. Briefly, I think we should use a clustering algorithm on the timestamped edits we have. Below I explain how we could do it.
Explanation
The underlying assumption is that edit dates are generated by the same family of distributions, with different parameters for each event.
We model the time at which an edit appears as a random variable Ed with density f_event. Our data is then a realization of the random variables Ed1, Ed2, …, Edn, which come from a mixture of f_event densities.
In our case, the density function f_event should have three parameters: r, d and t0 (a sketch of one candidate density follows this list):
r is the reaction rate. It describes how quickly people react by editing a page. Intuitively, this parameter depends on the page's activity (i.e. its update rate) and the "magnitude" of the event.
d is the decay rate. It describes how fast a Wikipedia page converges back to stability after an event occurs. Intuitively, it should be linked to the page's activity.
t0 is the event's date.
Once we know the density function (I mean the family here, e.g. Gamma or Normal), we can derive an EM algorithm that performs the clustering on our data points.
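Just to make the EM step concrete, here is a minimal sketch using scikit-learn's GaussianMixture (which runs EM internally) as a stand-in until we pick the real density family; the edit timestamps below are synthetic, only there to exercise the fit:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic edit times around two hypothetical events
rng = np.random.default_rng(0)
edit_timestamps = np.concatenate([
    rng.normal(loc=1000.0, scale=20.0, size=200),   # burst of edits after event 1
    rng.normal(loc=5000.0, scale=50.0, size=300),   # burst of edits after event 2
])
X = edit_timestamps.reshape(-1, 1)

# GaussianMixture fits the mixture by EM; each component stands in for one event
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)           # event assignment for each edit
event_dates = gm.means_.ravel()  # rough estimates of each event's t0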
To find how many events occur in a specific time frame, we could do the following:
results = []
# Arbitrary range: it should scan the range where we think the true number of events lies
for i in range(1, 100):
    data_with_clusters = cluster_algorithm(data, number_of_clusters=i)
    results.append(data_with_clusters)
clusters = find_most_likely_cluster(results)
I'm not entirely sure how to implement find_most_likely_cluster, but from a quick read it seems doable.
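One option I read about is to compare the candidate fits with an information criterion such as BIC: fit a mixture for each candidate number of events and keep the one with the lowest BIC. A sketch, again with GaussianMixture as a placeholder for the real density family (and taking the raw data rather than the precomputed results list from the loop above):

from sklearn.mixture import GaussianMixture

def find_most_likely_cluster(X, max_events=10):
    """Pick the number of events by BIC (lower is better); X has shape (n, 1)."""
    fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
            for k in range(1, max_events + 1)]
    return min(fits, key=lambda gm: gm.bic(X))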
Roadmap
What we need to do to perform our clustering:
Find a good distribution to model our data
Find the corresponding EM algorithm to cluster our data
Find a comparison function to choose the best clusters
I hope it’s clear. Ask me if it’s not.