Amaumaury / ada-2017

Repository for applied data analysis

Model for event detection #3

Open Amaumaury opened 6 years ago

Here is my idea for our event detector. In short, I think we should run a clustering algorithm on the timestamped edits we have. I explain below how we could do it.

Explanation

The underlying assumption is that the edit dates are all generated by the same family of distributions, with different parameters for each event.

We model the time at which an edit appears as a random variable Ed with density f_event. Our data is then a realization of the random variables Ed1, Ed2, …, Edn, which are drawn from a mixture of such f_event densities (one per event).

In our case, the density function f_event should have three parameters: r, d and t0.
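Written out, the mixture assumption could look like the following. The number of events K and the mixing weights π_k are notation I'm adding for illustration; they are not fixed anywhere above.

```latex
f(t) = \sum_{k=1}^{K} \pi_k \, f_{\text{event}}(t;\, r_k, d_k, t_{0,k}),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

Each edit time is then assumed to come from one of the K components, and clustering amounts to guessing which component produced which edit.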

Once we know our density function (I mean the family, e.g. Gamma or Normal), we could derive an EM algorithm that clusters our data points.
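To give a feel for what such an EM algorithm looks like, here is a minimal sketch for a one-dimensional mixture. It assumes the Normal family, which is a stand-in since we haven't picked f_event yet, so each cluster has a mean and a standard deviation instead of r, d and t0. The names em_gaussian_mixture, responsibilities and normal_pdf are hypothetical, and times is assumed to be a list of floats (e.g. days since the first edit).

```python
import math


def normal_pdf(x, mean, std):
    """Density of a Normal(mean, std) distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(times, weights, means, stds):
    """E-step: probability that each cluster produced each edit time."""
    resp = []
    for t in times:
        p = [w * normal_pdf(t, m, s) for w, m, s in zip(weights, means, stds)]
        total = sum(p) or 1e-300  # guard against a row that underflows to zero
        resp.append([v / total for v in p])
    return resp


def em_gaussian_mixture(times, n_clusters, n_iter=100):
    """Fit a 1-D Gaussian mixture with EM.

    Returns (weights, means, stds, labels), where labels[i] is the index of
    the most likely cluster for times[i].
    """
    n = len(times)
    # Deterministic initialisation: means at evenly spaced quantiles,
    # every cluster starting with the overall spread of the data.
    ordered = sorted(times)
    means = [ordered[int((k + 0.5) * n / n_clusters)] for k in range(n_clusters)]
    overall_mean = sum(times) / n
    overall_std = math.sqrt(sum((t - overall_mean) ** 2 for t in times) / n) or 1.0
    stds = [overall_std] * n_clusters
    weights = [1.0 / n_clusters] * n_clusters

    for _ in range(n_iter):
        resp = responsibilities(times, weights, means, stds)
        # M-step: re-estimate each cluster from its weighted data points.
        for k in range(n_clusters):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * t for r, t in zip(resp, times)) / nk
            var = sum(r[k] * (t - means[k]) ** 2 for r, t in zip(resp, times)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed cluster

    resp = responsibilities(times, weights, means, stds)
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return weights, means, stds, labels
```

Calling em_gaussian_mixture(times, 2) on edit times that form two well-separated bursts should return two means near the centres of the bursts. For another family (e.g. Gamma) the E-step stays the same and only the density and the M-step updates change.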

To find how many events are in a specific time frame we could do the following:

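results = []

# Arbitrary range. It should cover the values where we think the true number of events lies.
for i in range(1, 100):
    # cluster_algorithm is a placeholder for the EM clustering described above
    data_with_clusters = cluster_algorithm(data, number_of_clusters=i)
    results.append(data_with_clusters)

# Keep the clustering whose number of clusters best explains the data
clusters = find_most_likely_cluster(results)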

I’m not entirely sure how to implement find_most_likely_cluster, but from some quick reading it seems to be possible.
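One common way to do this kind of model selection is an information criterion such as BIC, which penalises the log-likelihood by the number of parameters so that adding clusters only wins when it really improves the fit. The sketch below is just one option, not something we have settled on. It assumes each candidate is a (n_clusters, log_likelihood) pair and that we pass the number of data points explicitly; both the result shape and the extra arguments are hypothetical and differ from the one-argument call above.

```python
import math


def bic(log_likelihood, n_params, n_points):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_points) - 2.0 * log_likelihood


def find_most_likely_cluster(results, n_points, params_per_cluster=3):
    """Return the (n_clusters, log_likelihood) pair with the lowest BIC.

    params_per_cluster defaults to 3 to match the three density
    parameters r, d and t0.
    """
    def score(result):
        n_clusters, log_likelihood = result
        # Each cluster has its own density parameters, plus the mixture
        # weights, which sum to one and so add n_clusters - 1 free parameters.
        n_params = n_clusters * params_per_cluster + (n_clusters - 1)
        return bic(log_likelihood, n_params, n_points)

    return min(results, key=score)
```

With this, going from 2 to 3 clusters is only accepted if the log-likelihood gain outweighs the penalty for the four extra parameters.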

Roadmap

What we need to do to perform our clustering:

I hope it’s clear. Ask me if it’s not.