AutomatedProcessImprovement / log-distance-measures

Python package with event log distance and similarity metrics
Apache License 2.0
5 stars 1 forks source link

Implement directly-follows EMD #3

Closed david-chapela closed 1 year ago

david-chapela commented 1 year ago

Implement EMD metric to structurally compare two event logs.

  1. Compute the frequency of each trigram in the event log (three consecutive activity instances, e.g. ABD or ACD).
  2. Build a histogram with their frequencies, where each column is a trigram.
  3. Compare the histograms of each event log (warning: there may be trigrams in one histogram not present in the other) using the EMD without distance penalizing.
  4. Generalize algorithm to any window size.
marlondumas commented 1 year ago

When computing the trigrams, it makes sense to include the trigrams (NULL, NULL, A), (NULL, A, B) to represent the start of a trace, and the trigrams (X, NULL, NULL) and (X, Y, NULL) for the end of the trace. Otherwise, you will somehow lose information, particularly in the case of short traces. This trick, is akin to adding artificial start and end events when computing the DFG.