Filter events based on lifecycle_id logic

ablack3 commented 5 years ago

Thanks so much for this helpful R package! How would you suggest cleaning up events based on the lifecycle_id variable?

For example, suppose I have an activity that should always have a "start" and a "complete" event. However, my event log is a bit messy, and occasionally a case has only a "start" or only a "complete" event but not both. How would you suggest I filter activities to ensure that every activity has both a "start" and a "complete" event?

This seems similar to what the filter_precedence function does, but I want to filter events based on the ordering of the lifecycle_id within each activity.

I can create a reproducible example if that would be helpful.

gertjanssenswillen commented 5 years ago

Thanks for your message.

Currently there is no direct function to do this, but it seems useful to create some specific lifecycle functionality.

For now, you could do it manually as follows.

log %>%
  group_by_activity_instance() %>%
  filter(any(lifecycle_column_name == "start"), any(lifecycle_column_name == "complete")) %>%
  ungroup_eventlog()

Note that you need to replace lifecycle_column_name with the actual name of the lifecycle (start/complete) column in your log, and that this approach will be somewhat slow. I'll try to create faster, more direct filtering functions as soon as I can.

ablack3 commented 5 years ago

Thanks @gertjanssenswillen. To give you more detail on the problem I'm trying to solve, I have created a reproducible example with my solution. The problem is that I do not have an activity_instance_id in my event log; I only have the activity, case_id, status, and timestamp. I also have some bad data that I would like to filter out. In the example below, the bad data are activities with a "complete" status but no matching "start" status.

I think I have solved the data cleaning issue by adjusting the view so that the activity becomes the case and the status becomes the activity. Then I am able to use the filter_trim function to trim activities so that they always have a start and a complete status. Assigning an activity_instance_id seems tricky to me, but I think I managed to figure it out in this simple example. I'm not sure if my approach generalizes to more real-world scenarios, though. Is a missing activity_instance_id a common issue in this type of data analysis?


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(bupaR)
#> Loading required package: edeaR
#> Loading required package: eventdataR
#> Loading required package: processmapR
#> Loading required package: xesreadR
#> Loading required package: processmonitR
#> Loading required package: petrinetR
#> 
#> Attaching package: 'bupaR'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

log <- tibble::tribble(
~patient,     ~activity,      ~timestamp,            ~status, 
"John Doe", "surgery",    "2017-05-10 08:38:21", "complete", 
"John Doe", "surgery",    "2017-05-10 08:53:16", "start",    
"John Doe", "surgery",    "2017-05-10 09:25:19", "complete", 
"John Doe", "treatment",    "2017-05-10 10:01:00", "complete",    
"John Doe", "treatment",    "2017-05-10 10:01:25", "start",    
"John Doe", "treatment",    "2017-05-10 10:35:18", "complete", 
"John Doe", "surgery",    "2017-05-10 10:41:35", "start",    
"John Doe", "surgery",    "2017-05-10 11:05:56", "complete")  %>% 
  mutate(timestamp = lubridate::as_datetime(timestamp))

print(log)
#> # A tibble: 8 x 4
#>   patient  activity  timestamp           status  
#>   <chr>    <chr>     <dttm>              <chr>   
#> 1 John Doe surgery   2017-05-10 08:38:21 complete
#> 2 John Doe surgery   2017-05-10 08:53:16 start   
#> 3 John Doe surgery   2017-05-10 09:25:19 complete
#> 4 John Doe treatment 2017-05-10 10:01:00 complete
#> 5 John Doe treatment 2017-05-10 10:01:25 start   
#> 6 John Doe treatment 2017-05-10 10:35:18 complete
#> 7 John Doe surgery   2017-05-10 10:41:35 start   
#> 8 John Doe surgery   2017-05-10 11:05:56 complete

# filter out bad data (activities without all needed timestamps)
log2 <- log %>% 
  # mutate(activity_status = paste0(activity, "-", status)) %>% 
  simple_eventlog(case_id = "activity", activity_id = "status", timestamp = "timestamp") %>% 
  filter_trim(start_activities = "start", end_activities = "complete") %>% 
  select(patient, activity, timestamp, status, force_df = TRUE)

print(log2)
#> # A tibble: 6 x 4
#>   patient  activity  timestamp           status  
#>   <chr>    <chr>     <dttm>              <fct>   
#> 1 John Doe surgery   2017-05-10 08:53:16 start   
#> 2 John Doe surgery   2017-05-10 09:25:19 complete
#> 3 John Doe treatment 2017-05-10 10:01:25 start   
#> 4 John Doe treatment 2017-05-10 10:35:18 complete
#> 5 John Doe surgery   2017-05-10 10:41:35 start   
#> 6 John Doe surgery   2017-05-10 11:05:56 complete

# create the activity instance id
log3 <- log2 %>% 
  mutate(start_time = if_else(status == "start", timestamp, lag(timestamp))) %>% 
  mutate(tmp = paste(start_time, patient, activity, sep = "-")) %>% 
  mutate(activity_instance_id = dense_rank(tmp))

print(log3)
#> # A tibble: 6 x 7
#>   patient activity timestamp           status start_time          tmp  
#>   <chr>   <chr>    <dttm>              <fct>  <dttm>              <chr>
#> 1 John D… surgery  2017-05-10 08:53:16 start  2017-05-10 08:53:16 2017…
#> 2 John D… surgery  2017-05-10 09:25:19 compl… 2017-05-10 08:53:16 2017…
#> 3 John D… treatme… 2017-05-10 10:01:25 start  2017-05-10 10:01:25 2017…
#> 4 John D… treatme… 2017-05-10 10:35:18 compl… 2017-05-10 10:01:25 2017…
#> 5 John D… surgery  2017-05-10 10:41:35 start  2017-05-10 10:41:35 2017…
#> 6 John D… surgery  2017-05-10 11:05:56 compl… 2017-05-10 10:41:35 2017…
#> # ... with 1 more variable: activity_instance_id <int>

# map to log for analysis
log4 <- log3 %>% 
  select(patient, activity, timestamp, activity_instance_id, status) %>% 
  mutate(resource = NA) %>% 
  eventlog(case_id = "patient", 
           activity_id = "activity", 
           timestamp = "timestamp", 
           activity_instance_id = "activity_instance_id", 
           lifecycle_id = "status",
           resource_id = "resource")

print(log4)
#> Event log consisting of:
#> 6 events
#> 1 traces
#> 1 cases
#> 2 activities
#> 3 activity instances
#> 
#> # A tibble: 6 x 7
#>   patient activity timestamp           activity_instan… status resource
#>   <chr>   <fct>    <dttm>              <chr>            <fct>  <fct>   
#> 1 John D… surgery  2017-05-10 08:53:16 1                start  <NA>    
#> 2 John D… surgery  2017-05-10 09:25:19 1                compl… <NA>    
#> 3 John D… treatme… 2017-05-10 10:01:25 2                start  <NA>    
#> 4 John D… treatme… 2017-05-10 10:35:18 2                compl… <NA>    
#> 5 John D… surgery  2017-05-10 10:41:35 3                start  <NA>    
#> 6 John D… surgery  2017-05-10 11:05:56 3                compl… <NA>    
#> # ... with 1 more variable: .order <int>
summary(log4)
#> Number of events:  6
#> Number of cases:  1
#> Number of traces:  1
#> Number of distinct activities:  2
#> Average trace length:  6
#> 
#> Start eventlog:  2017-05-10 08:53:16
#> End eventlog:  2017-05-10 11:05:56
#>    patient               activity   timestamp                  
#>  Length:6           surgery  :4   Min.   :2017-05-10 08:53:16  
#>  Class :character   treatment:2   1st Qu.:2017-05-10 09:34:20  
#>  Mode  :character                 Median :2017-05-10 10:18:21  
#>                                   Mean   :2017-05-10 10:07:08  
#>                                   3rd Qu.:2017-05-10 10:40:00  
#>                                   Max.   :2017-05-10 11:05:56  
#>  activity_instance_id      status  resource     .order    
#>  Length:6             complete:3   NA's:6   Min.   :1.00  
#>  Class :character     start   :3            1st Qu.:2.25  
#>  Mode  :character                           Median :3.50  
#>                                             Mean   :3.50  
#>                                             3rd Qu.:4.75  
#>                                             Max.   :6.00

Created on 2018-11-08 by the reprex package (v0.2.1)

gertjanssenswillen commented 5 years ago

Hi Adam, sorry for the late reply.

Firstly, yes, missing activity instance IDs are a common problem. You can create an artificial one using heuristics, e.g. after a "complete", the next occurrence of the activity starts a different instance, or if the time between two events of the same activity is greater than a set amount, they belong to different instances.
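
The two heuristics described above could be sketched in plain dplyr roughly as follows. This is only an illustration, not functionality from edeaR or bupaR: the column names follow the example earlier in this thread, and max_gap_secs, prev_status, gap_secs, new_instance and instance_in_group are hypothetical names introduced here. It also assumes that events of the same patient and activity never overlap.

```r
library(dplyr)

# Toy data; column names mirror the reprex earlier in the thread.
events <- tibble::tribble(
  ~patient,   ~activity,   ~timestamp,            ~status,
  "John Doe", "surgery",   "2017-05-10 08:53:16", "start",
  "John Doe", "surgery",   "2017-05-10 09:25:19", "complete",
  "John Doe", "treatment", "2017-05-10 10:01:25", "start",
  "John Doe", "treatment", "2017-05-10 10:35:18", "complete",
  "John Doe", "surgery",   "2017-05-10 10:41:35", "start",
  "John Doe", "surgery",   "2017-05-10 11:05:56", "complete"
) %>%
  mutate(timestamp = as.POSIXct(timestamp, tz = "UTC"))

# Hypothetical threshold: a gap of more than one hour starts a new instance.
max_gap_secs <- 60 * 60

with_instances <- events %>%
  arrange(patient, activity, timestamp) %>%
  group_by(patient, activity) %>%
  mutate(
    prev_status = lag(status),
    gap_secs = as.numeric(difftime(timestamp, lag(timestamp), units = "secs")),
    # A new instance starts at the first event of the group, right after
    # a "complete", or after a gap longer than the threshold.
    new_instance = is.na(prev_status) |
      prev_status == "complete" |
      gap_secs > max_gap_secs,
    instance_in_group = cumsum(new_instance)
  ) %>%
  ungroup() %>%
  # Turn (patient, activity, running counter) into one integer id.
  mutate(activity_instance_id = dense_rank(
    paste(patient, activity, instance_in_group, sep = "|")
  )) %>%
  arrange(timestamp) %>%
  select(patient, activity, timestamp, status, activity_instance_id)

print(with_instances)
```

The ids produced this way are only unique labels; their numbering follows the sort order of the pasted key rather than chronological order.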

Secondly, I will add some lifecycle filters later today. The ones that come to mind immediately are:

Filter specific statuses
Filter activity instances that have a list of statuses
Trim activity instances to specific statuses

These, however, will mostly assume that there is an activity instance id. Are there any more filters you think would be useful for the lifecycle?
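
To make the three filter ideas concrete, here is a rough sketch of what each one means, written in plain dplyr on a toy data frame. It is not the edeaR implementation: the data, the "schedule" status, and the object names (only_start_complete, complete_instances, trimmed) are assumptions made for illustration, and an activity_instance_id column is assumed to exist already.

```r
library(dplyr)

# Toy data: instance 1 has an orphan "complete", instance 3 has an
# extra "schedule" event before its "start".
events <- tibble::tribble(
  ~activity_instance_id, ~activity,   ~status,    ~timestamp,
  1,                     "surgery",   "complete", "2017-05-10 08:38:21",
  2,                     "surgery",   "start",    "2017-05-10 08:53:16",
  2,                     "surgery",   "complete", "2017-05-10 09:25:19",
  3,                     "treatment", "schedule", "2017-05-10 09:50:00",
  3,                     "treatment", "start",    "2017-05-10 10:01:25",
  3,                     "treatment", "complete", "2017-05-10 10:35:18"
) %>%
  mutate(timestamp = as.POSIXct(timestamp, tz = "UTC"))

# 1. Filter specific statuses: keep only events with the given statuses.
only_start_complete <- events %>%
  filter(status %in% c("start", "complete"))

# 2. Filter activity instances that have a list of statuses:
#    keep every event of instances containing all required statuses.
required <- c("start", "complete")
complete_instances <- events %>%
  group_by(activity_instance_id) %>%
  filter(all(required %in% status)) %>%
  ungroup()

# 3. Trim activity instances to specific statuses: within each instance,
#    keep events from the first "start" up to the last "complete".
trimmed <- events %>%
  arrange(activity_instance_id, timestamp) %>%
  group_by(activity_instance_id) %>%
  filter(any(status == "start"), any(status == "complete")) %>%
  filter(
    row_number() >= min(which(status == "start")),
    row_number() <= max(which(status == "complete"))
  ) %>%
  ungroup()

print(only_start_complete)
print(complete_instances)
print(trimmed)
```

In this sketch the second filter drops the orphan "complete" of instance 1 but keeps the "schedule" event of instance 3, while the third also removes that "schedule" event.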

Also, it might be a good idea to implement some of the heuristics for activity instance creation I mentioned above (although that won't be for today).

gertjanssenswillen commented 5 years ago

The above-mentioned filters on lifecycle have been added.

gertjanssenswillen / edeaR

Filter events based on lifecycle_id logic #18