Thanks for your message.
Currently there is no direct function to do this, but it seems useful to create some specific lifecycle functionality.
For now, you could do it manually as follows.
log %>%
group_by_activity_instance() %>%
filter(any(lifecycle_column_name == "start"), any(lifecycle_column_name == "complete")) %>%
ungroup_eventlog()
Note that you need to use the actual name of the column that holds the start/complete values, and that this approach will be somewhat slow. I'll try to add faster, dedicated filtering functions when I get the chance.
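For readers who want to try the idea without an event log object, here is a minimal sketch of the same filter in plain dplyr. The toy data and the column names (activity_instance_id, status) are assumptions made up for illustration, not part of bupaR; it only works when an instance id column is already available.

```r
library(dplyr)

# Hypothetical toy data: instance 2 has a "complete" event but no "start".
events <- data.frame(
  activity_instance_id = c(1, 1, 2, 3, 3),
  activity = c("surgery", "surgery", "treatment", "surgery", "surgery"),
  status = c("start", "complete", "complete", "start", "complete"),
  stringsAsFactors = FALSE
)

# Keep only the instances that contain both a "start" and a "complete" event.
complete_instances <- events %>%
  group_by(activity_instance_id) %>%
  filter(any(status == "start"), any(status == "complete")) %>%
  ungroup()

print(complete_instances)
```

The grouped filter evaluates both conditions once per instance, so an instance is either kept whole or dropped whole.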
Thanks @gertjanssenswillen. To give you more detail on the problem I'm trying to solve, I have created a reproducible example with my solution. The problem is that I do not have an activity_instance_id in my event log. I only have the activity, case_id, status, and timestamp. I also have some bad data that I would like to filter out. In the example below, the bad data are activities with a "complete" status but no matching "start" status. I think I have solved the data cleaning issue by adjusting the view so that the activity becomes the case and the status becomes the activity. Then I am able to use the filter_trim function to trim activities so that they always have a start and complete status.
Assigning an activity_instance_id seems tricky to me, but I think I managed to figure it out in this simple example. I'm not sure whether my approach generalizes to more real-world scenarios, though. Is a missing activity_instance_id a common issue in this type of data analysis?
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(bupaR)
#> Loading required package: edeaR
#> Loading required package: eventdataR
#> Loading required package: processmapR
#> Loading required package: xesreadR
#> Loading required package: processmonitR
#> Loading required package: petrinetR
#>
#> Attaching package: 'bupaR'
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following object is masked from 'package:utils':
#>
#> timestamp
log <- tibble::tribble(
~patient, ~activity, ~timestamp, ~status,
"John Doe", "surgery", "2017-05-10 08:38:21", "complete",
"John Doe", "surgery", "2017-05-10 08:53:16", "start",
"John Doe", "surgery", "2017-05-10 09:25:19", "complete",
"John Doe", "treatment", "2017-05-10 10:01:00", "complete",
"John Doe", "treatment", "2017-05-10 10:01:25", "start",
"John Doe", "treatment", "2017-05-10 10:35:18", "complete",
"John Doe", "surgery", "2017-05-10 10:41:35", "start",
"John Doe", "surgery", "2017-05-10 11:05:56", "complete") %>%
mutate(timestamp = lubridate::as_datetime(timestamp))
print(log)
#> # A tibble: 8 x 4
#> patient activity timestamp status
#> <chr> <chr> <dttm> <chr>
#> 1 John Doe surgery 2017-05-10 08:38:21 complete
#> 2 John Doe surgery 2017-05-10 08:53:16 start
#> 3 John Doe surgery 2017-05-10 09:25:19 complete
#> 4 John Doe treatment 2017-05-10 10:01:00 complete
#> 5 John Doe treatment 2017-05-10 10:01:25 start
#> 6 John Doe treatment 2017-05-10 10:35:18 complete
#> 7 John Doe surgery 2017-05-10 10:41:35 start
#> 8 John Doe surgery 2017-05-10 11:05:56 complete
# filter out bad data (activities without all needed timestamps)
log2 <- log %>%
# mutate(activity_status = paste0(activity, "-", status)) %>%
simple_eventlog(case_id = "activity", activity_id = "status", timestamp = "timestamp") %>%
filter_trim(start_activities = "start", end_activities = "complete") %>%
select(patient, activity, timestamp, status, force_df = TRUE)
print(log2)
#> # A tibble: 6 x 4
#> patient activity timestamp status
#> <chr> <chr> <dttm> <fct>
#> 1 John Doe surgery 2017-05-10 08:53:16 start
#> 2 John Doe surgery 2017-05-10 09:25:19 complete
#> 3 John Doe treatment 2017-05-10 10:01:25 start
#> 4 John Doe treatment 2017-05-10 10:35:18 complete
#> 5 John Doe surgery 2017-05-10 10:41:35 start
#> 6 John Doe surgery 2017-05-10 11:05:56 complete
# create the activity instance id
log3 <- log2 %>%
mutate(start_time = if_else(status == "start", timestamp, lag(timestamp))) %>%
mutate(tmp = paste(start_time, patient, activity, sep = "-")) %>%
mutate(activity_instance_id = dense_rank(tmp))
print(log3)
#> # A tibble: 6 x 7
#> patient activity timestamp status start_time tmp
#> <chr> <chr> <dttm> <fct> <dttm> <chr>
#> 1 John D… surgery 2017-05-10 08:53:16 start 2017-05-10 08:53:16 2017…
#> 2 John D… surgery 2017-05-10 09:25:19 compl… 2017-05-10 08:53:16 2017…
#> 3 John D… treatme… 2017-05-10 10:01:25 start 2017-05-10 10:01:25 2017…
#> 4 John D… treatme… 2017-05-10 10:35:18 compl… 2017-05-10 10:01:25 2017…
#> 5 John D… surgery 2017-05-10 10:41:35 start 2017-05-10 10:41:35 2017…
#> 6 John D… surgery 2017-05-10 11:05:56 compl… 2017-05-10 10:41:35 2017…
#> # ... with 1 more variable: activity_instance_id <int>
# map to log for analysis
log4 <- log3 %>%
select(patient, activity, timestamp, activity_instance_id, status) %>%
mutate(resource = NA) %>%
eventlog(case_id = "patient",
activity_id = "activity",
timestamp = "timestamp",
activity_instance_id = "activity_instance_id",
lifecycle_id = "status",
resource_id = "resource")
print(log4)
#> Event log consisting of:
#> 6 events
#> 1 traces
#> 1 cases
#> 2 activities
#> 3 activity instances
#>
#> # A tibble: 6 x 7
#> patient activity timestamp activity_instan… status resource
#> <chr> <fct> <dttm> <chr> <fct> <fct>
#> 1 John D… surgery 2017-05-10 08:53:16 1 start <NA>
#> 2 John D… surgery 2017-05-10 09:25:19 1 compl… <NA>
#> 3 John D… treatme… 2017-05-10 10:01:25 2 start <NA>
#> 4 John D… treatme… 2017-05-10 10:35:18 2 compl… <NA>
#> 5 John D… surgery 2017-05-10 10:41:35 3 start <NA>
#> 6 John D… surgery 2017-05-10 11:05:56 3 compl… <NA>
#> # ... with 1 more variable: .order <int>
summary(log4)
#> Number of events: 6
#> Number of cases: 1
#> Number of traces: 1
#> Number of distinct activities: 2
#> Average trace length: 6
#>
#> Start eventlog: 2017-05-10 08:53:16
#> End eventlog: 2017-05-10 11:05:56
#> patient activity timestamp
#> Length:6 surgery :4 Min. :2017-05-10 08:53:16
#> Class :character treatment:2 1st Qu.:2017-05-10 09:34:20
#> Mode :character Median :2017-05-10 10:18:21
#> Mean :2017-05-10 10:07:08
#> 3rd Qu.:2017-05-10 10:40:00
#> Max. :2017-05-10 11:05:56
#> activity_instance_id status resource .order
#> Length:6 complete:3 NA's:6 Min. :1.00
#> Class :character start :3 1st Qu.:2.25
#> Mode :character Median :3.50
#> Mean :3.50
#> 3rd Qu.:4.75
#> Max. :6.00
Created on 2018-11-08 by the reprex package (v0.2.1)
Hi Adam, sorry for the late reply.
Firstly, yes, missing activity instance ids are a common problem. You can create an artificial one using heuristics, e.g. after a "complete", the next occurrence of the activity will be a different instance, or if the time between two events of the same activity is greater than a set amount, they are different instances.
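A rough sketch of how these two heuristics could be combined in plain dplyr is shown here. It is not a bupaR function: the data are the cleaned events from the example earlier in this thread, and the one-hour threshold max_gap_secs and the helper column names are assumptions chosen for illustration.

```r
library(dplyr)

# Cleaned events from the example in this thread (no activity_instance_id yet).
events <- data.frame(
  patient = "John Doe",
  activity = c("surgery", "surgery", "treatment", "treatment", "surgery", "surgery"),
  status = c("start", "complete", "start", "complete", "start", "complete"),
  timestamp = as.POSIXct(
    c("2017-05-10 08:53:16", "2017-05-10 09:25:19", "2017-05-10 10:01:25",
      "2017-05-10 10:35:18", "2017-05-10 10:41:35", "2017-05-10 11:05:56"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Assumed threshold: a gap longer than this starts a new instance.
max_gap_secs <- 3600

with_ids <- events %>%
  arrange(patient, activity, timestamp) %>%
  group_by(patient, activity) %>%
  mutate(
    prev_status = dplyr::lag(status),
    gap_secs = as.numeric(difftime(timestamp, dplyr::lag(timestamp), units = "secs")),
    # A new instance begins at the first event of the group, right after a
    # "complete", or after a gap longer than the threshold.
    new_instance = is.na(prev_status) | prev_status == "complete" | gap_secs > max_gap_secs,
    instance_nr = cumsum(new_instance)
  ) %>%
  ungroup() %>%
  # Combine case, activity and counter into an id that is unique across the log.
  mutate(activity_instance_id = paste(patient, activity, instance_nr, sep = "_")) %>%
  select(patient, activity, status, timestamp, activity_instance_id)

print(with_ids)
```

The counter is computed per patient and activity, so interleaved activities do not interfere with each other. A suitable threshold depends on the process being logged, so the value used here should be treated as a placeholder.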
Secondly, I will add some lifecycle-filters later today. The ones that come to mind immediately are
These, however, will mostly assume that there is an activity instance id. Are there any other lifecycle filters you think would be useful?
Also, it might be a good idea to implement some of the heuristics for activity instance creation I mentioned above (although that won't be for today).
The above-mentioned filters on lifecycle have been added.
Thanks so much for this helpful R package! How would you suggest cleaning up events based on the lifecycle_id variable?
For example suppose I have an activity that should always have a "start" and "complete" event. However my event log is a bit messy and occasionally a case has only a "start" or only a "complete" event but not both. How would you suggest I filter activities to ensure that every activity has a "start" and "complete" event?
This seems similar to what the filter_precedence function does, but I want to filter events based on the ordering of the lifecycle_id within each activity.
I can create a reproducible example if that would be helpful.