Reproduction of the research results of the paper "The Impact of Process Complexity on Process Performance: A Study Using Event Log Data" as part of a group project in Advanced Process Mining at the University of Mannheim.
Paper: Vidgof, M., Wurm, B., Mendling, J. (2023). The Impact of Process Complexity on Process Performance: A Study Using Event Log Data. In: Di Francescomarino, C., Burattin, A., Janiesch, C., Sadiq, S. (eds) Business Process Management. BPM 2023. Lecture Notes in Computer Science, vol 14159. Springer, Cham. https://doi.org/10.1007/978-3-031-41620-0_24
The following steps are necessary to replicate the results of the paper:
To recreate the CSV file saved by the authors in the Complexity Data repository, follow these steps.
Download all the original event log files used by the authors.
After that, unzip them and locate the XES files. For example, the event log file for the BPI Challenge 2017 is located at BPI Challenge 2017_1_all.zip/BPI Challenge 2017.xes.gz/BPIChallenge2017.xes
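If you prefer to decompress the .xes.gz archives programmatically, here is a minimal Python sketch using only the standard library (the paths follow the BPI Challenge 2017 example above and are assumptions for other logs):

```python
import gzip
import shutil

# Decompress the gzipped XES log into a plain .xes file.
# Paths follow the BPI Challenge 2017 example; adjust for other event logs.
src = r"D:\BPI Challenge 2017_1_all\BPI Challenge 2017.xes.gz"
dst = r"D:\BPI Challenge 2017_1_all\BPIChallenge2017.xes"

with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)
```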
Clone the following repository to get the scripts necessary for reproduction: https://github.com/MaxVidgof/process-complexity
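For example:

```
git clone https://github.com/MaxVidgof/process-complexity
```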
It is possible that the installed pm4py version is not the most recent one. With an older pm4py version, the Python scripts in the process-complexity project (provided by the paper authors) will throw errors. Execute

```
python -m pip freeze
```

to check the version of the installed pm4py package. If the pm4py version is not the latest, uninstall the package and reinstall it:

```
python -m pip uninstall pm4py
python -m pip install -Iv pm4py==2.7.11.4
```
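Once reinstalled, a quick smoke test along these lines can confirm that pm4py works before running the authors' scripts (a sketch, not part of their code; it assumes recent pm4py releases, which expose read_xes and a __version__ attribute, and reuses the BPI Challenge 2017 path from above):

```python
import pm4py

# Smoke test: print the installed version and read the decompressed XES log.
# The path is the BPI Challenge 2017 example from above; adjust as needed.
print(pm4py.__version__)
log = pm4py.read_xes(r"D:\BPI Challenge 2017_1_all\BPI Challenge 2017.xes\BPIChallenge2017.xes")
print(type(log))
```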
Next, run the monthly.py script on each event log with the --save-csv option, i.e. execute

```
python .\monthly.py -f "D:\BPI Challenge 2017_1_all\BPI Challenge 2017.xes\BPIChallenge2017.xes" --save-csv
```

This will create a file named BPIChallenge2017_metrics.csv. The generated CSV file contains fewer columns than the original merged.csv file (in the project Complexity Data). Then run the over-time.py script:

```
python .\over-time.py -f "D:\BPI Challenge 2017_1_all\BPI Challenge 2017.xes\BPIChallenge2017.xes"
```

Note that --save-csv is not needed here. After generating all the CSV files with the over-time.py script, we need to merge them into a single file. That file will be similar to the merged.csv file in the project Complexity Data.
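For the merging step, a sketch along these lines can be used (the folder path, file pattern, and output name are assumptions about your local setup):

```python
import glob

import pandas as pd

# Concatenate all per-log metric CSVs into one file similar to merged.csv.
# Folder path and file pattern are assumptions; adjust to your setup.
csv_files = sorted(glob.glob(r"D:\complexity-csv\*_metrics.csv"))
merged = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
merged.to_csv("merged.csv", index=False)
```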
We observed some minor floating-point differences between our calculations and the authors'.
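To confirm that such deviations are only floating-point noise, a comparison along the following lines can be used (a sketch; the file names, row order, and tolerance are assumptions):

```python
import numpy as np
import pandas as pd

# Compare our merged CSV with the authors' merged.csv within a tolerance.
# Assumes both files contain the same rows in the same order.
ours = pd.read_csv("merged.csv")
theirs = pd.read_csv("merged_authors.csv")

numeric_cols = ours.select_dtypes(include="number").columns
close = np.isclose(ours[numeric_cols], theirs[numeric_cols], rtol=1e-9)
print("All numeric values match within tolerance:", close.all())
```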
Note: The following replication code consists of .ipynb notebooks that were run in a Google Colab session with a CPU runtime. To facilitate execution, we strongly recommend running them in the same environment.
Be aware that our GitHub repository contains two main .ipynb notebooks:
- Regression_PROVIDED_Authors_Data.ipynb
- Only_industry_variable_as_predictor.ipynb
Regression_PROVIDED_Authors_Data.ipynb is the main notebook; it conducts the regression analysis in the same manner as the authors did in their project. We created three copies of the notebook that differ only in the input CSV file. To obtain the same results as we did, you can either duplicate the provided notebook and change the input file, or re-run the notebook and change the input file after every iteration.
This first notebook shows the procedure with the data the authors provide publicly on their GitHub. You can find the other input files in our data subfolders.
To begin, upload the required files to the notebook's environment. If you want to run this code yourself, adjust the section at the beginning where the CSV file location is specified to a local folder you control. If all previous steps are done correctly, the code's execution is straightforward.
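A minimal sketch of this setup step in Colab (the file name merged.csv is illustrative):

```python
import pandas as pd
from google.colab import files

# Upload the input CSV into the Colab session (a file picker will open).
uploaded = files.upload()

# Point pandas at the uploaded file; adjust the name to the file you chose.
df = pd.read_csv("merged.csv")
print(df.shape)
```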
The notebooks build the models using forward, backward, and both-direction (stepwise) selection for two different subsets: one with and one without industry encoded as a dummy variable. In addition, the models are calculated for all variables, for minimal models, and for significant variables only; a sketch of one such selection procedure follows below.
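As an illustration of one of these selection methods, here is a minimal sketch of backward elimination by p-value with statsmodels (not the authors' exact code; the significance threshold and the example column names are assumptions):

```python
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05):
    """Drop the least significant predictor until all p-values are below alpha."""
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return model, cols
        cols.remove(worst)
    return None, []

# Hypothetical usage with merged.csv and an assumed target column:
# df = pd.read_csv("merged.csv")
# y = df["avg_duration"]
# X = pd.get_dummies(df.drop(columns=["avg_duration"]), drop_first=True)
# model, selected = backward_elimination(X.astype(float), y)
# print(model.summary())
```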
The other notebook, Only_industry_variable_as_predictor.ipynb, was used to validate or falsify the authors' hypothesis that the industry dummy variable alone accounts for 80% of the total variance. It has preprocessing similar to the first notebook, but model building is much simpler since there is only one independent variable.
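A minimal sketch of this single-predictor check, assuming merged.csv contains an industry column and a performance target called avg_duration (both column names are assumptions):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("merged.csv")

# Regress the performance measure on the industry dummies alone and check
# how much of the variance (R^2) they explain by themselves.
y = df["avg_duration"]
X = pd.get_dummies(df["industry"], drop_first=True).astype(float)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(f"R^2 with industry dummies only: {model.rsquared:.3f}")
```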