AlexandraImbrisca opened 1 week ago

In order to evaluate different parsing strategies, we should create a benchmark that measures their speed. It should take into account:
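As a rough illustration of the shape such a benchmark could take, here is a minimal timing sketch (the `strategies` mapping and the `parse` callables are hypothetical placeholders, not existing open-MaStR code):

```python
import time

def run_benchmark(strategies, xml_folder):
    """Time each parsing strategy on the same subset of xml files.

    `strategies` maps a strategy name to a callable; both are
    hypothetical stand-ins for the approaches we end up comparing.
    """
    results = {}
    for name, parse in strategies.items():
        start = time.perf_counter()
        parse(xml_folder)
        results[name] = time.perf_counter() - start
    return results
```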
Hi @FlorianK13! I analysed our options for creating the benchmark and I'd like to hear your thoughts :)
Option 1: Using fake data
Option 2: Using the existing datasets as a base
@FlorianK13 what do you think? 🤔 If you think there is a need for option 1, I can definitely prioritize that :)
FlorianK13 replied:

Option 2 seems good. It is less work, and all the data is open data, hence there is no need for anonymization. Since the benchmark would take quite a long time with the whole MaStR dataset, maybe you could manually choose a subset? You can have a look at the zipped folder that is downloaded to ~/.open-mastr/data/xml-download and choose a subset of xml files from there as your benchmark; one possible way to build such a subset is sketched below.
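For example (a minimal sketch; the zip file name is a placeholder, since the real export file in that folder is date-stamped):

```python
import zipfile
from pathlib import Path

# Placeholder name: the actual export zip in xml-download is date-stamped.
src = Path.home() / ".open-mastr" / "data" / "xml-download" / "Gesamtdatenexport.zip"
dst = src.with_name("Gesamtdatenexport_subset.zip")

with zipfile.ZipFile(src) as full, zipfile.ZipFile(dst, "w") as subset:
    # Copy only the first few xml files so the benchmark stays small and fast.
    for name in full.namelist()[:10]:
        subset.writestr(name, full.read(name))
```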
A quick solution would be to call `db.download(date="existing")` to use the already downloaded folder (the one you manipulated). That should work as a benchmark, but if you have better ideas I'm also fine with that.
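In context, that call would look roughly like this (a minimal sketch, assuming the standard open_mastr entry point):

```python
from open_mastr import Mastr

db = Mastr()
# date="existing" reuses the zip that is already present in
# ~/.open-mastr/data/xml-download instead of downloading a fresh dump.
db.download(date="existing")
```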
AlexandraImbrisca replied:

Sounds great! Thanks a lot for responding so fast! :) I'll create a few smaller datasets and include them in the benchmark. I'll open a pull request with the benchmark today and add you as a reviewer.
@FlorianK13 I created the pull request here: https://github.com/AlexandraImbrisca/open-MaStR/pull/2. Could you please add yourself as a reviewer? GitHub doesn't allow me to add you; it might be an access permissions issue, so I've just invited you as a collaborator :)