AlexandraImbrisca opened 1 week ago

In order to evaluate different parsing strategies, we should create a benchmark that measures their speed. It should take into account:
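As a rough illustration of the shape such a benchmark could take, here is a minimal timing sketch (the `strategies` mapping and the `parse` callables are hypothetical placeholders, not existing open-MaStR code):

```python
import time

def run_benchmark(strategies, xml_folder):
    """Time each parsing strategy on the same subset of xml files.

    `strategies` maps a strategy name to a callable; both are
    hypothetical stand-ins for the approaches we end up comparing.
    """
    results = {}
    for name, parse in strategies.items():
        start = time.perf_counter()
        parse(xml_folder)
        results[name] = time.perf_counter() - start
    return results
```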
Hi @FlorianK13! I analysed our options for creating the benchmark and I'd like to hear your thoughts :)
Option 1: Using fake data
Option 2: Using the existing datasets as a base
@FlorianK13 what do you think? 🤔 If you think there is a need for option 1, I can definitely prioritize that :)
FlorianK13 replied:

Option 2 seems good. It is less work, and all the data is open data, hence there is no need for anonymization. Since the benchmark would take quite a long time with the whole MaStR dataset, maybe you could manually choose a subset? You can have a look at the zipped folder that is downloaded to ~/.open-mastr/data/xml-download and choose a subset of xml files from there as your benchmark; one possible way to build such a subset is sketched below.
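For example (a minimal sketch; the zip file name is a placeholder, since the real export file in that folder is date-stamped):

```python
import zipfile
from pathlib import Path

# Placeholder name: the actual export zip in xml-download is date-stamped.
src = Path.home() / ".open-mastr" / "data" / "xml-download" / "Gesamtdatenexport.zip"
dst = src.with_name("Gesamtdatenexport_subset.zip")

with zipfile.ZipFile(src) as full, zipfile.ZipFile(dst, "w") as subset:
    # Copy only the first few xml files so the benchmark stays small and fast.
    for name in full.namelist()[:10]:
        subset.writestr(name, full.read(name))
```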
A quick solution would be to call `db.download(date="existing")` to use the already downloaded folder (the one you manipulated). That should work as a benchmark, but if you have better ideas I'm also fine with that.
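In context, that call would look roughly like this (a minimal sketch, assuming the standard open_mastr entry point):

```python
from open_mastr import Mastr

db = Mastr()
# date="existing" reuses the zip that is already present in
# ~/.open-mastr/data/xml-download instead of downloading a fresh dump.
db.download(date="existing")
```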
AlexandraImbrisca replied:

Sounds great! Thanks a lot for responding so fast! :) I'll create a few smaller datasets and include them in the benchmark. I'll open a pull request with the benchmark today and add you as a reviewer.
@FlorianK13 I created the pull request here: https://github.com/AlexandraImbrisca/open-MaStR/pull/2. Could you please add yourself as a reviewer? GitHub doesn't allow me to add you; it might be an access permissions issue, so I've just invited you as a collaborator :)