OpenEnergyPlatform / open-MaStR

A collaborative software to download the energy database Marktstammdatenregister (MaStR)
https://open-mastr.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Increase parsing speed #546

Open FlorianK13 opened 4 months ago

FlorianK13 commented 4 months ago

This task contains several steps:

  1. Research different ways to increase parsing speed. Parsing is currently done by the pandas.read_xml method here. Possible alternatives are:

    • polars, duckdb, or pyspark might have XML parsers that are faster
    • use plain XML parsing from Python (without pandas)
    • ...
  2. Writing to the SQLite database is currently done by pandas.to_sql here. There might be other, faster methods depending on the outcome of step 1.

  3. Construct a benchmark in a separate repository. Use a benchmark XML file from the Marktstammdatenregister and test different implementations for parsing it.

  4. Decide on the best method and implement it in open-mastr
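
To make the "plain XML parsing" alternative from step 1 and a non-pandas write path for step 2 more concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the open-mastr implementation: the XML layout, the tag names (`Units`, `Unit`, `Id`, `Power`), the table name, and the helper functions `iter_records` and `write_records` are all hypothetical and do not reflect the real MaStR schema.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; real MaStR files are far larger and use a different schema.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<Units>
  <Unit><Id>U1</Id><Power>10.5</Power></Unit>
  <Unit><Id>U2</Id><Power>3.0</Power></Unit>
</Units>
"""


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element while streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the children of the element we just processed


def write_records(conn, table, columns, records):
    """Bulk-insert records with executemany inside a single transaction."""
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
            ([rec.get(col) for col in columns] for rec in records),
        )


conn = sqlite3.connect(":memory:")
write_records(conn, "units", ["Id", "Power"],
              iter_records(io.BytesIO(SAMPLE_XML), "Unit"))
rows = conn.execute("SELECT Id, Power FROM units ORDER BY Id").fetchall()
print(rows)
```

Streaming with `iterparse` avoids building an intermediate DataFrame, and `executemany` sends all rows in one transaction. Whether this is actually faster than `pandas.read_xml` plus `pandas.to_sql` on MaStR data is exactly what the benchmark in step 3 would need to show.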

AlexandraImbrisca commented 1 month ago

Hi! I started working on this task and decided to change the steps slightly:

  1. Construct the benchmark
    • Use the Marktstammdatenregister to construct a few datasets of various sizes - ✅ (link)
    • Create a script to automate the calculation and comparison of the parsing speed between various optimisations - ✅ (link)
  2. Explore faster methods of parsing the XML
    • Research the options and implement the changes
    • Run the benchmark and analyse the results
  3. Explore faster methods of writing to the sqlite database
    • Research the options and implement the changes
    • Run the benchmark and analyse the results
  4. Decide on the best method and add it to this repository
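
A comparison script of the kind described in step 1 could be sketched roughly as follows. This is not the linked benchmark script; it is a small stand-in that assumes a synthetic XML layout and two hypothetical parser functions (`parse_full_tree`, `parse_streaming`), and it only shows the timing pattern.

```python
import io
import time
import xml.etree.ElementTree as ET


def make_xml(n_records):
    """Build a synthetic XML document (hypothetical layout, not the MaStR schema)."""
    rows = "".join(
        f"<Unit><Id>U{i}</Id><Power>{i}.5</Power></Unit>" for i in range(n_records)
    )
    return f"<Units>{rows}</Units>".encode("utf-8")


def parse_full_tree(data):
    """Parse the whole document into memory, then collect the records."""
    root = ET.fromstring(data)
    return [{c.tag: c.text for c in unit} for unit in root.iter("Unit")]


def parse_streaming(data):
    """Collect records incrementally while the document is being parsed."""
    records = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "Unit":
            records.append({c.tag: c.text for c in elem})
            elem.clear()
    return records


def benchmark(parsers, data, repeats=3):
    """Return the best wall-clock time in seconds for each parser."""
    results = {}
    for name, parser in parsers.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            parser(data)
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)  # the minimum is least affected by noise
    return results


data = make_xml(1000)
parsers = {"full_tree": parse_full_tree, "streaming": parse_streaming}
results = benchmark(parsers, data)
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```

Running each candidate on the same datasets of several sizes, and checking that all candidates return identical records, keeps the comparison fair before any timing numbers are interpreted.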

FlorianK13 commented 1 month ago

I was at the DACH Energy Informatics Conference and took two points from there:

  • Many researchers use open-mastr
  • The feature request I heard most often was whether we can reduce the time it takes to download and parse the data

I think people will be really happy if this issue is successful 😃

nesnoj commented 1 month ago

Sounds great. I think we cannot do much about the download speed, but I'm really looking forward to the parsing enhancement @AlexandraImbrisca 😄