OpenEnergyPlatform / open-MaStR

A collaborative software to download the energy database Marktstammdatenregister (MaStR)
https://open-mastr.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Increase parsing speed #546

Open FlorianK13 opened 1 month ago

FlorianK13 commented 1 month ago

This task contains several steps:

  1. Research approaches that might increase parsing speed. Parsing is currently done with the pandas.read_xml method here. Possible alternatives are:

    • polars, duckdb, or pyspark might offer XML parsers that are faster
    • use plain XML parsing from Python (without pandas)
    • ...
  2. Writing to the SQLite database is currently done with pandas' DataFrame.to_sql here. There might be faster methods, depending on the outcome of step 1.

  3. Build a benchmark in its own repository. Use a benchmark XML file from the Marktstammdatenregister and test different implementations for parsing it.

  4. Decide on the best method and implement it in open-mastr.
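
As a starting point for the benchmark, here is a minimal sketch of the "plain XML parsing from Python" option combined with a direct SQLite write, using only the standard library. It is not the current open-mastr implementation and makes no claim about being faster; that is what the benchmark would have to show. The sample XML, the tag names (`Units`, `Unit`, `Id`, `Power`), the table name, and the helper functions `iter_records`, `write_records`, and `run_once` are all hypothetical and only assume a flat layout where each record's child elements become columns.

```python
import io
import sqlite3
import time
import xml.etree.ElementTree as ET

# Hypothetical sample; real MaStR export files use their own tags and are far larger.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<Units>
  <Unit><Id>A1</Id><Power>10.5</Power></Unit>
  <Unit><Id>B2</Id><Power>3.0</Power></Unit>
  <Unit><Id>C3</Id></Unit>
</Units>
"""


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element while streaming through the file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            root.clear()  # drop already processed records to keep memory flat


def write_records(conn, table, columns, records):
    """Insert all records with executemany inside a single transaction."""
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
    with conn:  # commits once for the whole batch
        conn.executemany(
            f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
            # Missing child elements become NULL via dict.get
            ([rec.get(c) for c in columns] for rec in records),
        )


def run_once(xml_bytes):
    """Parse and write one file, returning the stored rows and the elapsed time."""
    conn = sqlite3.connect(":memory:")
    start = time.perf_counter()
    records = iter_records(io.BytesIO(xml_bytes), "Unit")
    write_records(conn, "units", ["Id", "Power"], records)
    elapsed = time.perf_counter() - start
    rows = conn.execute('SELECT "Id", "Power" FROM units ORDER BY "Id"').fetchall()
    conn.close()
    return rows, elapsed


rows, elapsed = run_once(SAMPLE_XML)
print(rows, f"{elapsed:.6f}s")
```

For the actual comparison, the same file could be timed through the existing pandas.read_xml and DataFrame.to_sql path and through candidates like this one, ideally over several runs and with a realistically sized file. Note that this sketch stores every value as text; type conversion would need to be added and included in the timing to keep the comparison fair.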