Open FlorianK13 opened 4 months ago
Hi! I started working on this task and decided to change the steps slightly:
I was at DACH Energy Informatics Conference and took two points from there:
- Many researchers use open-mastr
- The feature request I heard most often was whether we can decrease the time it takes to download and parse the data.

I think people will be really happy if this issue is successful 😃
Sounds great. I think we cannot do much about the dl speed but I'm really looking forward to the parsing enhancement @AlexandraImbrisca :smile:
This task contains several steps:
1. Search for different ways that might increase parsing speed. Parsing is currently done by the `pandas.read_xml` method here. Several alternatives, such as `polars`, `duckdb`, and `pyspark`, might have XML parsers and might be faster.
2. Writing to the sqlite database is currently done by `pandas.to_sql` here. There might be other, faster methods depending on step 1.
3. Construct a benchmark in its own repository. Use a benchmark XML file from the Marktstammdatenregister and test different implementations for parsing it.
4. Decide on the best method and implement it in `open-mastr`.
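
To make the benchmark idea a bit more concrete, here is a minimal sketch of what one candidate and its timing harness could look like. It only uses the Python standard library (`xml.etree.ElementTree.iterparse` for streaming parsing and `sqlite3.executemany` for bulk inserts), so it is meant as one possible baseline and not as the method `open-mastr` uses or should end up with. The tag names (`Units`, `Unit`, `Id`, `Name`, `Power`), the table name `units`, and all function names are hypothetical; a real benchmark would read an actual file from the Marktstammdatenregister instead of the synthetic data generated here.

```python
import io
import sqlite3
import time
import xml.etree.ElementTree as ET


def make_sample_xml(n_rows):
    """Build a synthetic XML document (hypothetical schema, not the real MaStR layout)."""
    rows = "".join(
        f"<Unit><Id>{i}</Id><Name>unit_{i}</Name><Power>{i * 1.5}</Power></Unit>"
        for i in range(n_rows)
    )
    return f"<Units>{rows}</Units>".encode("utf-8")


def parse_with_iterparse(xml_bytes, record_tag="Unit"):
    """Stream over the document and collect one dict per record element."""
    records = []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == record_tag:
            records.append({child.tag: child.text for child in elem})
            elem.clear()  # drop the record's children once they have been read
    return records


def write_to_sqlite(conn, records, table="units"):
    """Bulk-insert all records with executemany inside a single transaction."""
    columns = list(records[0].keys())
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_list})")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})",
            ([rec[c] for c in columns] for rec in records),
        )


def timed(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    xml_bytes = make_sample_xml(1000)
    records, parse_seconds = timed(parse_with_iterparse, xml_bytes)
    conn = sqlite3.connect(":memory:")
    _, write_seconds = timed(write_to_sqlite, conn, records)
    print(f"parsed {len(records)} records in {parse_seconds:.4f} s")
    print(f"wrote them to sqlite in {write_seconds:.4f} s")
```

Other candidates from step 1 could be plugged into the same `timed` helper, so that every implementation is measured on the same input file and the parsing and writing steps are timed separately.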