Use Polars or Pandas 2.0 functionality for speed up

FlorianK13 commented 1 year ago

Description of the issue

The bulk download, import and cleansing is really slow. We should check if either polars or pandas 2.0 have solutions that speed up the process.

Workflow checklist

[x] I am aware of the workflow in CONTRIBUTING.md

Grazvy commented 1 year ago

I compared an older pandas.read_xml() with the new 2.0 version which uses pyarrow.

Here is the code I ran multiple times to read the first 100 xml_files from "Gesamtdatenexport_20230427.zip" with both variants each:

from zipfile import ZipFile
import pandas as pd
import os
import time

path = os.path.join(os.path.expanduser("~"), ".open-MaStR", "data", "xml_download", "Gesamtdatenexport_20230427.zip")
with ZipFile(path, "r") as f:
    files_list = f.namelist()

    for k in range(10):
        start_time_total = time.time()
        measurements = [0] * 10

        for j in range(10):
            start_time = time.time()

            for i in range(10):
                data = f.read(files_list[i+j*10])
                df = pd.read_xml(data, encoding="UTF-16", compression="zip", dtype_backend="pyarrow")

            end_time = time.time()

            measurements[j] = int(end_time - start_time)

        end_time_total = time.time()
        print(measurements)

        total = end_time_total - start_time_total
        print(f"Total execution time: {round(total/60, 2)} minutes")

(older variant does not set dtype_backend to pyarrow of course)

Here are the results in an environment with pandas version 1.5.3:

[40, 52, 53, 29, 24, 33, 157, 155, 146, 88]
Total execution time: 13.0 minutes

[43, 51, 47, 28, 23, 31, 150, 151, 145, 87]
Total execution time: 12.7 minutes

[42, 51, 46, 28, 24, 31, 147, 154, 143, 85]
Total execution time: 12.61 minutes

[42, 50, 47, 27, 24, 32, 150, 150, 143, 86]
Total execution time: 12.59 minutes

[41, 52, 46, 28, 23, 31, 148, 149, 141, 87]
Total execution time: 12.52 minutes

[42, 50, 46, 28, 23, 31, 149, 150, 143, 86]
Total execution time: 12.52 minutes

[41, 51, 46, 28, 22, 31, 148, 150, 142, 85]
Total execution time: 12.51 minutes

[42, 51, 46, 28, 23, 33, 150, 152, 144, 87]
Total execution time: 12.69 minutes

[42, 51, 47, 28, 24, 31, 150, 150, 145, 86]
Total execution time: 12.66 minutes

[42, 51, 46, 28, 23, 31, 150, 149, 142, 86]
Total execution time: 12.55 minutes

[42, 51, 47, 27, 23, 31, 149, 149, 143, 85]
Total execution time: 12.54 minutes

[43, 53, 48, 28, 23, 31, 151, 152, 142, 88]
Total execution time: 12.74 minutes

[50, 56, 51, 32, 25, 37, 165, 165, 149, 89]
Total execution time: 13.74 minutes

[34, 49, 50, 30, 24, 33, 163, 169, 156, 94]
Total execution time: 13.45 minutes

[49, 57, 53, 32, 26, 36, 169, 175, 174, 98]
Total execution time: 14.54 minutes

[47, 57, 54, 32, 25, 36, 167, 171, 170, 99]
Total execution time: 14.38 minutes

[50, 59, 52, 33, 26, 34, 166, 165, 157, 98]
Total execution time: 14.1 minutes

[50, 56, 51, 31, 26, 35, 164, 169, 170, 102]
Total execution time: 14.31 minutes

[46, 56, 51, 31, 25, 34, 165, 166, 158, 95]
Total execution time: 13.89 minutes

[46, 55, 51, 31, 26, 34, 164, 168, 157, 95]
Total execution time: 13.87 minutes

[47, 55, 54, 31, 26, 34, 164, 166, 159, 95]
Total execution time: 13.92 minutes

[44, 54, 49, 29, 24, 33, 159, 164, 157, 94]
Total execution time: 13.56 minutes

[44, 54, 49, 30, 24, 33, 157, 161, 150, 92]
Total execution time: 13.32 minutes

[48, 59, 53, 35, 25, 34, 165, 162, 152, 94]
Total execution time: 13.88 minutes

[47, 57, 52, 31, 26, 35, 170, 172, 164, 97]
Total execution time: 14.23 minutes

[47, 57, 52, 32, 28, 40, 169, 168, 241, 101]
Total execution time: 15.64 minutes

[44, 53, 49, 29, 24, 33, 160, 170, 172, 97]
Total execution time: 13.93 minutes

[43, 53, 48, 28, 24, 32, 152, 154, 150, 89]
Total execution time: 12.96 minutes

[39, 48, 44, 27, 22, 30, 146, 150, 144, 87]
Total execution time: 12.35 minutes

Here are the results in an environment with pandas version 2.0.1 having pyarrow:

[48, 57, 54, 35, 31, 35, 153, 155, 143, 90]
Total execution time: 13.44 minutes

[48, 57, 56, 35, 30, 35, 154, 153, 143, 90]
Total execution time: 13.43 minutes

[48, 57, 53, 35, 30, 35, 153, 158, 149, 89]
Total execution time: 13.53 minutes

[48, 58, 53, 35, 30, 35, 152, 153, 142, 88]
Total execution time: 13.31 minutes

[47, 57, 55, 35, 30, 34, 152, 152, 142, 88]
Total execution time: 13.27 minutes

[48, 56, 53, 36, 30, 35, 152, 152, 141, 88]
Total execution time: 13.26 minutes

[48, 56, 52, 34, 30, 34, 151, 152, 140, 87]
Total execution time: 13.13 minutes

[47, 56, 67, 35, 29, 34, 149, 149, 139, 87]
Total execution time: 13.3 minutes

[46, 56, 52, 34, 30, 34, 147, 147, 139, 86]
Total execution time: 12.93 minutes

[45, 57, 54, 35, 32, 36, 150, 148, 136, 85]
Total execution time: 13.04 minutes

[48, 56, 53, 35, 30, 36, 154, 154, 145, 88]
Total execution time: 13.4 minutes

[51, 61, 57, 37, 32, 37, 162, 163, 149, 93]
Total execution time: 14.11 minutes

[50, 61, 56, 37, 34, 37, 162, 161, 150, 95]
Total execution time: 14.13 minutes

[52, 62, 57, 37, 33, 37, 163, 162, 152, 94]
Total execution time: 14.21 minutes

[54, 64, 60, 39, 34, 39, 168, 166, 155, 95]
Total execution time: 14.66 minutes

[54, 65, 60, 40, 34, 40, 175, 173, 160, 99]
Total execution time: 15.09 minutes

[55, 67, 63, 39, 34, 40, 173, 175, 163, 99]
Total execution time: 15.23 minutes

[54, 64, 60, 39, 34, 39, 176, 174, 162, 101]
Total execution time: 15.12 minutes

[38, 52, 53, 34, 31, 42, 161, 164, 159, 98]
Total execution time: 13.92 minutes

[43, 52, 57, 35, 31, 36, 158, 156, 149, 91]
Total execution time: 13.56 minutes

[50, 60, 56, 36, 32, 36, 157, 158, 149, 90]
Total execution time: 13.79 minutes

[50, 60, 56, 37, 31, 37, 158, 161, 145, 91]
Total execution time: 13.86 minutes

[49, 59, 54, 36, 31, 38, 156, 157, 147, 92]
Total execution time: 13.73 minutes

[51, 59, 56, 36, 31, 35, 157, 156, 149, 90]
Total execution time: 13.73 minutes

[49, 58, 55, 36, 31, 36, 158, 157, 146, 90]
Total execution time: 13.67 minutes

[50, 58, 55, 38, 31, 36, 157, 156, 146, 91]
Total execution time: 13.72 minutes

[50, 59, 55, 36, 32, 36, 159, 159, 147, 90]
Total execution time: 13.81 minutes

[51, 60, 55, 36, 32, 36, 159, 158, 147, 91]
Total execution time: 13.84 minutes

From my results there is no significant difference between the two variants, though you could say that the new variant is more consistent.

Grazvy commented 1 year ago

The following Post mentions an increased performance of read_csv() using pyarrow, it would be worth checking out, if read_csv() is being used. https://medium.datadriveninvestor.com/pandas-2-0-and-pyarrow-vs-pandas-2-0-and-numpy-e24a79404b41

FlorianK13 commented 1 year ago

Thanks @Grazvy We mainly use the read_xml and not the read_csv, hence there we can not really get any performance increase. Looking at your results we do not need to change the backend to pyarrow.

OpenEnergyPlatform / open-MaStR

Use Polars or Pandas 2.0 functionality for speed up #441

Description of the issue

Workflow checklist