Closed FlorianK13 closed 1 year ago
I compared an older pandas.read_xml() with the new 2.0 version which uses pyarrow.
Here is the code I ran multiple times to read the first 100 xml_files from "Gesamtdatenexport_20230427.zip" with both variants each:
from zipfile import ZipFile
import pandas as pd
import os
import time
path = os.path.join(os.path.expanduser("~"), ".open-MaStR", "data", "xml_download", "Gesamtdatenexport_20230427.zip")
with ZipFile(path, "r") as f:
files_list = f.namelist()
for k in range(10):
start_time_total = time.time()
measurements = [0] * 10
for j in range(10):
start_time = time.time()
for i in range(10):
data = f.read(files_list[i+j*10])
df = pd.read_xml(data, encoding="UTF-16", compression="zip", dtype_backend="pyarrow")
end_time = time.time()
measurements[j] = int(end_time - start_time)
end_time_total = time.time()
print(measurements)
total = end_time_total - start_time_total
print(f"Total execution time: {round(total/60, 2)} minutes")
(older variant does not set dtype_backend to pyarrow of course)
Here are the results in an environment with pandas version 1.5.3:
[40, 52, 53, 29, 24, 33, 157, 155, 146, 88]
Total execution time: 13.0 minutes
[43, 51, 47, 28, 23, 31, 150, 151, 145, 87]
Total execution time: 12.7 minutes
[42, 51, 46, 28, 24, 31, 147, 154, 143, 85]
Total execution time: 12.61 minutes
[42, 50, 47, 27, 24, 32, 150, 150, 143, 86]
Total execution time: 12.59 minutes
[41, 52, 46, 28, 23, 31, 148, 149, 141, 87]
Total execution time: 12.52 minutes
[42, 50, 46, 28, 23, 31, 149, 150, 143, 86]
Total execution time: 12.52 minutes
[41, 51, 46, 28, 22, 31, 148, 150, 142, 85]
Total execution time: 12.51 minutes
[42, 51, 46, 28, 23, 33, 150, 152, 144, 87]
Total execution time: 12.69 minutes
[42, 51, 47, 28, 24, 31, 150, 150, 145, 86]
Total execution time: 12.66 minutes
[42, 51, 46, 28, 23, 31, 150, 149, 142, 86]
Total execution time: 12.55 minutes
[42, 51, 47, 27, 23, 31, 149, 149, 143, 85]
Total execution time: 12.54 minutes
[43, 53, 48, 28, 23, 31, 151, 152, 142, 88]
Total execution time: 12.74 minutes
[50, 56, 51, 32, 25, 37, 165, 165, 149, 89]
Total execution time: 13.74 minutes
[34, 49, 50, 30, 24, 33, 163, 169, 156, 94]
Total execution time: 13.45 minutes
[49, 57, 53, 32, 26, 36, 169, 175, 174, 98]
Total execution time: 14.54 minutes
[47, 57, 54, 32, 25, 36, 167, 171, 170, 99]
Total execution time: 14.38 minutes
[50, 59, 52, 33, 26, 34, 166, 165, 157, 98]
Total execution time: 14.1 minutes
[50, 56, 51, 31, 26, 35, 164, 169, 170, 102]
Total execution time: 14.31 minutes
[46, 56, 51, 31, 25, 34, 165, 166, 158, 95]
Total execution time: 13.89 minutes
[46, 55, 51, 31, 26, 34, 164, 168, 157, 95]
Total execution time: 13.87 minutes
[47, 55, 54, 31, 26, 34, 164, 166, 159, 95]
Total execution time: 13.92 minutes
[44, 54, 49, 29, 24, 33, 159, 164, 157, 94]
Total execution time: 13.56 minutes
[44, 54, 49, 30, 24, 33, 157, 161, 150, 92]
Total execution time: 13.32 minutes
[48, 59, 53, 35, 25, 34, 165, 162, 152, 94]
Total execution time: 13.88 minutes
[47, 57, 52, 31, 26, 35, 170, 172, 164, 97]
Total execution time: 14.23 minutes
[47, 57, 52, 32, 28, 40, 169, 168, 241, 101]
Total execution time: 15.64 minutes
[44, 53, 49, 29, 24, 33, 160, 170, 172, 97]
Total execution time: 13.93 minutes
[43, 53, 48, 28, 24, 32, 152, 154, 150, 89]
Total execution time: 12.96 minutes
[39, 48, 44, 27, 22, 30, 146, 150, 144, 87]
Total execution time: 12.35 minutes
Here are the results in an environment with pandas version 2.0.1 having pyarrow:
[48, 57, 54, 35, 31, 35, 153, 155, 143, 90]
Total execution time: 13.44 minutes
[48, 57, 56, 35, 30, 35, 154, 153, 143, 90]
Total execution time: 13.43 minutes
[48, 57, 53, 35, 30, 35, 153, 158, 149, 89]
Total execution time: 13.53 minutes
[48, 58, 53, 35, 30, 35, 152, 153, 142, 88]
Total execution time: 13.31 minutes
[47, 57, 55, 35, 30, 34, 152, 152, 142, 88]
Total execution time: 13.27 minutes
[48, 56, 53, 36, 30, 35, 152, 152, 141, 88]
Total execution time: 13.26 minutes
[48, 56, 52, 34, 30, 34, 151, 152, 140, 87]
Total execution time: 13.13 minutes
[47, 56, 67, 35, 29, 34, 149, 149, 139, 87]
Total execution time: 13.3 minutes
[46, 56, 52, 34, 30, 34, 147, 147, 139, 86]
Total execution time: 12.93 minutes
[45, 57, 54, 35, 32, 36, 150, 148, 136, 85]
Total execution time: 13.04 minutes
[48, 56, 53, 35, 30, 36, 154, 154, 145, 88]
Total execution time: 13.4 minutes
[51, 61, 57, 37, 32, 37, 162, 163, 149, 93]
Total execution time: 14.11 minutes
[50, 61, 56, 37, 34, 37, 162, 161, 150, 95]
Total execution time: 14.13 minutes
[52, 62, 57, 37, 33, 37, 163, 162, 152, 94]
Total execution time: 14.21 minutes
[54, 64, 60, 39, 34, 39, 168, 166, 155, 95]
Total execution time: 14.66 minutes
[54, 65, 60, 40, 34, 40, 175, 173, 160, 99]
Total execution time: 15.09 minutes
[55, 67, 63, 39, 34, 40, 173, 175, 163, 99]
Total execution time: 15.23 minutes
[54, 64, 60, 39, 34, 39, 176, 174, 162, 101]
Total execution time: 15.12 minutes
[38, 52, 53, 34, 31, 42, 161, 164, 159, 98]
Total execution time: 13.92 minutes
[43, 52, 57, 35, 31, 36, 158, 156, 149, 91]
Total execution time: 13.56 minutes
[50, 60, 56, 36, 32, 36, 157, 158, 149, 90]
Total execution time: 13.79 minutes
[50, 60, 56, 37, 31, 37, 158, 161, 145, 91]
Total execution time: 13.86 minutes
[49, 59, 54, 36, 31, 38, 156, 157, 147, 92]
Total execution time: 13.73 minutes
[51, 59, 56, 36, 31, 35, 157, 156, 149, 90]
Total execution time: 13.73 minutes
[49, 58, 55, 36, 31, 36, 158, 157, 146, 90]
Total execution time: 13.67 minutes
[50, 58, 55, 38, 31, 36, 157, 156, 146, 91]
Total execution time: 13.72 minutes
[50, 59, 55, 36, 32, 36, 159, 159, 147, 90]
Total execution time: 13.81 minutes
[51, 60, 55, 36, 32, 36, 159, 158, 147, 91]
Total execution time: 13.84 minutes
From my results there is no significant difference between the two variants, though you could say that the new variant is more consistent.
The following Post mentions an increased performance of read_csv() using pyarrow, it would be worth checking out, if read_csv() is being used. https://medium.datadriveninvestor.com/pandas-2-0-and-pyarrow-vs-pandas-2-0-and-numpy-e24a79404b41
Thanks @Grazvy We mainly use the read_xml and not the read_csv, hence there we can not really get any performance increase. Looking at your results we do not need to change the backend to pyarrow.
Description of the issue
The bulk download, import and cleansing is really slow. We should check if either polars or pandas 2.0 have solutions that speed up the process.
Workflow checklist