Issue Description:
Hello.
I have discovered a performance degradation in the read_csv function of pandas version 1.3.4 when handling CSV files with a large number of columns. The issue increases the loading time from just a few seconds under the previous version 1.2.5 to several minutes, roughly a 60x slowdown. I found related discussions on GitHub, including #44106 and #44192.
I noticed that your repository uses the affected API.
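For what it's worth, a minimal guard like the one below (the exact message is just an illustration, not something from pandas itself) could flag the affected release before loading wide CSV files:

import pandas

# Hypothetical check: warn if the slow release is installed.
if pandas.__version__ == "1.3.4":
    print("warning: pandas 1.3.4 reads wide CSVs slowly; consider upgrading to >= 1.3.5")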
Steps to Reproduce:
I have created a small reproducible example to better illustrate this issue.
# v1.3.4
import os
import pandas
import numpy
import timeit

def generate_sample():
    # Write a small gzipped CSV with 100,000 feature columns and 5 rows.
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    # Read the header first to build a per-column dtype mapping, then load the full file.
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)

generate_sample()
timeit.timeit(load_csv_file, number=1)
# results (timeit returns seconds)
loaded dataframe shape: (5, 100000)
120.37690759263933
# v1.3.5
import os
import pandas
import numpy
import timeit

def generate_sample():
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)

generate_sample()
timeit.timeit(load_csv_file, number=1)
# results (timeit returns seconds)
loaded dataframe shape: (5, 100000)
2.8567268839105964
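If it helps, a slightly more robust comparison can repeat the measurement a few times on top of the script above; this is just a sketch, with load_csv_file defined as in the example.

# Optional: repeat the timing to reduce noise from filesystem caching, etc.
import timeit
times = timeit.repeat(load_csv_file, number=1, repeat=3)
print("best of 3 runs (seconds):", min(times))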
Suggestion:
I would recommend upgrading to pandas >= 1.3.5, where the loading time drops back to a few seconds as shown above, or exploring other ways to optimize CSV loading performance.
Any other workarounds or solutions would be greatly appreciated.
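One possible workaround, if upgrading is not an option: skip the 100,000-entry per-column dtype dict, let read_csv infer the dtypes, and downcast afterwards. This is only a sketch that assumes the slowdown is tied to passing the very large dtype mapping; I have not verified that it avoids the regression, and load_csv_file_no_dtype_dict is just an illustrative helper name.

import numpy
import pandas

def load_csv_file_no_dtype_dict():
    # Load without a per-column dtype dict; infer dtypes, then downcast the features.
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample",
                                 na_filter=False, low_memory=False)
    feature_df = feature_df.astype(numpy.float32)
    print("loaded dataframe shape:", feature_df.shape)
    return feature_df

The tradeoff is that the frame is first held in float64 before the cast, so peak memory usage is higher.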
Thank you!