OpenTalker / StyleHEAT

[ECCV 2022] StyleHEAT: A framework for high-resolution editable talking face generation
MIT License
627 stars 77 forks source link

Performance Issue: Slow read_csv() Function with pandas Version 1.3.4 for CSV Files #50

Open TendouArisu opened 6 months ago

TendouArisu commented 6 months ago

Issue Description: Hello. I have discovered a performance degradation in the read_csv function of pandas version 1.3.4 when handling CSV files with a large number of columns. This problem significantly increases the loading time from just a few seconds in the previous version 1.2.5 to several minutes, almost 60x diff. I found some discussions on GitHub related to this issue, including #44106 and #44192. I found that third_part/Deep3DFaceRecon_pytorch/models/arcface_torch/eval_ijbc.py, third_part/Deep3DFaceRecon_pytorch/models/arcface_torch/onnx_ijbc.py and third_part/Deep3DFaceRecon_pytorch/models/arcface_torch/utils/plot.py both used the influenced api.

Steps to Reproduce:

I have created a small reproducible example to better illustrate this issue.

# v1.3.4
import os
import pandas
import numpy
import timeit

def generate_sample():
    if os.path.exists("test_small.csv.gz") == False:
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)
generate_sample()
timeit.timeit(load_csv_file, number=1)

# results
loaded dataframe shape: (5, 100000)
120.37690759263933
# v1.3.5
import os
import pandas
import numpy
import timeit
def generate_sample():
    if os.path.exists("test_small.csv.gz") == False:
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)

generate_sample()
timeit.timeit(load_csv_file, number=1)
# results
loaded dataframe shape: (5, 100000)
2.8567268839105964

Suggestion

I would recommend considering an upgrade to a different version of pandas >= 1.3.5 or exploring other solutions to optimize the performance of loading CSV files. Any other workarounds or solutions would be greatly appreciated. Thank you!