Closed lukapeschke closed 7 months ago
As expected, this has a small performance impact time-wise. However, the cost seems acceptable to me. If it seems unacceptable to others, I'm open for a refactoring to make this optional.
The good news is that there is no impact memory-wise. However, iterating over all data in the sheet to build the schema appears to cost some time (note the flat memory usage that appears in the charts just after the first usage spike).
Some ideas that could help to mitigate this:
Using the following script, here are the results of benchmarks on my machine (sheet of 41 columns x ~280k rows):
import argparse
import fastexcel
def get_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("file")
return parser.parse_args()
def main():
args = get_args()
excel_file = fastexcel.read_excel(args.file)
for sheet_name in excel_file.sheet_names:
excel_file.load_sheet_by_name(sheet_name).to_arrow()
if __name__ == "__main__":
main()
create_string_array
cc @ldacey @deanm0000 @alexander-beedie in case you want to have a look :)
Nice - yeah, being able to cast the type ahead of time or change the number of rows used for inference could be useful. I have raised an issue for polars before because a column which looked like an integer actually had some text values beyond the default row inference. In this case it was the 622nd row, but you can imagine a large file that has been sorted where all values with letters or pure numbers are grouped at the bottom of the file.
import polars as pl
import tempfile
case_numbers = [f"{i:06d}" for i in range(1, 1001)]
test = ["test" for _ in range(1, 1001)]
df = pl.DataFrame({'test': test, 'case_number': case_numbers})
df[622, "case_number"] = "CASE-NO-0A60"
with tempfile.NamedTemporaryFile(delete=True, suffix=".csv") as temp_file:
filename = temp_file.name
df.write_csv(filename)
pl.read_csv(filename, n_rows=1, infer_schema_length=623)
I'll review when I'm back from holidays @lukapeschke Super glad fastexcel has been added to polars ecosystem as we first planned ;) Great job @lukapeschke @alexander-beedie
@PrettyWood sure no worries :slightly_smiling_face: I've released 0.8.0 in the meantime so that people can try out the windows wheels
@PrettyWood last changes add:
ExcelDateTime
)schema_sample_rows
param
closes #160