datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Speed Improvements #179

Closed cschloer closed 2 months ago

cschloer commented 3 months ago

Overview

Hi @akariv,

A little background:

I have a ~50MB file (500,000 lines) that I'm using for testing. It takes about 90 seconds to process in a Flow that has a single load step and a single dump step. Both are custom steps, though they are copied pretty closely from the standard load and standard dump steps. In the load step, the file is first loaded entirely into memory, and then that in-memory IO object is passed farther down the pipeline. In the dump step, the file is uploaded to S3. (I'm actually doing a multipart upload as the rows stream in, which goes much faster, but I removed that to make this explanation easier. Additionally, the download and upload go much faster when I run the code on AWS servers.)
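For concreteness, here is a minimal sketch of the shape of that Flow, using the standard load and dump_to_path processors as stand-ins for the custom steps (the file name and output path are hypothetical):

```python
from dataflows import Flow, load, dump_to_path

# One load step, one dump step - the custom steps in my pipeline
# closely follow these standard processors.
Flow(
    load('data.csv'),        # hypothetical stand-in for the custom in-memory load
    dump_to_path('output'),  # hypothetical stand-in for the custom S3 dump
).process()
```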

The network IO for loading the file takes 11 seconds. The network IO for uploading the file takes 47 seconds. What is happening during the remaining 32 seconds?

I've done some profiling of the for row in rows loop in the dump step: every 10k rows takes about 0.6 seconds. The CSV writer accounts for 0.12 seconds of that, but the other 80% of the time is unaccounted for. I'm not doing any other processing, so I'm confused about what is making it so slow!
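For illustration, this is roughly the kind of per-batch timing I mean, as a minimal sketch (the rows iterator and the writer are placeholders for what the custom dump step actually uses):

```python
import time

def timed(rows, writer, batch_size=10_000):
    # Wrap the dump step's row loop and time each 10k-row batch.
    start = time.perf_counter()
    for i, row in enumerate(rows, start=1):
        writer.writerow(list(row.values()))  # the CSV write: ~0.12s per batch
        if i % batch_size == 0:
            print(f'{batch_size} rows in {time.perf_counter() - start:.2f}s')
            start = time.perf_counter()
        yield row
```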

Do you have any insight into what might be causing this? At first I thought it might be related to using iterators instead of lists, but from what I've read, iterators should actually improve performance (and not just the memory requirements). And it's certainly not an IO issue, since the file is loaded entirely into memory at the beginning of the flow.

Relatedly, if you have any other ideas for low-hanging fruit for performance improvements, I'd be happy to take a swing at implementing them.



akariv commented 3 months ago

Hey @cschloer, I will have to do some profiling of my own to give better answers - I'll try to do that in the upcoming week and update here. By the way, based on your previous issue I made some performance improvements in the kvfile library: by default it now stores data in memory, and only after passing a threshold (~10K items) does it start putting data on disk.
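As a rough illustration of that spill-to-disk pattern (a sketch only, not the actual kvfile implementation):

```python
import os
import shelve
import tempfile

class SpillingStore:
    # Sketch of the behavior described above: keep items in a dict,
    # and move to a disk-backed store once a threshold is crossed.
    # Assumes string keys. Not the actual kvfile code.
    THRESHOLD = 10_000

    def __init__(self):
        self.mem = {}
        self.disk = None

    def set(self, key, value):
        if self.disk is None and len(self.mem) >= self.THRESHOLD:
            # Crossed the threshold: migrate everything to disk.
            path = os.path.join(tempfile.mkdtemp(), 'spill')
            self.disk = shelve.open(path)
            self.disk.update(self.mem)
            self.mem.clear()
        if self.disk is not None:
            self.disk[key] = value
        else:
            self.mem[key] = value

    def get(self, key):
        if self.disk is not None:
            return self.disk[key]
        return self.mem[key]
```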

cschloer commented 3 months ago

Thanks @akariv , I appreciate it! I'll check out the change you made to the kvfile library.

akariv commented 3 months ago

@cschloer

I did some tests on a scenario similar to the one you described. In that simple case, the majority of the time goes to data validation and conversion. (Data validation happens in the validate processor, when you're using a file dumper (e.g. dump_to_path), or when calling Flow.results().) I did some work on the underlying libraries so that some data types should be handled faster (namely strings, integers, and numbers). Also, if you are calling Flow.results() rather than Flow.process(), you can now pass on_error=None to disable that extra validation there.
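For example (file name hypothetical):

```python
from dataflows import Flow, load

# results() runs an extra validation pass over the returned rows;
# passing on_error=None disables it, per the change described above.
results = Flow(load('data.csv')).results(on_error=None)
```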

All this goodness is in the latest 0.5.5 release :) lmk if that improves things in your setup!

cschloer commented 2 months ago

What used to take 32 seconds now takes 20 seconds. Pretty big improvement. Thanks!