datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Use of KVFile? #178

Closed cschloer closed 3 months ago

cschloer commented 3 months ago

Overview

Hey @akariv -

What is the purpose of using the kvfile library in the join processor? I just replaced it with a native Python dictionary, and a pipeline with multiple join steps ended up running significantly faster (from ~5 minutes down to 5 seconds; I confirmed that the output files were the same afterwards). Am I missing something when I remove it?
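For illustration, the dictionary-based approach looks roughly like the sketch below. It is not the dataflows join code; `build_index`, `join`, and the sample rows are hypothetical names made up for this example, and it assumes the whole source table fits in RAM.

```python
# Minimal sketch of an in-memory join index (not the dataflows implementation).
# All function names and sample rows are hypothetical.

def build_index(rows, key):
    """Map each key value to its source row, held entirely in RAM."""
    index = {}
    for row in rows:
        index[row[key]] = row  # later rows overwrite earlier ones with the same key
    return index


def join(target_rows, index, key, fields):
    """Yield target rows enriched with the selected fields from the index."""
    for row in target_rows:
        match = index.get(row[key], {})
        # Fields with no matching source row are filled with None.
        yield {**row, **{f: match.get(f) for f in fields}}


source = [{"id": 1, "city": "Berlin"}, {"id": 2, "city": "Haifa"}]
target = [{"id": 1, "value": 10}, {"id": 3, "value": 30}]

joined = list(join(target, build_index(source, "id"), "id", ["city"]))
```

Every lookup is a plain dict access, which is where the speed comes from; the cost is that the index grows with the size of the source data.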

My only suspicion is that it would help if you were doing some kind of multiprocessing, but that's not currently happening (at least with datapackage pipelines).

Thanks!



akariv commented 3 months ago

Hey @cschloer - the purpose is to be able to process large amounts of data without being limited by the machine's RAM. Using a native in-memory dict is indeed much faster, but it might cause the program to choke or fail if there isn't enough memory.
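To make the trade-off concrete, here is a rough sketch of a disk-backed key-value store built on Python's standard `sqlite3` module. It is not the kvfile code; `DiskStore` and its methods are hypothetical names chosen for this example, and values are assumed to be JSON-serializable rows.

```python
import json
import os
import sqlite3
import tempfile


class DiskStore:
    """Hypothetical disk-backed key-value store; values are JSON-encoded rows."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)"
        )

    def set(self, key, value):
        # Each write goes through SQLite rather than staying in a Python dict.
        self.db.execute(
            "INSERT OR REPLACE INTO kv VALUES (?, ?)",
            (str(key), json.dumps(value)),
        )

    def get(self, key, default=None):
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (str(key),)
        ).fetchone()
        return json.loads(row[0]) if row else default

    def close(self):
        self.db.commit()
        self.db.close()


tmpdir = tempfile.mkdtemp()
store = DiskStore(os.path.join(tmpdir, "index.db"))
store.set(1, {"id": 1, "city": "Berlin"})
found = store.get(1)
missing = store.get(2)
store.close()
```

Each `set` and `get` pays for serialization and a database call, so it is slower than a dict lookup, but the index lives in a file and is not bounded by available RAM.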

It might be a good idea to allow the user to specify if an in-memory join is preferable to achieve faster processing.

By the way, the basic KVFile implementation uses SQLite, which is commonly available but slow. If leveldb is installed on the machine, it will use that instead, as it is ~3 times faster (in my experience, ymmv...).

cschloer commented 3 months ago

Thanks for the quick reply, good to know! I'll switch to using dicts for now and, if I run into memory issues, look into the leveldb solution.