datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Loading many, many files at once #162

Closed: cschloer closed this issue 2 years ago

cschloer commented 2 years ago

Hey,

I've got an issue where lots of small files (600+) are loaded together in one pipeline, using essentially 600+ load steps. There seems to be a memory leak: even a worker with 32GB of memory is unable to process the pipeline. I think I've narrowed it down to the stream being opened here: https://github.com/datahq/dataflows/blob/master/dataflows/processors/load.py#L172
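Roughly, the pipeline is built like this (a minimal sketch using the standard `Flow`/`load` API; the file paths and the `dump_to_path` step are just placeholders):

```python
# Sketch of the pipeline shape described above:
# one load step per input file, all chained into a single Flow.
from dataflows import Flow, load, dump_to_path

file_paths = [f'data/part-{i:04d}.csv' for i in range(600)]  # placeholder paths

Flow(
    *[load(path) for path in file_paths],  # 600+ load processors in one pipeline
    dump_to_path('out'),
).process()
```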

I believe the issue is that the process_datapackage function is called for EVERY load step first, before process_resource is called, and therefore before the stream is actually closed.

I think closing the stream inside the process_datapackage function, and then reopening it in the process_resource function, would fix my problem.
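What I have in mind is a lazy-open pattern along these lines (a conceptual sketch only, not the actual load.py code): remember the source during the metadata phase, and only hold an open stream while rows are actually being pulled.

```python
# Conceptual sketch: the source is recorded during the metadata phase
# (process_datapackage), but the underlying stream is only opened once
# rows are pulled (process_resource) and is closed as soon as the
# resource is exhausted.
class LazySource:
    def __init__(self, path):
        self.path = path      # recorded up front, nothing opened yet
        self._stream = None

    def rows(self):
        self._stream = open(self.path)   # opened lazily, per resource
        try:
            for line in self._stream:
                yield line.rstrip('\n')
        finally:
            self._stream.close()         # released when this resource is done
```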

Before I implement the fix, I wanted to check in and get your thoughts on whether my diagnosis actually makes any sense. Is it possible that these 600+ open streams are causing an issue? Is there something else in process_datapackage that could feasibly be causing this issue?

cschloer commented 2 years ago

@roll @akariv

cschloer commented 2 years ago

Actually, I don't think it has to do with keeping the stream open. If I comment that line out, the memory usage still slowly rises as it goes through each process_datapackage call. Is there some underlying issue with running 600+ processors in a single pipeline?

roll commented 2 years ago

It's quite an edge case. I think it needs to be profiled; it's hard to say without that.
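For example, something along these lines should show where the memory goes (a rough sketch using the standard-library tracemalloc module; the pipeline and paths are placeholders for your actual flow):

```python
# Rough profiling sketch: run the many-load-steps pipeline under tracemalloc
# and print the top allocation sites by source line.
import tracemalloc
from dataflows import Flow, load

tracemalloc.start()

Flow(*[load(f'data/part-{i:04d}.csv') for i in range(600)]).process()

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)  # e.g. lines inside dataflows/processors/load.py
```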

akariv commented 2 years ago

Hey @cschloer - just wondering, have you tried using the sources processor, which was built exactly for this purpose? I wonder how it would behave in your use case and whether the memory issues persist.

https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#sources

cschloer commented 2 years ago

Aha, no I haven't; I think that's because I haven't updated my dataflows since it was implemented! Thanks @akariv, I'll check it out.