Closed: cschloer closed this issue 2 years ago
@roll @akariv
Actually, I don't think it has to do with keeping the stream open. If I comment that line out, the memory usage still slowly rises as it goes through each `process_datapackage` call. Is there some underlying issue with running 600+ processors in a single pipeline?
It's quite an edge case. I think it needs to be profiled; it's hard to say what's happening without that.
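For reference, one way to measure where the growth comes from is the standard library's `tracemalloc`. A minimal sketch, where `build_flow()` is a hypothetical stand-in for constructing the real 600-load pipeline:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

build_flow().process()  # hypothetical: builds and runs the 600-load pipeline

after = tracemalloc.take_snapshot()
# Print the ten call sites whose allocations grew the most during the run
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)
```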
Hey @cschloer - just wondering, have you tried using the `sources` processor, built exactly for this purpose?
I wonder how it would behave in your use case and if the memory issues persist or not.
https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#sources
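For comparison, a similar "many files through one step" shape can be approximated without `sources` by feeding a single generator into the flow, opening each file only while its rows are read. A sketch with placeholder paths (this is not the `sources` API itself, just the same idea expressed with plain dataflows and tabulator):

```python
from dataflows import Flow, dump_to_path
from tabulator import Stream

# Placeholder paths standing in for the 600+ small files
paths = [f'data/file_{i:03d}.csv' for i in range(600)]

def all_rows():
    # At most one file is open at any time; each stream is closed
    # before the next one is opened
    for p in paths:
        with Stream(p, headers=1) as stream:
            yield from stream.iter(keyed=True)

Flow(
    all_rows(),
    dump_to_path('out'),
).process()
```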
Aha, no I haven't - I think because I haven't updated my dataflows since it was implemented! Thanks @akariv, I'll check it out.
Hey,
I've got an issue where lots of small files (600+) are loaded together in a single pipeline, using essentially 600+ `load` steps. There seems to be a memory leak: even a worker with 32GB of memory is unable to process the pipeline. I think I've narrowed it down to the stream being opened here: https://github.com/datahq/dataflows/blob/master/dataflows/processors/load.py#L172
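For reference, the pipeline shape being described is roughly the following (placeholder paths; `Flow`, `load`, and `dump_to_path` are the standard dataflows entry points):

```python
from dataflows import Flow, load, dump_to_path

# Placeholder paths standing in for the 600+ small files
paths = [f'data/file_{i:03d}.csv' for i in range(600)]

# One load step per file: each step's process_datapackage runs up front,
# before any step's process_resource gets a chance to close its stream
Flow(
    *[load(p) for p in paths],
    dump_to_path('out'),
).process()
```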
I believe the issue is that the `process_datapackage` function is called for EVERY load step first, before `process_resource` is called, and therefore before the stream is actually closed. I think closing the stream inside the `process_datapackage` function, and then reopening it in the `process_resource` function, would fix my problem.

Before I implement the fix, I wanted to check in and get your thoughts on whether my diagnosis actually makes sense. Is it possible that these 600+ open streams are causing an issue? Is there something else in `process_datapackage` that could feasibly be causing it?
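A minimal sketch of the proposed lazy-open pattern, using simplified names rather than the actual dataflows internals (the real `load` processor does much more than this):

```python
from tabulator import Stream

class LazyLoad:
    """Sketch: hold a stream open only while it is actually needed."""

    def __init__(self, source, **options):
        self.source = source
        self.options = options

    def process_datapackage(self, dp):
        # Open just long enough to infer headers, then close immediately,
        # so 600+ pending load steps don't each hold an open stream
        stream = Stream(self.source, headers=1, **self.options).open()
        headers = stream.headers
        stream.close()
        # ... add a resource descriptor built from `headers` to `dp` ...
        return dp

    def process_resource(self):
        # Reopen only when this step's rows are actually consumed
        with Stream(self.source, headers=1, **self.options) as stream:
            yield from stream.iter(keyed=True)
```

The trade-off is that each source is opened twice (once for schema inference, once for reading), in exchange for holding at most one stream open at a time.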