Closed jyates-om1 closed 6 months ago
Thanks for the detailed report, I'll take a look at it in the following days.
If AsyncReader also leaks memory, it's possible that the leak is in the C part of the library. I'm unsure whether tracemalloc would correctly pick up stack frames from C without debug symbols; possibly memray alongside an aiocsv build with debug symbols would show more accurate locations of the leaked memory sources.
Got you, I can play around with it some after work hours as well. I haven't touched C since college though, so no promises!
Fixed in a2ca38ccd478e35b64fc6d5d74a484a28f72e993
Thanks so much for taking a look! I'll keep an eye out for a PyPI release, but I'd be happy to validate an RC build if you want.
If you really want to validate the changes, build the library from source (python -m build builds a wheel, and pip install /path/to/aiocsv automatically compiles and installs the library in the current environment, provided you have Python headers and a C compiler installed) and run with that. I won't be sharing pre-compiled wheels for non-release commits, as it's generally impossible to cross-compile them locally; wheels are automatically built for releases by cibuildwheel, and I don't want to waste runtime minutes on irrelevant builds.
I've released 1.3.2, pre-built wheels should be available in ~15 minutes.
Thanks for the help! I grabbed 1.3.2 via PyPI and the same code now runs without a memory leak.
Summary
When iterating over AsyncDictReader or AsyncReader, we are seeing large amounts of memory being used. After introducing a memory profiler, it looks like aiocsv continues to consume memory in amounts proportional to the size of the input file.
Background
We have been using the aiocsv library for transforming data from large csv files, and it has been awesome to use! Some of our files are huge (7 - 20 GB), so we want to be as memory-efficient as possible when reading them. We've noticed that AsyncDictReader and AsyncReader both consume more memory as time goes on while iterating over the csv using async for row in reader. This memory is then released when the context manager closes down.
I've been poking around in the aiocsv code a bit but haven't had any success figuring out where the memory leak is occurring, other than the reference to /usr/local/lib/python3.11/site-packages/aiocsv/readers.py:43 from the tracemalloc library. I'd be happy to contribute a PR if I find the source.
Sample Code
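A minimal sketch of the kind of measurement described above; it is not the original script from this report. It samples traced memory at regular intervals during an `async for` loop, so growth proportional to the number of rows read becomes visible. The helper names (`sample_memory`, `fake_rows`, `profile_csv`) and the sampling interval are assumptions made for illustration. `profile_csv` assumes aiofiles and aiocsv are installed and is not called in the demo.

```python
import asyncio
import tracemalloc


async def sample_memory(rows, every=1000):
    """Consume an async iterable, recording (row_count, traced_bytes) every `every` rows."""
    samples = []
    tracemalloc.start()
    try:
        count = 0
        async for _row in rows:
            count += 1
            if count % every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((count, current))
    finally:
        tracemalloc.stop()
    return samples


async def fake_rows(n):
    """Hypothetical stand-in for a CSV reader: yields n small rows, keeping none."""
    for i in range(n):
        yield [str(i), "a", "b"]


async def profile_csv(path):
    """Same measurement against a real file (assumes aiofiles and aiocsv are installed)."""
    import aiofiles
    from aiocsv import AsyncReader

    async with aiofiles.open(path, mode="r", encoding="utf-8", newline="") as afp:
        return await sample_memory(AsyncReader(afp))


if __name__ == "__main__":
    for count, current in asyncio.run(sample_memory(fake_rows(5000))):
        print(f"{count} rows: {current} bytes traced")
```

For a reader that does not leak, the traced byte count should stay roughly flat as the row count climbs; steadily rising samples would match the behaviour reported here.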
Sample memory profiler output
Environment info
versions: python=3.11.4 aiocsv==1.3.1 aiofiles==23.2.1
os: debian 10 (buster), run via docker