Open Gabriel-p opened 2 years ago
Do you have an example file and the benchmarks? Thanks!
The pd.read_csv() call uses less than 2 GB of memory on my system to load the data file. If I call ascii.read() instead, I have to manually kill the process after it has used almost 5 GB, with the data file still not read.
import pandas as pd
from astropy.io import ascii

path = "upk_53.dat"
# Run one call at a time and watch the process memory:
data = ascii.read(path, delimiter=' ')   # killed after ~5 GB, file still not loaded
data = pd.read_csv(path, delimiter=' ')  # completes using < 2 GB
The example file given above appears to be lost to the limbo of the cloud. However, the issue is simple enough to demonstrate with fake data:
# write.py
import numpy as np

arr = np.ones((32 * 1024, 1024), dtype="float64")  # roughly 800 MB on disk as text
np.savetxt("/tmp/test.csv", arr, delimiter=",")    # comma-separated, so both readers below parse it
# read_pandas.py
import pandas as pd
pd.read_csv("/tmp/test.csv", index_col=False)
# read_astropy.py
from astropy.io import ascii
ascii.read("/tmp/test.csv")
Using memray, I observe that pandas is indeed pretty efficient (consuming ~1.2 GB of RAM for a ~800 MB file), while astropy takes way too much (about 16 GB) for the same file.
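For anyone who wants to reproduce the measurement without installing memray, here is a rough stand-in built on the standard library's tracemalloc (it only tracks Python-level allocations, so the absolute numbers will not match memray's exactly):

# measure.py -- minimal sketch using the stdlib's tracemalloc
import tracemalloc
from astropy.io import ascii

tracemalloc.start()
ascii.read("/tmp/test.csv")
current, peak = tracemalloc.get_traced_memory()
print(f"peak Python allocations: {peak / 1e9:.2f} GB")
tracemalloc.stop()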
memray also pinpoints the problem to the following line: https://github.com/astropy/astropy/blob/cec24e83efea38ee18b80b7b34693a209d07d595/astropy/io/ascii/core.py#L336

Here, we consume about 8 GB of memory to construct a temporary list of strings each time we run it, and for reasons that are not yet clear to me, it seems that two instances of that list co-exist at some point, and there go our 16 GB. I do stress that these lists are temporary and are garbage-collected after the reading process is over, so it seems likely that we could avoid the excessive memory consumption by using an iterator instead of a full-blown list. I'm going to give it a try.
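To make the pattern concrete, here is a stripped-down illustration (a sketch of the general idea, not the actual core.py code): the eager version materializes one Python str per line up front, while the lazy version keeps a single line alive at a time:

# Eager: the whole file plus one str object per line live in memory at once.
def get_lines_eager(path):
    with open(path) as f:
        return f.read().splitlines()

# Lazy: only the line currently being parsed is alive.
def get_lines_lazy(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")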
Writing a lazy line generator is simple enough; the difficult part is refactoring the internals of io.ascii, because many places expect lines to still be available after the first loop over them, whereas a generator is consumed by it. I'll give it some more time soon (hopefully tomorrow unless something comes up).
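To illustrate the obstacle and one possible way around it (a sketch with made-up names, not a patch): a plain generator is exhausted after one pass, while a wrapper that re-opens the file on every __iter__ call can be traversed repeatedly without ever holding all lines in memory:

class ReiterableLines:
    # Hypothetical helper: each call to __iter__ re-opens the file, so code
    # that loops over `lines` more than once keeps working.
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")

lines = ReiterableLines("/tmp/test.csv")
# With a plain generator, the second pass would yield nothing:
assert sum(1 for _ in lines) == sum(1 for _ in lines)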
Is there a need for this module to be around? Why not just let pandas handle the data IO?
Why not just let pandas handle the data IO?
pandas does not integrate with all the astropy features, e.g., ECSV, Quantity, etc.
Could we mitigate this by adding the ability for io.ascii to load the data in chunks and then vstack them at the end? Something like the sketch below.
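Here is a rough sketch of that idea, using pandas' chunked reader as a stand-in for a hypothetical io.ascii chunked mode (the chunksize value and the conversion step are illustrative only):

import pandas as pd
from astropy.table import Table, vstack

# Parse the file in bounded chunks, then stack them into a single Table.
chunks = [Table.from_pandas(chunk)
          for chunk in pd.read_csv("/tmp/test.csv", chunksize=100_000)]
data = vstack(chunks)

The final table still holds all the data, of course, but the peak overhead of the parsing stage stays bounded by the chunk size.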
Alternatively, what if we allowed lines to be a list-like object which lazily loads data as needed/accessed? We might then be able to avoid modifying a lot of the existing code. A rough sketch follows.
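Something like this, say (all names hypothetical): index the byte offset of every line once, then materialize individual lines only on access, so the index costs one integer per line instead of one str per line:

class LazyLines:
    # Hypothetical list-like lines object: stores one byte offset per line
    # and reads a line from disk only when it is indexed.
    def __init__(self, path):
        self.path = path
        self._offsets = []
        with open(path, "rb") as f:
            offset = f.tell()
            line = f.readline()
            while line:
                self._offsets.append(offset)
                offset = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        with open(self.path, "rb") as f:
            f.seek(self._offsets[i])
            return f.readline().decode().rstrip("\n")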
We already have chunked loading (#6458), and the memory issue is not new (#7871, #3334). It would indeed be better to use a generator and load lines only when needed, but as far as I remember from looking at this long ago, that does not work with header parsing, guessing mode, and a few other places where the first lines are read multiple times. It might be fixable, but it is not easy given the number of readers.
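For reference, the chunked loading from #6458 is exposed through the fast reader options; if I remember the docs correctly, it looks roughly like this (chunk_size is in bytes, and the format must be one of the fast readers):

from astropy.io import ascii

# Read and parse the file in ~100 MB pieces instead of all at once.
data = ascii.read("/tmp/test.csv", format="fast_csv",
                  fast_reader={"chunk_size": 100_000_000})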
Thanks for pointing these out @saimn! I've started experimenting with a generator, and indeed the first hurdle to overcome is that some lines need to be read more than once, but I'll keep thinking about it.
Description
io.ascii.read() uses too much memory.

Expected behavior
A memory footprint similar to that of pandas.

Actual behavior
Uses far more memory than pandas.read_csv() to load the exact same file.

Steps to Reproduce
Load the same file with ascii.read() and with pd.read_csv(), and compare the memory used by each process (see the scripts above).

System Details