Open Gabriel-p opened 2 years ago
Do you have an example file and the benchmarks? Thanks!
The pd.read_csv() call uses less than 2 GB of memory on my system to load the data file. If I call ascii.read() instead, I have to manually kill the process after it has used almost 5 GB, with the data file still not read.
import pandas as pd
from astropy.io import ascii

path = "upk_53.dat"
# Run one call at a time and watch the process memory:
data = ascii.read(path, delimiter=' ')   # killed after ~5 GB, file still not loaded
data = pd.read_csv(path, delimiter=' ')  # completes using < 2 GB
The example file given above appears to be lost to the limbo of the cloud. However, the issue is simple enough to demonstrate with fake data:
# write.py
import numpy as np

arr = np.ones((32 * 1024, 1024), dtype="float64")  # roughly 800 MB on disk as text
np.savetxt("/tmp/test.csv", arr, delimiter=",")    # comma-separated, so both readers below parse it
# read_pandas.py
import pandas as pd
pd.read_csv("/tmp/test.csv", index_col=False)
# read_astropy.py
from astropy.io import ascii
ascii.read("/tmp/test.csv")
Using memray, I observe that pandas is indeed pretty efficient (consuming ~1.2 GB of RAM for a ~800 MB file), while astropy takes way too much (about 16 GB) for the same file.
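For anyone who wants to reproduce the measurement without installing memray, here is a rough stand-in built on the standard library's tracemalloc (it only tracks Python-level allocations, so the absolute numbers will not match memray's exactly):

# measure.py -- minimal sketch using the stdlib's tracemalloc
import tracemalloc
from astropy.io import ascii

tracemalloc.start()
ascii.read("/tmp/test.csv")
current, peak = tracemalloc.get_traced_memory()
print(f"peak Python allocations: {peak / 1e9:.2f} GB")
tracemalloc.stop()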
memray also pinpoints the problem to the following line: https://github.com/astropy/astropy/blob/cec24e83efea38ee18b80b7b34693a209d07d595/astropy/io/ascii/core.py#L336

Here, we consume about 8 GB of memory to construct a temporary list of strings each time we run it, and for reasons that are not yet clear to me, it seems that two instances of that list co-exist at some point, and there go our 16 GB. I do stress that these lists are temporary and are garbage-collected after the reading process is over, so it seems likely that we could avoid the excessive memory consumption by using an iterator instead of a full-blown list. I'm going to give it a try.
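To make the pattern concrete, here is a stripped-down illustration (a sketch of the general idea, not the actual core.py code): the eager version materializes one Python str per line up front, while the lazy version keeps a single line alive at a time:

# Eager: the whole file plus one str object per line live in memory at once.
def get_lines_eager(path):
    with open(path) as f:
        return f.read().splitlines()

# Lazy: only the line currently being parsed is alive.
def get_lines_lazy(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")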
Writing a lazy line generator is simple enough; the difficult part is refactoring the internals of io.ascii, because many places expect lines to still be available after the first loop over them, whereas a generator is consumed by it. I'll give it some more time soon (hopefully tomorrow unless something comes up).
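To illustrate the obstacle and one possible way around it (a sketch with made-up names, not a patch): a plain generator is exhausted after one pass, while a wrapper that re-opens the file on every __iter__ call can be traversed repeatedly without ever holding all lines in memory:

class ReiterableLines:
    # Hypothetical helper: each call to __iter__ re-opens the file, so code
    # that loops over `lines` more than once keeps working.
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n")

lines = ReiterableLines("/tmp/test.csv")
# With a plain generator, the second pass would yield nothing:
assert sum(1 for _ in lines) == sum(1 for _ in lines)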
Is there a need for this module to be around? Why not just let pandas handle the data IO?
Why not just let pandas handle the data IO?
pandas does not integrate with all the astropy features, e.g., ECSV, Quantity, etc.
Could we mitigate this by adding the ability for io.ascii to load the data in chunks and then vstack them at the end? Something like the sketch below.
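Here is a rough sketch of that idea, using pandas' chunked reader as a stand-in for a hypothetical io.ascii chunked mode (the chunksize value and the conversion step are illustrative only):

import pandas as pd
from astropy.table import Table, vstack

# Parse the file in bounded chunks, then stack them into a single Table.
chunks = [Table.from_pandas(chunk)
          for chunk in pd.read_csv("/tmp/test.csv", chunksize=100_000)]
data = vstack(chunks)

The final table still holds all the data, of course, but the peak overhead of the parsing stage stays bounded by the chunk size.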
Alternatively, what if we allowed lines to be a list-like object which lazily loads data as needed/accessed? We might then be able to avoid modifying a lot of the existing code. A rough sketch follows.
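Something like this, say (all names hypothetical): index the byte offset of every line once, then materialize individual lines only on access, so the index costs one integer per line instead of one str per line:

class LazyLines:
    # Hypothetical list-like lines object: stores one byte offset per line
    # and reads a line from disk only when it is indexed.
    def __init__(self, path):
        self.path = path
        self._offsets = []
        with open(path, "rb") as f:
            offset = f.tell()
            line = f.readline()
            while line:
                self._offsets.append(offset)
                offset = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        with open(self.path, "rb") as f:
            f.seek(self._offsets[i])
            return f.readline().decode().rstrip("\n")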
We already have chunked loading (#6458), and the memory issue is not new (#7871, #3334). It would indeed be better to use a generator and load lines only when needed, but as far as I remember from looking at this long ago, that does not work with header parsing, guessing mode, and a few other places where the first lines are read multiple times. It might be fixable, but it is not easy given the number of readers.
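For reference, the chunked loading from #6458 is exposed through the fast reader options; if I remember the docs correctly, it looks roughly like this (chunk_size is in bytes, and the format must be one of the fast readers):

from astropy.io import ascii

# Read and parse the file in ~100 MB pieces instead of all at once.
data = ascii.read("/tmp/test.csv", format="fast_csv",
                  fast_reader={"chunk_size": 100_000_000})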
Thanks for pointing these out @saimn! I've started experimenting with a generator, and indeed the first hurdle to overcome is that some lines need to be read more than once, but I'll keep thinking about it.
Description
io.ascii.read() uses too much memory.

Expected behavior
A memory footprint similar to that of pandas.

Actual behavior
Uses far more memory than pandas.read_csv() to load the exact same file.

Steps to Reproduce
Load the same file with ascii.read() and with pd.read_csv(), and compare the memory used by each process (see the scripts above).

System Details