caseresearch / code-review

⛔️ [DEPRECATED] A repo for code review sessions at CAS. See https://github.com/swincas/code-review
MIT License

Reading in data that is too large for local memory #11

Closed jamesesdaile closed 6 years ago

jamesesdaile commented 7 years ago

If I'm reading in data that is too large for local memory, what options do I have other than reading line by line and passing each line through a function (this seems really slow in Python)?

What can I do if the file is small enough to fit in local memory, but large enough that the read-in time is really long?

James

manodeep commented 7 years ago

Could you please describe your data format, and perhaps post a snippet of the code you are trying to run on the data?

You can also attach files to this issue directly

jamesesdaile commented 7 years ago

The file format was originally binary (in big-endian format), which was read in entirely to get the relevant information. I then wrote out an ASCII file with the relevant data in a single column. In the end, writing this file back out as binary proved most useful, as I could then read it into memory quickly (it wasn't too big for this).
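As a minimal sketch of that binary round trip (assuming 64-bit big-endian floats, dtype `'>f8'`, and a throwaway temp file): `np.fromfile` reloads packed binary without any ASCII parsing, which is where the speed-up comes from.

```python
import os
import tempfile

import numpy as np

# Hypothetical single-column dataset standing in for the real data.
data = np.linspace(0.0, 1.0, 1000)
path = os.path.join(tempfile.mkdtemp(), "data.bin")

# '>f8' = big-endian 64-bit float, matching the original file's byte order.
data.astype(">f8").tofile(path)

# Reloading packed binary skips ASCII parsing entirely.
loaded = np.fromfile(path, dtype=">f8")
```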

The end product is a histogram. The relevant data in the ASCII file needs to be converted via a function to another value before being binned into the histogram.

```python
import numpy
import matplotlib.pyplot

# Read in data file
x = numpy.loadtxt('path_to_file')
# Convert into new value with function
x_new = f(x)
# Bin new values in histogram
counts, bins, patches = matplotlib.pyplot.hist(x_new)
```

Apparently pandas can read a file in 'chunks', which would allow faster processing than going line by line (if the data file is too big to fit in local memory)? Maybe we could cover general file I/O management in a code review, or go over my initial problem as a learning activity?
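A sketch of the pandas idea, assuming a single-column ASCII file and a placeholder conversion function `f` (both hypothetical here): `read_csv`'s `chunksize` option streams the file in pieces, so you can accumulate histogram counts against fixed bin edges without the whole dataset ever sitting in memory.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def f(x):
    # Hypothetical conversion function; replace with the real one.
    return 2.0 * x


# Write a small single-column file to stand in for the real data.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
np.savetxt(path, np.linspace(0.0, 1.0, 1000))

# Fixed bin edges so counts from different chunks can be summed.
bins = np.linspace(0.0, 2.0, 11)
counts = np.zeros(len(bins) - 1, dtype=np.int64)

# chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(path, header=None, chunksize=100):
    counts += np.histogram(f(chunk[0].to_numpy()), bins=bins)[0]
```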

manodeep commented 7 years ago

Do you have a working solution? If so, you could show us what worked, and perhaps what didn't.

karlglazebrook commented 7 years ago

When reading large ASCII files, you need to buffer, because they use memory much less efficiently than packed binary data. (I wrote PDL code that does it this way about 10 years ago.) Surprised loadtxt does not do this. Have you tried astropy.io.ascii.read()? Or you can convert to FITS first using IRAF (rspectext).
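The buffering idea can be sketched in plain Python (a hypothetical one-column file; the buffer size is arbitrary): parse a fixed number of lines at a time into a compact float array, so you never hold the whole text file, or one Python object per line, in memory at once.

```python
import itertools
import os
import tempfile

import numpy as np

# Hypothetical one-column ASCII file standing in for the real data.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
np.savetxt(path, np.arange(10.0))

total = 0.0
with open(path) as fh:
    while True:
        # Pull at most 4 lines into the buffer; islice reads lazily.
        buf = list(itertools.islice(fh, 4))
        if not buf:
            break
        # NumPy parses the string lines straight into a packed array.
        values = np.array(buf, dtype=np.float64)
        total += values.sum()
```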

manodeep commented 7 years ago

@karlglazebrook Looks like astropy.io.ascii.read does not support chunked reading. See Issue 3334

manodeep commented 7 years ago

@karlglazebrook I saw your comment on the astropy issue. The referenced PDL code does solve the case where the binary data fits in memory (but the ASCII data does not); I think there is potentially a more elegant solution involving generators and async I/O routines.
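The generator half of that idea might look like this (a sketch with a hypothetical one-column file; the async I/O overlap of reading and processing is omitted): each chunk is yielded lazily, so the consumer, here a running histogram, pulls data only as it needs it and only one chunk is ever in memory.

```python
import os
import tempfile

import numpy as np


def chunks(path, n=256):
    """Lazily yield float64 arrays of at most n values from a one-column ASCII file."""
    buf = []
    with open(path) as fh:
        for line in fh:
            buf.append(line)
            if len(buf) == n:
                yield np.array(buf, dtype=np.float64)
                buf = []
    if buf:  # flush the final partial chunk
        yield np.array(buf, dtype=np.float64)


# Hypothetical data file standing in for the real one.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
np.savetxt(path, np.arange(1000.0))

# Accumulate a histogram chunk by chunk over fixed bin edges.
counts = np.zeros(10, dtype=np.int64)
for arr in chunks(path):
    counts += np.histogram(arr, bins=10, range=(0.0, 1000.0))[0]
```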

manodeep commented 7 years ago

Solution in PR #12