B612-Asteroid-Institute / precovery

Fast precovery of small body observations at scale
BSD 3-Clause "New" or "Revised" License

Use pandas to read observations #2

Closed moeyensj closed 2 years ago

moeyensj commented 2 years ago

Reading hdf5 files and iterating over rows can be difficult, especially if the underlying file structure changes with the addition or removal of columns. This PR replaces tables with pandas to read hdf5 files. pandas adds some very convenient functionality that allows iterating over an hdf5 file in chunks. Note, however, that this only works if the hdf5 file was saved with format="table" (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html).
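A minimal sketch of the table-format requirement (the file path and column names here are made up for illustration, and the PyTables "tables" package must be installed as the HDF5 backend):

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a real observations file.
path = os.path.join(tempfile.mkdtemp(), "observations.h5")

# format="table" is required for chunked reads; the default ("fixed")
# can only be read back in one piece.
df = pd.DataFrame({"ra": [10.1, 10.2, 10.3], "dec": [-5.0, -5.1, -5.2]})
df.to_hdf(path, key="data", format="table")

# Iterate over the file in chunks without loading it all into memory.
total = 0
for chunk in pd.read_hdf(path, key="data", chunksize=2):
    total += len(chunk)
print(total)
```

With format="fixed", the same pd.read_hdf call with chunksize raises a TypeError, which is why the writer side matters here.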

I haven't tested the performance of this change.

moeyensj commented 2 years ago

The pandas-based code loaded /epyc/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2013-08-30_2013-09-30.h5 in 1:41:44: loading observations... ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18844145 / 18844145 100% 1:41:44 0:00:00

This file also contains the observatory code for every observation, so my local version of precovery contains some changes to handle this new column, but everything seems to have worked as intended.

moeyensj commented 2 years ago

I was unable to get a comparison time for the same window of observations as the previous comment using the precovery code on master (71bc140).

from precovery import precovery_db

db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/hdf5/nsc_dr2_observations_2013-08-30_2013-09-30.h5"
db.frames.load_hdf5(observations_file)

The following error is thrown (truncated output):

~/software/anaconda3/envs/precovery_py39/lib/python3.9/site-packages/tables/group.py in _g_load_child(self, childname)
   1176             childname = join_path(self._v_file.root_uep, childname)
   1177         # Is the node a group or a leaf?
-> 1178         node_type = self._g_check_has_child(childname)
   1179 
   1180         # Nodes that HDF5 report as H5G_UNKNOWN

~/software/anaconda3/envs/precovery_py39/lib/python3.9/site-packages/tables/group.py in _g_check_has_child(self, name)
    391         node_type = self._g_get_objinfo(name)
    392         if node_type == "NoSuchNode":
--> 393             raise NoSuchNodeError(
    394                 "group ``%s`` does not have a child named ``%s``"
    395                 % (self._v_pathname, name))

NoSuchNodeError: group ``/`` does not have a child named ``/data/table``

This seems to indicate that, at least for this particular window, the underlying file structure isn't the same as the one on which the code was built?

To get some performance numbers, I then loaded the following window: nsc_dr2_observations_2019-08-09_2019-09-09, which we know works on master (it's the example window), limiting to the first 5000 frames. Precovery @ 71bc140 (master branch):

from precovery import precovery_db

db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/hdf5/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:02:36 3:39:28

Then I repeated the same exercise with the code on this branch plus some minor local changes. The observations file is slightly different in that the observatory codes have been added and certain columns removed to reduce the total file size, but the underlying observations are the same.
Precovery @ e6a952b (this branch + minor local changes to load observatory codes from observation files):

from precovery import precovery_db

db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:04:52 5:31:43

So the pandas-based code is ~2x slower but has the benefit of allowing the underlying file formats to change. I might also be doing something inefficient in the current implementation. For example, loading observations visibly lags (the progress bar stutters) when it transitions to reading a new chunk. Iterating over the rows of a dataframe is also something that has never been fast in my experience.

@spenczar Thoughts on how to approach this problem? We could try using pandas.HDFStore for data extraction at a lower level rather than iterating over a dataframe. We could also revert to the old code using tables and see if we can make it more robust to schema changes.
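A sketch of what the lower-level pandas.HDFStore route might look like, reading explicit start/stop row ranges instead of iterating over one large dataframe (the file path, key, and chunk size are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Build a small table-format store to read back (stand-in for the real file).
path = os.path.join(tempfile.mkdtemp(), "observations.h5")
pd.DataFrame({"ra": range(10)}).to_hdf(path, key="data", format="table")

# Pull row ranges directly from the store instead of materializing
# one large DataFrame and iterating over its rows.
rows_read = 0
with pd.HDFStore(path, mode="r") as store:
    nrows = store.get_storer("data").nrows
    for start in range(0, nrows, 4):
        chunk = store.select("data", start=start, stop=start + 4)
        rows_read += len(chunk)  # process the chunk here
print(rows_read)
```

store.select can also push column selection and row filters down to PyTables, which may help if only a subset of columns is needed per frame.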

moeyensj commented 2 years ago

Swapping from pd.read_hdf(store, ...) to store.select(...) didn't yield any significant change in loading time:

from precovery import precovery_db

db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:04:56 5:30:14
spenczar commented 2 years ago

This is nice, thanks.

2x slower is fine. We expect to run this rarely. It's much more important that we be able to load many different files, and that we're durable to changes in the underlying format.

My main concern was memory usage. Have you noticed any major difference in memory usage between the two?

moeyensj commented 2 years ago

My main concern was memory usage. Have you noticed any major difference in memory usage between the two?

I haven't been tracking but I'll try that next.

The main source of the bottleneck is chunk.iterrows(); if we instead extract the np.ndarrays from the chunk and then iterate over those, the performance between the code on master and this branch approaches 1:1.
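A sketch of the difference (hypothetical column names): pulling out the underlying arrays once avoids constructing a Series object for every row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
chunk = pd.DataFrame({"ra": rng.random(1000), "dec": rng.random(1000)})

# Slow path: iterrows() constructs a new Series object for every row.
slow = [(row["ra"], row["dec"]) for _, row in chunk.iterrows()]

# Faster path: extract the columns as np.ndarrays once, then iterate over those.
ra = chunk["ra"].to_numpy()
dec = chunk["dec"].to_numpy()
fast = list(zip(ra, dec))

assert slow == fast  # same values, far less per-row Python object churn
```

For mixed-dtype frames, iterrows also upcasts values row by row, so the ndarray path sidesteps that conversion cost as well.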

from precovery import precovery_db

db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_branch/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:02:50 4:08:58

This is only 14 seconds slower than the run reported earlier : loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:02:36 3:39:28

moeyensj commented 2 years ago

Memory usage for code on this branch:

import tracemalloc
from precovery import precovery_db

db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_branch/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"

tracemalloc.start()
db.frames.load_hdf5(observations_file, limit=5000)

current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:05:10 5:48:21
Current memory usage is 4.346746MB; Peak was 664.146418MB

Repeating the same with the master branch:

import tracemalloc
from precovery import precovery_db

db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_branch/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/hdf5/nsc_dr2_observations_2019-08-09_2019-09-09.h5"

tracemalloc.start()
db.frames.load_hdf5(observations_file, limit=5000)

current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:04:52 5:54:48
Current memory usage is 2.387738MB; Peak was 6.209084MB

Memory tracing made the code run a lot slower in both cases. The memory burden is ~100x for the pandas-based version, but fortunately a peak under 1 GB is still a manageable amount of memory.

moeyensj commented 2 years ago

Reducing the chunksize from 100000 to 10000 (1/10th) has the following effect on memory usage:

loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:05:15 5:50:54
Current memory usage is 4.380576MB; Peak was 562.006212MB

Increasing the chunksize from 100000 to 1000000 (10x) has the following effect on memory usage:

loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991    2% 0:05:32 5:35:54
Current memory usage is 4.369954MB; Peak was 1680.281211MB

So we can tune the chunksize to fit system memory if need be, but overall the memory burden remains two or more orders of magnitude higher.

moeyensj commented 2 years ago

To summarize the above comments: the pandas-based code can handle changing file schemas and loads observations only a little slower (~14 seconds) than the current code on master, but at the cost of a ~100x increase in memory usage (from 6.2 to ~660 MB).

If this is okay and we decide to merge this PR, then the new observation files to use with the precovery code can be found in: /epyc/projects/thor/thor_data/nsc/preprocessed. A few column names have changed and these changes are reflected in this branch.

I'll submit another PR to handle reading the observatory codes from the observation files later today.

spenczar commented 2 years ago

Thanks, I like this change. 1GB of memory is not too bad; at least we aren't loading the whole file into memory or anything. That's what I really wanted to avoid.

moeyensj commented 2 years ago

One final commit has been added to make _obscode_from_exposure_id accept a string exposure ID which helps avoid unnecessary encoding and decoding.