The pandas-based code loaded /epyc/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2013-08-30_2013-09-30.h5 in 1:41:44:
loading observations... ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18844145 / 18844145 100% 1:41:44 0:00:00
This file also contains the observatory code for every observation, so my local version of precovery includes some changes to handle this new column; everything seems to have worked as intended.
I was unable to get a comparison time for the same window of observations as in the previous comment using the precovery code on master (71bc140):
from precovery import precovery_db
db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/hdf5/nsc_dr2_observations_2013-08-30_2013-09-30.h5"
db.frames.load_hdf5(observations_file)
The following error is thrown (truncated output):
~/software/anaconda3/envs/precovery_py39/lib/python3.9/site-packages/tables/group.py in _g_load_child(self, childname)
1176 childname = join_path(self._v_file.root_uep, childname)
1177 # Is the node a group or a leaf?
-> 1178 node_type = self._g_check_has_child(childname)
1179
1180 # Nodes that HDF5 report as H5G_UNKNOWN
~/software/anaconda3/envs/precovery_py39/lib/python3.9/site-packages/tables/group.py in _g_check_has_child(self, name)
391 node_type = self._g_get_objinfo(name)
392 if node_type == "NoSuchNode":
--> 393 raise NoSuchNodeError(
394 "group ``%s`` does not have a child named ``%s``"
395 % (self._v_pathname, name))
NoSuchNodeError: group ``/`` does not have a child named ``/data/table``
This seems to indicate that, at least for this particular window, the underlying file structure isn't the same as the one the code on master was written against.
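For reference, a quick way to check how a file is actually laid out is to list its keys with pandas; a minimal sketch (I haven't run this against the file above):
import pandas as pd

# List the groups/tables stored in the HDF5 file; if the layout differs from
# the /data/table node the tables-based loader expects, the keys will show it.
with pd.HDFStore(
    "/mnt/data/projects/thor/thor_data/nsc/hdf5/nsc_dr2_observations_2013-08-30_2013-09-30.h5",
    mode="r",
) as store:
    print(store.keys())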
To get some performance numbers, I then loaded the following window: nsc_dr2_observations_2019-08-09_2019-09-09, which we know worked on master (it's the example window). Limiting to the first 5000 frames:
Precovery @ 71bc140 (master branch):
from precovery import precovery_db
db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/hdf5/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:02:36 3:39:28
Then I repeated the same exercise with the code on this branch plus some minor local changes. The observations file is slightly different in that the observatory codes have been added and certain columns removed to reduce the total file size, but the underlying observations are the same.
Precovery @ e6a952b (this branch + minor local changes to load observatory codes from observation files):
from precovery import precovery_db
db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:04:52 5:31:43
So the pandas-based code is ~2x slower but has the benefit of allowing the underlying file formats to change. I might also be doing something inefficient in the current implementation; for example, loading observations visibly lags (the progress bar stutters) when it transitions to reading a new chunk. Iterating over the rows of a dataframe has also never been fast in my experience.
@spenczar Thoughts on how to approach this problem? We could try using pandas.HDFStore for data extraction at a lower level rather than iterating over a dataframe, or we could revert to the old code using tables and see if we can make it more robust to schema changes.
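For context, the lower-level pandas.HDFStore route would look roughly like this (a sketch only; the "data" key and the per-chunk handling are placeholders, not the exact code in this branch):
import pandas as pd

observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"

with pd.HDFStore(observations_file, mode="r") as store:
    # select() reads slices of the table directly; chunksize pages through
    # the file without materializing the whole table at once.
    for chunk in store.select("data", chunksize=100_000):
        pass  # per-chunk frame building would go here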
Swapping from pd.read_hdf(store, ...) to store.select(...) didn't yield any significant change in loading time:
from precovery import precovery_db
db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_master/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:04:56 5:30:14
This is nice, thanks.
2x slower is fine. We expect to run this rarely. It's much more important that we be able to load many different files, and that we're robust to changes in the underlying format.
My main concern was memory usage. Have you noticed any major difference in memory usage between the two?
I haven't been tracking but I'll try that next.
The main source of the bottlenecking is chunk.iterrows(); if we instead extract the np.ndarrays from the chunk and then iterate over those, the performance of the code on master and this branch approaches 1:1.
from precovery import precovery_db
db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_branch/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
db.frames.load_hdf5(observations_file, limit=5000)
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:02:50 4:08:58
This is only 14 seconds slower than the run reported earlier : loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:02:36 3:39:28
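For reference, the change amounts to something like this inside the chunk loop (self-contained toy example; the column names and per-row call are illustrative, not the exact schema or code):
import pandas as pd

# Toy chunk standing in for one chunk read from the observations file
chunk = pd.DataFrame({
    "mjd_utc": [59435.1, 59435.2],
    "ra": [10.0, 11.0],
    "dec": [-5.0, -4.0],
    "obs_id": ["a", "b"],
})

# Before: chunk.iterrows() constructs a pandas Series for every row, which
# dominates the runtime.
# After: pull the underlying numpy arrays out of the chunk once, then iterate
# over plain scalars with zip.
for mjd, ra, dec, obs_id in zip(
    chunk["mjd_utc"].values,
    chunk["ra"].values,
    chunk["dec"].values,
    chunk["obs_id"].values,
):
    print(mjd, ra, dec, obs_id)  # placeholder for the per-row frame-building logic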
Memory usage for code on this branch:
import tracemalloc
from precovery import precovery_db
db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_branch/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/preprocessed/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
tracemalloc.start()
db.frames.load_hdf5(observations_file, limit=5000)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:05:10 5:48:21
Current memory usage is 4.346746MB; Peak was 664.146418MB
Repeating the same with the master branch:
import tracemalloc
from precovery import precovery_db
db = precovery_db.PrecoveryDatabase.from_dir("/mnt/data/projects/thor/thor_data/nsc/precovery_branch/", create=True)
observations_file = "/mnt/data/projects/thor/thor_data/nsc/hdf5/nsc_dr2_observations_2019-08-09_2019-09-09.h5"
tracemalloc.start()
db.frames.load_hdf5(observations_file, limit=5000)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:04:52 5:54:48
Current memory usage is 2.387738MB; Peak was 6.209084MB
Memory tracing made the code run a lot slower in both cases. The memory burden is ~100x for the pandas-based version, but fortunately ~1 GB is still a fairly small amount to hold in memory.
Reducing the chunksize from 100000 to 10000 (1/10th) has the following effect on memory usage:
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:05:15 5:50:54
Current memory usage is 4.380576MB; Peak was 562.006212MB
Increasing the chunksize from 100000 to 1000000 (10x) has the following effect on memory usage:
loading observations... ╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1532469 / 68100991 2% 0:05:32 5:35:54
Current memory usage is 4.369954MB; Peak was 1680.281211MB
So we can tune the chunksize somewhat to fit system memory if need be, but overall the memory burden remains two or more orders of magnitude higher than on master.
To summarize the above comments:
The pandas-based code can handle changing file schemas and loads observations only a little slower (~14 seconds slower in the runs above) than the current code on master, but it comes at the cost of a ~100x increase in memory usage (from 6.2 MB to ~660 MB peak).
If this is okay and we decide to merge this PR, then the new observation files to use with the precovery code can be found in /epyc/projects/thor/thor_data/nsc/preprocessed. A few column names have changed and these changes are reflected in this branch.
I'll submit another PR to handle reading the observatory codes from the observation files later today.
Thanks, I like this change. 1GB of memory is not too bad; at least we aren't loading the whole file into memory or anything. That's what I really wanted to avoid.
One final commit has been added to make _obscode_from_exposure_id accept a string exposure ID, which helps avoid unnecessary encoding and decoding.
Reading hdf5 files and iterating over rows can be difficult, especially if the underlying file structure changes with the addition or removal of columns. This PR replaces tables with pandas to read hdf5 files. pandas adds some very convenient functionality to allow iteration over an hdf5 file in chunks. Note, however, that this only works if the hdf5 file was saved with format="table" (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html). I haven't tested the performance of this change.