SainsburyWellcomeCentre / aeon_mecha

Project Aeon's main library for interfacing with acquired data. Contains modules for raw data file I/O, data querying, data processing, data QC, database ingestion, and building computational data pipelines.
BSD 3-Clause "New" or "Revised" License

Optimization of database storage #396

Open · ttngu207 opened this issue 2 months ago

ttngu207 commented 2 months ago

The data stored in the MySQL server (on aeon-db2) is approaching 400 GB. This will hit the storage limit of the aeon-db2 server, at which point we will need to request a disk size increase.

I'd like to discuss potential solutions to better engineer/optimize for storage space in our next data architecture meeting.

The bulk of the storage is in the streams schema, where we extract and store the data from the relevant streams. Among these, the majority of the storage is in UndergroundFeederEncoder (the wheel data, ~100 GB) and SpinnakerVideoSourceVideo (~80 GB).

ttngu207 commented 2 months ago

I'm sharing here the current size (in GB) of each schema in the aeon-db2 database:

{'aeon_streams': 186.279919616,
 'aeon_archived_exp02_acquisition': 42.648174592,
 'aeon_worker': 0.092962816,
 'aeon_tracking': 32.597950464,
 'aeon_archived_exp01_analysis': 0.010092544,
 'aeon_analysis': 0.00188416,
 'aeon_archived_exp02_subject': 0.000835584,
 'aeon_archived_exp02_worker': 0.000131072,
 'aeon_report': 4.9152e-05,
 'aeon_block_analysis': 23.938367488,
 'aeon_archived_exp02_report': 0.002752512,
 'aeon_device': 3.2768e-05,
 'u_thinh_aeonfix': 0.00475136,
 'aeon_subject': 0.000638976,
 'aeon_archived_exp02_streams': 0.000507904,
 'aeon_archived_exp01_acquisition': 0.10944512,
 'aeon_archived_exp02_qc': 24.861868032,
 'aeon_archived_exp02_analysis': 4.933648384,
 'aeon_lab': 0.000442368,
 'aeon_archived_exp01_lab': 0.000425984,
 'aeon_archived_exp01_subject': 0.000294912,
 'aeon_archived_exp01_report': 0.041025536,
 'aeon_archived_exp01_qc': 26.919780352,
 'aeon_archived_exp02_lab': 0.000442368,
 'aeon_acquisition': 0.121913344,
 'aeon_archived_exp02_tracking': 9.951330304,
 'aeon_qc': 4.260446208,
 'aeon_archived_exp01_tracking': 1.720942592}
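
For reference, the per-schema footprint can be pulled straight from MySQL's information_schema; a minimal sketch, assuming a DataJoint connection already configured for aeon-db2 (the division by 1e9 matches the decimal-GB figures above):

    import datajoint as dj

    # Sum data + index length per schema, excluding MySQL's system schemas.
    conn = dj.conn()
    cursor = conn.query(
        "SELECT table_schema, SUM(data_length + index_length) / 1e9"
        " FROM information_schema.tables"
        " WHERE table_schema NOT IN"
        " ('mysql', 'sys', 'information_schema', 'performance_schema')"
        " GROUP BY table_schema"
    )
    sizes_gb = {schema: float(size) for schema, size in cursor.fetchall()}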
jkbhagatio commented 2 months ago

Ah yes, I'm happy with downsampling the wheel data to 50 Hz as something quick we can do immediately.

lochhh commented 2 months ago

@ttngu207 proposed two possible ways to downsample the wheel data; these are detailed in the comment below.

@ttngu207 feel free to expand/correct my comment here

ttngu207 commented 2 months ago

Hi @jkbhagatio, we discussed this issue a bit more with @lochhh and @anayapouget today.

Downsampling the wheel data to 50 Hz is a good immediate step we can take to optimize storage. I think there are two ways to accomplish this:

  1. Add logic to do the downsampling at ingestion time with DataJoint (see the sketch after this list).
    • Pros: simple and easy
    • Cons: a mismatch between how the data are stored in files vs. in DataJoint. For all of the tables in streams we have maintained this consistency, so breaking it here is something to think about
  2. Create another set of downsampled wheel data files in the processed folder; these would then be ingested.
    • Pros: zero change needed in the current ingestion routine, and consistency between files and ingested data is maintained
    • Cons: an additional routine is needed to do the downsampling and store the downsampled files in processed (perhaps together with, or after, the data copy step)
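
For concreteness, a minimal sketch of option 1 (the table definition is simplified and read_encoder_chunk is a hypothetical stand-in for the existing raw-file reading step):

    import datajoint as dj

    schema = dj.schema("aeon_streams")

    @schema
    class UndergroundFeederEncoder(dj.Imported):
        definition = """  # illustrative, simplified definition
        chunk_start: datetime  # start of the 1h acquisition chunk
        ---
        timestamps: longblob
        angle: longblob
        """

        def make(self, key):
            data = read_encoder_chunk(key["chunk_start"])  # hypothetical raw-read step
            data = data.resample("20ms").first()  # downsample to 50 Hz before insert
            self.insert1({**key, "timestamps": data.index.values, "angle": data["angle"].values})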
ttngu207 commented 2 months ago

Technically, at the streams level, we want to maintain consistency between the ingested data and the "read" data, i.e. the dataframe returned by the StreamReader class (not strictly the data in the raw files). So perhaps a third option to consider is to update the StreamReader for the wheel data (io.reader.Encoder) to take a downsampling keyword argument (or downsampled_rate), and to use this option for ingestion.

This approach would also require zero change to the current ingestion flow (we would only need to update the reader in aeon.schemas.schema), and it still maintains consistency between the data returned by the reader and the data ingested.
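
A sketch of what this third option might look like, assuming the current Encoder subclasses a Harp reader with angle and intensity columns (the flag name and decimation step are illustrative):

    import aeon.io.reader as _reader

    class Encoder(_reader.Harp):
        """Extracts wheel encoder data, optionally downsampled at read time."""

        def __init__(self, pattern, downsample=True):
            super().__init__(pattern, columns=["angle", "intensity"])
            self.downsample = downsample

        def read(self, file):
            data = super().read(file)
            if self.downsample:
                data = data.resample("20ms").first().dropna()  # decimate to 50 Hz
            return data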

ttngu207 commented 2 months ago

@jkbhagatio @glopesdev Let us know your thoughts on this; if this is a reasonable approach, I'll start the implementation.

Currently, data ingestion for social0.4 is paused due to this storage problem.

glopesdev commented 2 months ago

@ttngu207 for the wheel data, option 3 sounds preferable for now, especially since it avoids data duplication.

For the video CSV it's a bit harder to imagine a compression scheme, especially since the numbers suggest the stored data is already in binary form. The only other option would be to store only timestamps and table indices, and pull data on demand post-query if needed.

I can push to implement the downsampling parameter ASAP for wheels since it seems this would unblock ingestion the fastest.

ttngu207 commented 2 months ago

@glopesdev agreed, I also like option 3 best. The video CSV is definitely trickier; we can deal with that later. It would be great if we could prioritize the downsampling parameter in io.reader.Encoder, as that would allow ingestion to resume.

jerlich commented 2 months ago

What is the underlying database engine? InnoDB? Is page compression enabled? You can also get savings depending on the underlying file system.
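
For reference, both questions can be answered from information_schema; a sketch using the same DataJoint connection (the ALTER TABLE line is the generic InnoDB recipe, not something already applied here):

    import datajoint as dj

    # Inspect engine, row format and options of the tables in the big schema.
    conn = dj.conn()
    rows = conn.query(
        "SELECT table_name, engine, row_format, create_options"
        " FROM information_schema.tables"
        " WHERE table_schema = 'aeon_streams'"
    ).fetchall()

    # If these are plain InnoDB tables, per-table compression can be enabled:
    #   ALTER TABLE aeon_streams.<table> ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;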

jkbhagatio commented 2 months ago

@ttngu207 @glopesdev, I'm also happy with option 3 mentioned above.

glopesdev commented 1 month ago

@jkbhagatio from our meeting today we were converging on a downsampling strategy and API for the wheel data. Since the encoder reports absolute wheel position, we can simply decimate the data to 50 Hz by sampling the raw stream at fixed 20 ms intervals inside each 1 h chunk (no filtering is needed).
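
In pandas terms the decimation is a one-liner; a sketch (the committed implementation may differ in details such as bin alignment):

    import pandas as pd

    def decimate_wheel(data: pd.DataFrame, chunk_start: pd.Timestamp) -> pd.DataFrame:
        """Keep the first raw sample in each 20 ms bin, aligned to the chunk start.

        The encoder reports absolute position, so dropping intermediate
        samples loses nothing and no anti-aliasing filter is needed.
        """
        return data.resample("20ms", origin=chunk_start).first().dropna()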

This would be implemented as a property at the level of the reader.Encoder class so it can be toggled in the schema (or we can set the default to True).

For the video metadata, one possibility we discussed for immediately gaining some space is to drop the hw_counter column from the ingested data. We should be able to reconstruct dropped frames at both the TTL and software levels by comparing the camera timestamp directly with the harp timestamp, and this would give us back around a third of the space.
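
A sketch of the kind of timestamp-only check this would rely on (the 50 fps rate and the 1.5-period threshold are assumptions for illustration):

    import pandas as pd

    def flag_dropped_frames(video: pd.DataFrame, fps: float = 50.0) -> pd.Series:
        """Flag likely dropped frames from the harp timestamp index alone.

        Without hw_counter, a gap much larger than one frame period between
        consecutive harp timestamps indicates one or more dropped frames.
        """
        period = pd.Timedelta(seconds=1.0 / fps)
        return video.index.to_series().diff() > 1.5 * period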

If this sounds good I can submit a PR soon.

jkbhagatio commented 1 month ago

@glopesdev this sounds good regarding the changes to Encoder; happy for that immediately. Regarding the video metadata, can we discuss the hw_counter in a bit more detail?

glopesdev commented 1 month ago

@ttngu207 @jkbhagatio @lochhh I have a working implementation of the downsampling operation at https://github.com/SainsburyWellcomeCentre/aeon_mecha/commit/3340df25977b1c0826dce02848e5713fcb68c618.

However, I am now conflicted as to whether having downsample as a property of the reader/stream is a good idea. This came up because of testing: the few tests we have of the low-level API were actually using encoder data on a chunk with non-monotonic timestamps, and resampling changes the behavior there.

This makes sense, but at the same time it reminded me that some of the more subtle timestamping issues were only caught by carefully inspecting this low-level raw data, and I would like to retain the ability to easily query the raw data if necessary.

The problem is that if we implement downsample as a property at the level of the schema stream, then whoever calls aeon.load with that stream has no way to see the raw data, since the flag is essentially frozen at schema creation time.

Therefore I would like to provide a way to override reader behavior at aeon.load time as well, when necessary. This made me think that a general way to support this would be to have readers accept optional **kwargs and have aeon.load forward any extra kwargs it receives to the read method of the reader passed in its function arguments.

This would solve my dilemma with the encoder testing and future debugging, and more generally it gives us a straightforward way to add reader options that can be toggled at aeon.load time, rather than only when configuring the reader at schema creation time.

As an illustration, this would give us the following signatures and call (downsampling by default):

Reader:

    def read(self, file, downsample=True):
        """Reads encoder data from the specified Harp binary file,
        optionally downsampling to 50 Hz.
        """

Load:

    # if we want to look at raw data we can optionally clear the downsample flag
    patch_wheel = aeon.load(root, exp02.Patch2.Encoder, epoch=behavior_epoch, downsample=False)
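
Under the hood the forwarding could be as simple as this (a hypothetical simplification of aeon.load; find_chunk_files stands in for the existing chunk-globbing logic):

    import pandas as pd

    def load(root, reader, start=None, end=None, **kwargs):
        """Load a stream, forwarding any extra kwargs verbatim to reader.read."""
        files = find_chunk_files(root, reader.pattern, start, end)  # hypothetical helper
        return pd.concat([reader.read(file, **kwargs) for file in files])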

Any thoughts on this? I've opened a PR with a proposed implementation.

ttngu207 commented 1 month ago

@glopesdev Thanks, the implementation looks good to me.

I'm fine with downsample being an additional argument to load, but it seems a bit one-off and may break the elegance of the current implementation/usage of .load() (or not).

How about downsample remaining a property of the reader, with get/set methods added so it can be changed on the fly before calling load() (rather than being frozen at schema creation time)?
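
i.e., something along these lines (a sketch of the suggestion, reusing the names from the example above):

    # The reader configured at schema creation stays the single source of truth,
    # but the flag can be flipped on the fly for a one-off raw query.
    exp02.Patch2.Encoder.downsample = False
    raw_wheel = aeon.load(root, exp02.Patch2.Encoder, epoch=behavior_epoch)
    exp02.Patch2.Encoder.downsample = True  # restore the default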

ttngu207 commented 1 month ago

I've reviewed the PR; allowing any additional kwargs to be passed through load() is also very clean and potentially very useful!

jkbhagatio commented 1 month ago

Hi @glopesdev, this makes sense to me conceptually, but I might suggest using a specific arg name (holding a dict) instead of **kwargs, to differentiate args passed to load itself from args passed to read via load.

I think this also implies that the base class Reader's read should accept this new dict param, which I guess can default to None or an empty dict?
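
Concretely, the call site might then look like this (read_kwargs is a hypothetical name for the new dict param):

    # A dedicated dict keeps reader options separate from load's own arguments,
    # avoiding name collisions that open-ended **kwargs could introduce.
    patch_wheel = aeon.load(
        root, exp02.Patch2.Encoder, epoch=behavior_epoch,
        read_kwargs={"downsample": False},  # forwarded as reader.read(file, **read_kwargs)
    )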

glopesdev commented 1 month ago

@jkbhagatio can you review directly in the PR? Just easier to iterate the code implementation there.