ttngu207 opened this issue 2 months ago (status: Open)
I'm sharing here the current size (in GB) of each schema in the aeon-db2
database
{'aeon_streams': 186.279919616,
'aeon_archived_exp02_acquisition': 42.648174592,
'aeon_worker': 0.092962816,
'aeon_tracking': 32.597950464,
'aeon_archived_exp01_analysis': 0.010092544,
'aeon_analysis': 0.00188416,
'aeon_archived_exp02_subject': 0.000835584,
'aeon_archived_exp02_worker': 0.000131072,
'aeon_report': 4.9152e-05,
'aeon_block_analysis': 23.938367488,
'aeon_archived_exp02_report': 0.002752512,
'aeon_device': 3.2768e-05,
'u_thinh_aeonfix': 0.00475136,
'aeon_subject': 0.000638976,
'aeon_archived_exp02_streams': 0.000507904,
'aeon_archived_exp01_acquisition': 0.10944512,
'aeon_archived_exp02_qc': 24.861868032,
'aeon_archived_exp02_analysis': 4.933648384,
'aeon_lab': 0.000442368,
'aeon_archived_exp01_lab': 0.000425984,
'aeon_archived_exp01_subject': 0.000294912,
'aeon_archived_exp01_report': 0.041025536,
'aeon_archived_exp01_qc': 26.919780352,
'aeon_archived_exp02_lab': 0.000442368,
'aeon_acquisition': 0.121913344,
'aeon_archived_exp02_tracking': 9.951330304,
'aeon_qc': 4.260446208,
'aeon_archived_exp01_tracking': 1.720942592}
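For reference, per-schema sizes like the ones above can be computed from MySQL's information_schema; a minimal sketch, assuming a DataJoint connection configured for aeon-db2 (the exact query API may differ across DataJoint versions):

import datajoint as dj

conn = dj.conn()  # assumes dj.config points at aeon-db2
cursor = conn.query(
    "SELECT table_schema, SUM(data_length + index_length) "
    "FROM information_schema.tables GROUP BY table_schema"
)
schema_sizes = {
    schema: float(size_bytes or 0) / 1e9  # bytes -> GB (approximate on-disk size)
    for schema, size_bytes in cursor
    if schema.startswith(("aeon", "u_"))
}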
Ah yes, I'm happy to downsample the wheel data to 50 Hz as something quick we can do immediately.
@ttngu207 proposed 2 possible ways to downsample the wheel data:
1. downsample at ingestion time (but this is inconsistent with the current "read-only" way)
2. downsample into the processed folder before ingestion
@ttngu207 feel free to expand/correct my comment here
Hi @jkbhagatio, we discussed this issue a bit more with @lochhh and @anayapouget today.
Downsampling the wheel data to 50 Hz is a good immediate step we can take to optimize the storage. I think there are 2 ways to accomplish this:
1. Downsample at ingestion time, storing the downsampled data in the streams schema. However, we have been maintaining consistency between ingested and raw data, so breaking it here is something to think about.
2. Downsample the raw data and save it to the processed folder - this will get ingested from processed (perhaps together with, or after, the data copy step).
Technically, at the streams level, we want to maintain consistency between the ingested data and the "read" data - i.e. the dataframe returned by the StreamReader class (not strictly the data in the raw files).
So, perhaps a 3rd option to consider is to update the StreamReader for the wheel data (io.reader.Encoder) to have a downsampling keyword argument (or downsampled_rate), and use this option for ingestion.
This approach would require zero change to the current ingestion flow - we would only need to update the reader in aeon.schemas.schema - and it still maintains consistency between the data returned by the reader and the data ingested.
@jkbhagatio @glopesdev Let us know your thoughts on this; if this is a reasonable approach, I'll start the implementation.
Currently, data ingestion for social0.4 is paused due to this storage problem.
@ttngu207 for wheel data, option 3 sounds preferable for now, especially since it avoids data duplication.
For the video CSV it's a bit harder to imagine a compression scheme, especially since the numbers suggest the stored data is already in binary form. The only other option would be to store only timestamps and table indices, and pull data on demand post-query if needed.
I can push to implement the downsampling parameter ASAP for wheels since it seems this would unblock ingestion the fastest.
@glopesdev agreed, I also like option 3 best. The video CSV is definitely trickier; we can deal with that later.
It would be great if we could prioritize the downsampling parameter in io.reader.Encoder, as that would allow ingestion to resume.
What is the underlying database engine - InnoDB? Is page compression enabled? You can also get savings depending on the underlying file system.
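For illustration, InnoDB page compression can be enabled per table; a hedged sketch (table name hypothetical, and it requires file-per-table tablespaces on a file system that supports hole punching):

import datajoint as dj

conn = dj.conn()
# enable page compression on one large table, then rebuild it so existing pages are compressed
conn.query("ALTER TABLE `aeon_streams`.`some_large_table` COMPRESSION='zlib'")  # hypothetical table name
conn.query("OPTIMIZE TABLE `aeon_streams`.`some_large_table`")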
@ttngu207 @glopesdev, I'm also happy with option 3 mentioned above
@jkbhagatio from our meeting today we were converging on a downsampling strategy and API for the wheel data. Since the encoder reports absolute wheel position, we can simply decimate the data to 50 Hz by sampling the raw stream at fixed 20 ms intervals inside each 1 h chunk (no need for filtering).
This would be implemented as a property at the level of the reader.Encoder class so it can be toggled in the schema (or we can set the default to True).
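For illustration, the decimation could look roughly like this (a minimal sketch assuming data is the encoder dataframe indexed by timestamp; not the actual implementation):

import pandas as pd

def decimate_encoder(data: pd.DataFrame) -> pd.DataFrame:
    """Decimate absolute wheel position to 50 Hz by keeping the first
    sample in each 20 ms bin (no anti-alias filtering needed)."""
    return data.groupby(data.index.floor("20ms")).first()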
For the video metadata, one possibility we discussed to immediately gain some space is to drop the hw_counter column from the ingested data. We should be able to reconstruct dropped frames at both the TTL and software levels by comparing the camera timestamp with the harp timestamp directly, and it would give us back around 1/3 of the space.
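For illustration, dropped frames might be flagged from the harp timestamps alone, along these lines (a rough sketch assuming a nominal frame period; names hypothetical):

import numpy as np

def estimate_dropped_frames(harp_timestamps: np.ndarray, nominal_period: float = 0.02) -> int:
    """Estimate dropped frames from inter-frame gaps: a gap spanning
    k nominal periods implies k - 1 dropped frames."""
    gaps = np.diff(harp_timestamps)
    return int(np.round(gaps / nominal_period).sum()) - len(gaps)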
If this sounds good I can submit a PR soon.
@glopesdev this sounds good regarding the changes to Encoder - happy for that immediately. Regarding the video metadata, can we discuss the hw_counter in a bit more detail?
@ttngu207 @jkbhagatio @lochhh I have a working implementation of the downsampling operation at https://github.com/SainsburyWellcomeCentre/aeon_mecha/commit/3340df25977b1c0826dce02848e5713fcb68c618.
However, I am now conflicted as to whether having downsample as a property of the reader / stream is a good idea. This came up because of testing: the few tests we have of the low-level API were actually using encoder data on a chunk which has non-monotonic timestamps, and after resampling the behavior changes.
This makes sense, but at the same time it made me more aware that some of the more subtle timestamping issues were found by carefully inspecting this low-level raw data, and I would like to retain the ability to easily query the raw data if necessary.
The problem is that if we implement downsample as a property at the level of the schema stream, then whoever calls aeon.load with that stream has no way to see the raw data, since the flag is essentially frozen at schema creation time.
Therefore I would like to provide a way to override reader behavior at aeon.load time as well, when necessary. This made me think that a general way to support this could be to simply have readers accept optional **kwargs arguments, and have aeon.load forward any kwargs it receives to the read call of the reader passed in its function arguments.
This would solve my dilemma with the encoder testing and future debugging, and it also gives us a straightforward way to have reader options that can be toggled at aeon.load time, rather than only when configuring the reader at schema creation time.
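Roughly, the forwarding would look like this (a simplified sketch - the real aeon.load signature and chunk-file discovery are more involved; find_chunk_files is a hypothetical helper):

import pandas as pd

def load(root, reader, start=None, end=None, **kwargs):
    """Load chunk files for `reader`, forwarding extra kwargs to reader.read."""
    files = find_chunk_files(root, reader, start, end)  # hypothetical helper
    return pd.concat([reader.read(file, **kwargs) for file in files])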
As an illustration, this would give us the following signatures and call (downsampling by default):
def read(self, file, downsample=True):
"""Reads encoder data from the specified Harp binary file, and optionally downsamples
the frequency to 50Hz.
"""
# if we want to look at raw data we can optionally clear the downsample flag
patch_wheel = aeon.load(root, exp02.Patch2.Encoder, epoch=behavior_epoch, downsample=False)
Any thoughts on this? I've opened a PR with a proposed implementation.
@glopesdev Thanks, the implementation looks good to me.
I'm fine with downsample being an additional argument for load, but it seems a bit one-off and may break the elegance of the current implementation/usage of .load() (or not).
How about downsample still being a property of the reader, plus a get/set method for this property so it can be changed on the fly before calling load() (and not frozen at the time of schema creation)?
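For example, something along these lines (a rough sketch only, not actual code - the real reader.Encoder takes more arguments):

class Encoder:  # simplified sketch of the reader
    def __init__(self, pattern, downsample=True):
        self.pattern = pattern
        self._downsample = downsample

    @property
    def downsample(self):
        """Whether read() downsamples to 50 Hz; settable on the fly."""
        return self._downsample

    @downsample.setter
    def downsample(self, value):
        self._downsample = value

# e.g. toggle before calling load:
# exp02.Patch2.Encoder.downsample = False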
I've reviewed the PR; allowing additional kwargs to be passed on into load() is also very clean and potentially very useful!
Hi @glopesdev, this makes sense to me conceptually, but I might suggest using a specific arg name (that will be a dict) instead of kwargs, to differentiate args that will be passed to load alone vs. args that will be passed to read via load.
I think this also implies that the base class Reader's read should accept this new dict param, which I guess by default can be set to None or an empty dict?
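For instance (a hypothetical sketch; the name read_kwargs is just a placeholder, and find_chunk_files is a hypothetical helper):

import pandas as pd

class Reader:  # sketch of the base class only
    def read(self, file, read_kwargs=None):
        read_kwargs = read_kwargs or {}  # default to an empty dict
        raise NotImplementedError

def load(root, reader, start=None, end=None, read_kwargs=None):
    """Keep load's own args separate; pass read_kwargs through to reader.read."""
    files = find_chunk_files(root, reader, start, end)  # hypothetical helper
    return pd.concat([reader.read(file, read_kwargs=read_kwargs) for file in files])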
@jkbhagatio can you review directly in the PR? Just easier to iterate the code implementation there.
Data size stored in the MySQL server (on aeon-db2) is approaching 400GB. This will hit the storage size limit of the aeon-db2 server, at which point we will need to request a disk size increase. I'd like to discuss potential solutions to better engineer/optimize for storage space in our next data architecture meeting.
The bulk of the storage is in the streams schema, where we extract and store the data from relevant streams. Among these, the majority of the storage is in UndergroundFeederEncoder (the wheel data, ~100GB) and SpinnakerVideoSourceVideo (~80GB).