OSOceanAcoustics / echodataflow

Orchestrated sonar data processing workflow
https://echodataflow.readthedocs.io/en/latest/
MIT License

Discussion on Enhancing Performance and Memory Efficiency in Data Processing Workflows #84

Open Sohambutala opened 4 months ago

Sohambutala commented 4 months ago

This ticket opens a discussion on improving performance and memory management in our data processing workflows. The current approach works, but it incurs significant I/O overhead because every stage has to write its output and the following stage has to read it back.

Current Challenge: Each stage of our workflow writes its output to disk, and the next stage reads it back. This is I/O intensive and becomes a performance bottleneck, primarily because data passed between Prefect flows has to be serialized.
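To put a rough number on this round trip, something like the following could be used. It is only a sketch under stated assumptions: `stage_one`, `stage_two`, and the use of `pickle` are hypothetical stand-ins for illustration, not the actual echodataflow stages or storage format.

```python
import pickle
import tempfile
import time
from pathlib import Path


def stage_one():
    # Hypothetical stage: produce some intermediate data.
    return {"ping": list(range(100_000))}


def stage_two(data):
    # Hypothetical stage: consume the intermediate data.
    return sum(data["ping"])


def run_via_disk(workdir):
    """Pass data between stages through a file; return (result, seconds spent on I/O)."""
    path = Path(workdir) / "stage_one.pkl"
    data = stage_one()
    start = time.perf_counter()
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    io_seconds = time.perf_counter() - start
    return stage_two(loaded), io_seconds


def run_in_memory():
    """Pass data between stages directly, with no serialization step."""
    return stage_two(stage_one())


with tempfile.TemporaryDirectory() as tmp:
    disk_result, io_seconds = run_via_disk(tmp)
memory_result = run_in_memory()
print(f"results match: {disk_result == memory_result}, I/O round trip: {io_seconds:.4f}s")
```

Running the same measurement against real intermediate outputs would show how much of a stage's wall time is spent on the write/read pair rather than on computation.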

Proposed Solutions for Discussion:

  1. Custom Serialization:

    Pros: Allows tighter control over how data is managed and passed between stages.
    Cons: Requires additional development effort and maintenance.
    Implementation: Develop a custom class (if required) that handles serialization of intermediate data efficiently.

  2. Compression:

    Pros: Reduces the physical size of serialized files, potentially decreasing I/O time.
    Cons: Introduces a trade-off between the time saved on I/O operations and the additional time required for compressing and decompressing data.
    Implementation: Implement compression algorithms suited to our data types and processing needs.

  3. Enhanced Caching Strategy:

    Pros: Minimizes disk I/O by keeping frequently used data in faster access storage areas.
    Cons: Complex implementation, especially when deciding what data remains in cache and what gets written to disk.
    Implementation: Utilize a hybrid caching system using Redis for in-memory caching and disk-based storage for less frequently accessed data. Consider writing a custom caching strategy tailored to predict and pre-load data required for upcoming stages.
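As a starting point for options 2 and 3, here is a minimal sketch of a two-tier cache that keeps recent items in memory and spills older ones to compressed files. Everything in it is an assumption made for illustration: the `HybridCache` class and its parameters are hypothetical, an `OrderedDict` stands in for Redis, `pickle` plus `zlib` stand in for whatever serialization and compression we would actually choose, and keys are assumed to be safe to use as file names.

```python
import pickle
import tempfile
import zlib
from collections import OrderedDict
from pathlib import Path


class HybridCache:
    """Keep the most recently used items in memory; spill the rest to compressed files."""

    def __init__(self, spill_dir, max_items=2, compress_level=1):
        self.spill_dir = Path(spill_dir)
        self.max_items = max_items
        self.compress_level = compress_level  # low level favors speed over size
        self._hot = OrderedDict()  # stand-in for an in-memory store such as Redis

    def _path(self, key):
        return self.spill_dir / f"{key}.pkl.z"

    def put(self, key, value):
        self._hot[key] = value
        self._hot.move_to_end(key)
        # Evict least recently used entries to disk once the memory tier is full.
        while len(self._hot) > self.max_items:
            old_key, old_value = self._hot.popitem(last=False)
            blob = zlib.compress(pickle.dumps(old_value), self.compress_level)
            self._path(old_key).write_bytes(blob)

    def get(self, key):
        if key in self._hot:
            self._hot.move_to_end(key)
            return self._hot[key]
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        value = pickle.loads(zlib.decompress(path.read_bytes()))
        self.put(key, value)  # promote back into the memory tier
        return value

    def in_memory(self, key):
        return key in self._hot


with tempfile.TemporaryDirectory() as tmp:
    cache = HybridCache(tmp, max_items=2)
    for name in ("stage_a", "stage_b", "stage_c"):  # hypothetical stage names
        cache.put(name, list(range(1000)))
    spilled = not cache.in_memory("stage_a")  # oldest entry was written to disk
    restored = cache.get("stage_a")           # read back and decompressed on demand
```

The eviction policy here is plain least-recently-used. A strategy that predicts and pre-loads data for upcoming stages would replace the logic in `put` and `get`, and the `compress_level` setting is where the compression time versus I/O time trade-off could be tuned.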

Any insights, comments, or criticism are welcome.