This ticket aims to open a discussion on potential improvements in performance and memory management within our data processing workflows. The current approach, although functional, incurs significant I/O overhead because each stage must write its outputs to disk for subsequent stages to read back.
Current Challenge: Each stage of our workflow currently writes its output to disk, which the next stage then reads back. This round trip is not only I/O intensive but also a performance bottleneck, largely due to the serialization required to pass data between Prefect flows.
Proposed Solutions for Discussion:
Custom Serialization:
Pros: Allows tighter control over how data is managed and passed between stages.
Cons: Requires additional development effort and maintenance.
Implementation: Develop a custom class (if required) that handles serialization of intermediate data efficiently.
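As a starting point for discussion, a minimal sketch of such a class is below. It wraps pickle behind a small `dumps`/`loads` interface (the class name and interface are illustrative, not an existing Prefect API), so the underlying format could later be swapped for something faster or more compact (e.g. Arrow or msgpack) without touching stage code:

```python
import pickle
from typing import Any


class IntermediateSerializer:
    """Sketch of a custom serializer for intermediate stage outputs.

    Centralizes how data is encoded between stages so the format can
    be changed in one place.
    """

    def dumps(self, obj: Any) -> bytes:
        # The highest pickle protocol is the most efficient for large
        # built-in containers and NumPy-style buffers.
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

    def loads(self, blob: bytes) -> Any:
        return pickle.loads(blob)


serializer = IntermediateSerializer()
payload = serializer.dumps({"rows": [1, 2, 3]})
restored = serializer.loads(payload)
```

Keeping the interface this narrow is deliberate: stages depend only on `dumps`/`loads`, so benchmarking alternative formats becomes a one-class change.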
Compression:
Pros: Reduces the physical size of serialized files, potentially decreasing I/O time.
Cons: Introduces a trade-off between the time saved on I/O operations and the additional time required for compressing and decompressing data.
Implementation: Implement compression algorithms suited to our data types and processing needs.
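For concreteness, a sketch using the standard library's gzip (function names are illustrative). Whether this wins overall depends on our data: the compression level is the knob for trading CPU time against file size, and for throughput-sensitive paths a faster codec such as lz4 or zstandard would be worth benchmarking:

```python
import gzip
import pickle
from typing import Any


def dump_compressed(obj: Any, path: str, level: int = 6) -> None:
    """Serialize and gzip-compress an object to disk in one pass.

    level ranges 1 (fastest) to 9 (smallest); 6 is gzip's default
    trade-off between CPU time and output size.
    """
    with gzip.open(path, "wb", compresslevel=level) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path: str) -> Any:
    """Decompress and deserialize an object written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

Streaming through `gzip.open` rather than compressing a fully materialized byte string also keeps peak memory bounded for large intermediates.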
Enhanced Caching Strategy:
Pros: Minimizes disk I/O by keeping frequently used data in faster access storage areas.
Cons: Implementation is complex, particularly the policy for deciding which data stays in cache and which gets written to disk.
Implementation: Utilize a hybrid caching system using Redis for in-memory caching and disk-based storage for less frequently accessed data. Consider writing a custom caching strategy tailored to predict and pre-load data required for upcoming stages.
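To make the hybrid idea concrete, here is a minimal two-tier sketch. A plain dict stands in for the Redis tier (in a real deployment it would be replaced by `redis.Redis` `set`/`get` calls); writes go through to disk so evicted hot entries are never lost, and disk hits are promoted back into the hot tier. The class name, FIFO eviction policy, and capacity default are all assumptions for illustration:

```python
import pickle
from pathlib import Path
from typing import Any, Optional


class HybridCache:
    """Sketch of a two-tier cache: a hot in-memory tier (a dict here,
    standing in for Redis) write-through-backed by a durable disk tier."""

    def __init__(self, cache_dir: str, hot_capacity: int = 64):
        self._hot: dict[str, bytes] = {}
        self._hot_capacity = hot_capacity
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, value: Any) -> None:
        blob = pickle.dumps(value)
        # Write-through: disk is the durable tier, so hot-tier
        # eviction can never lose data.
        (self._dir / key).write_bytes(blob)
        self._cache_hot(key, blob)

    def get(self, key: str) -> Optional[Any]:
        blob = self._hot.get(key)
        if blob is None:  # hot miss: fall back to the disk tier
            path = self._dir / key
            if not path.exists():
                return None
            blob = path.read_bytes()
            self._cache_hot(key, blob)  # promote on access
        return pickle.loads(blob)

    def _cache_hot(self, key: str, blob: bytes) -> None:
        # Dicts preserve insertion order, so popping the first key
        # gives simple FIFO eviction; an LRU or predictive policy
        # could replace this.
        if len(self._hot) >= self._hot_capacity and self._hot:
            self._hot.pop(next(iter(self._hot)))
        self._hot[key] = blob
```

The predictive pre-loading mentioned above would slot in as another method that warms `_hot` with the keys the next stage is expected to read.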
Any insights, comments, or criticism are welcome.