RADAR-base / radar-output-restructure

Reads avro files in HDFS and outputs json or csv per topic per user in local file system
Apache License 2.0

Added a StorageIndex for the source storage to reduce LIST calls #547

Closed blootsvoets closed 9 months ago

blootsvoets commented 10 months ago

Added a StorageIndex for the source storage to reduce LIST calls. Addresses part of #543. The index is updated before every restructuring or cleaning operation. A full sync runs at a configurable interval; between full syncs, an update only rescans directories that already contain files. A separate, configurable interval controls how often empty directories are scanned as well. The first implementation is only an InMemoryStorageIndex; for very large datasets, a file-based index might be needed. During partial updates, it uses the start-after flag in S3 to list only files newer than the last one scanned.
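To illustrate the idea, here is a minimal Python sketch of the three update modes. It is not the code in this PR: `FakeStorage`, `SketchIndex`, and their methods are hypothetical names, the storage is an in-memory stand-in that only counts LIST calls, and the sketch assumes new subdirectories are discovered only during a full sync.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class FakeStorage:
    """Hypothetical stand-in for an object store that counts LIST calls."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = sorted(keys)
        self.list_calls = 0

    def add(self, key: str) -> None:
        bisect.insort(self.keys, key)

    def list(self, directory: str,
             start_after: Optional[str] = None) -> Tuple[List[str], List[str]]:
        """Return (subdirectories, files) directly under `directory`.

        Files are only returned if their key sorts after `start_after`,
        loosely mimicking the S3 start-after parameter.
        """
        self.list_calls += 1
        prefix = directory + "/" if directory else ""
        subdirs, files = set(), []
        for key in self.keys:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if "/" in rest:
                subdirs.add(prefix + rest.split("/", 1)[0])
            elif start_after is None or key > start_after:
                files.append(key)
        return sorted(subdirs), files


class SketchIndex:
    """Hypothetical in-memory index: directory -> sorted file keys."""

    def __init__(self, storage: FakeStorage, root: str = "") -> None:
        self.storage = storage
        self.root = root
        self.files: Dict[str, List[str]] = {}

    def full_sync(self) -> None:
        """Rebuild the index by listing every directory once."""
        self.files = {}
        self._scan(self.root)

    def _scan(self, directory: str) -> None:
        subdirs, files = self.storage.list(directory)
        self.files[directory] = files
        for subdir in subdirs:
            self._scan(subdir)

    def partial_sync(self, include_empty: bool = False) -> None:
        """List only known directories, starting after the last known file.

        Directories without files are skipped unless `include_empty` is set.
        """
        for directory, known in self.files.items():
            if not known and not include_empty:
                continue
            last = known[-1] if known else None
            _, new_files = self.storage.list(directory, start_after=last)
            known.extend(new_files)
```

With three files spread over two leaf directories, a full sync in this sketch costs five LIST calls (root, two topic directories, two leaf directories), a partial update costs two, and a partial update that includes empty directories costs five again but returns only files it has not seen before.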

This is tested in radar-k3s-test and gives the following results:

- old behaviour: 128 list operations, every time
- full scan (once per hour, configurable): 110 list operations
- partial update (most frequent): 17 operations
- partial update including empty directories (once per 15 minutes, configurable): 97 operations
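For a rough sense of scale, the reported counts can be turned into relative savings against the old behaviour. The snippet below only does arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# LIST operation counts as reported for the radar-k3s-test run.
baseline = 128  # old behaviour
results = {
    "full scan": 110,
    "partial update": 17,
    "partial update incl. empty directories": 97,
}

# Percentage of LIST operations saved relative to the old behaviour.
savings = {
    name: round(100 * (baseline - ops) / baseline, 1)
    for name, ops in results.items()
}

for name, pct in savings.items():
    print(f"{name}: {pct}% fewer LIST operations")
```

On these numbers, the most frequent case (a partial update) saves about 86.7% of LIST operations, while a full scan saves about 14.1%.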