Added a StorageIndex for the source storage to reduce LIST calls. Addresses part of #543. The index is updated before every restructuring or cleaning operation. A full sync of the StorageIndex runs at a configurable interval; between full syncs, it only updates directories that have files in them. A separate interval can be set to also scan empty directories. The first implementation is only an InMemoryStorageIndex. For very large datasets, a file-based index might be needed. During partial updates, it uses the start-after parameter in S3 to list only files that sort after the last one scanned, that is, the newer files.
This was tested in radar-k3s-test, with the following results:
old behaviour: 128 list operations, every time
full scan (once per hour, configurable): 110 list operations
partial update (most frequent): 17 list operations
partial update including empty directories (once per 15 minutes, configurable): 97 list operations
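
The update logic described above can be illustrated with a minimal Python sketch. It is not the code in this PR: `SyncMode`, `choose_sync_mode`, `InMemoryIndexSketch`, and `make_fake_lister` are hypothetical names, and the fake lister only imitates how a start-after listing returns keys that sort after a given key. The default intervals (one hour, 15 minutes) are the ones used in the test above.

```python
from dataclasses import dataclass, field
from enum import Enum


class SyncMode(Enum):
    FULL = "full"                              # rescan everything
    PARTIAL_WITH_EMPTY = "partial_with_empty"  # also scan empty directories
    PARTIAL = "partial"                        # only directories with files


def choose_sync_mode(now, last_full, last_empty,
                     full_interval=3600.0, empty_interval=900.0):
    """Pick the sync mode from the time (in seconds) since earlier syncs."""
    if now - last_full >= full_interval:
        return SyncMode.FULL
    if now - last_empty >= empty_interval:
        return SyncMode.PARTIAL_WITH_EMPTY
    return SyncMode.PARTIAL


@dataclass
class InMemoryIndexSketch:
    """Hypothetical in-memory index: directory -> sorted list of file names."""
    files: dict = field(default_factory=dict)

    def update_directory(self, directory, list_after):
        """Partial update: ask only for names after the last one indexed."""
        known = self.files.setdefault(directory, [])
        start_after = known[-1] if known else None
        new_names = list_after(directory, start_after)
        known.extend(new_names)
        return len(new_names)


def make_fake_lister(store):
    """Stand-in for a storage listing call with start-after semantics."""
    calls = []

    def list_after(directory, start_after):
        calls.append((directory, start_after))
        names = sorted(store.get(directory, []))
        if start_after is None:
            return names
        # Only names that sort strictly after the given one are returned.
        return [name for name in names if name > start_after]

    return list_after, calls
```

The sketch relies on file names within a directory sorting in the order they are written; if that assumption does not hold, a start-after listing would miss files and a full scan is needed to catch them.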