RADAR-base / radar-output-restructure

Reads Avro files in HDFS and outputs JSON or CSV per topic per user in the local file system
Apache License 2.0

Optimise to make fewer S3 API calls #543

Open yatharthranjan opened 1 year ago

yatharthranjan commented 1 year ago

It could be optimised by:

  1. Not rescanning the full directory hierarchy to find new topics on every run, but limiting that to once per 15 minutes (configurable).
  2. Storing the actual file directories (partition=0, etc.) to check for updates, instead of the directory one level above them. Right now only the topic directory is stored.
  3. Keeping the last scanned object in memory, then listing only newly added files using the start-after parameter of ListObjectsV2 instead of listing all files. Note that you will still want to do a full scan sometimes, to avoid skipping files that were added at a later time or were not deemed complete by the cleaner.
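
A minimal sketch of point 3, in Python for brevity and not taken from the project's code. The names `IncrementalScanner`, `ListFn`, `make_fake_lister` and the `full_scan_interval` parameter are hypothetical. The listing function is injected so the example runs without S3; with boto3 it could wrap `list_objects_v2(Bucket=..., Prefix=..., StartAfter=...)`. It relies on ListObjectsV2 returning keys in ascending key order.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical signature: (prefix, start_after) -> keys under prefix that sort
# strictly after start_after, in ascending order (mirrors ListObjectsV2).
ListFn = Callable[[str, Optional[str]], List[str]]


class IncrementalScanner:
    """Remembers the last key seen per directory and lists only newer keys."""

    def __init__(self, list_fn: ListFn, full_scan_interval: float,
                 clock: Callable[[], float] = time.monotonic):
        self._list_fn = list_fn
        self._full_scan_interval = full_scan_interval  # seconds, assumed configurable
        self._clock = clock
        self._last_key: Dict[str, str] = {}
        self._last_full_scan: Dict[str, float] = {}

    def new_keys(self, prefix: str) -> List[str]:
        now = self._clock()
        last_full = self._last_full_scan.get(prefix)
        # Fall back to a full listing on first use and once per interval, so
        # that keys sorting before the remembered one are eventually seen.
        full_scan = last_full is None or now - last_full >= self._full_scan_interval
        start_after = None if full_scan else self._last_key.get(prefix)
        keys = self._list_fn(prefix, start_after)
        if full_scan:
            self._last_full_scan[prefix] = now
        if keys:
            # Keys arrive sorted, so the last one is the largest in this batch.
            self._last_key[prefix] = max(self._last_key.get(prefix, ""), keys[-1])
        return keys


def make_fake_lister(store: List[str]) -> ListFn:
    """In-memory stand-in for S3, used only to exercise the scanner."""
    def list_fn(prefix: str, start_after: Optional[str]) -> List[str]:
        return sorted(k for k in store
                      if k.startswith(prefix)
                      and (start_after is None or k > start_after))
    return list_fn
```

A full scan returns keys that were already seen, so the caller would still need to skip files it has processed before. The saving comes from the incremental calls in between, which list only keys after the remembered one.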

keyvaann commented 10 months ago

@blootsvoets if you have time, could you have a look here to see if there is an easy way to reduce S3 API calls?