Hi
We have S3 access logs stored in S3, and we tried to use this library, but we were unable to get the job to run on that dataset: 25 GB of S3 access logs per day, for 30 days.
I've tried with:
- 150 standard DPUs
- 100 G.1X workers
- 50 G.2X workers
all with many combinations of memory settings, to no avail.

I instead went into the code and skipped the repartition stage: https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66. I also had to add `spark.hadoop.fs.s3.maxRetries=20`, since the job now makes quite a lot of S3 calls, which caused throttling.

With that change, the job succeeded with 100 'standard' workers after only 4 hours.
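For reference, a minimal sketch of how the retry setting can be raised from inside the Glue script itself (this assumes the standard Glue PySpark boilerplate; the value 20 is just what worked for this dataset, not a tuned recommendation):

```python
from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Raise the S3 retry limit before any writes happen. Skipping the
# repartition step means many more small PUTs per partition, which is
# what triggered the throttling in the first place.
conf = SparkConf()
conf.set("spark.hadoop.fs.s3.maxRetries", "20")

sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session
```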
The drawback, of course, is that more objects were created: between 50 and 140 per day-partition. For smaller datasets the number of files is even higher: some thousands.
But for us, at least, it is better to have the jobs succeed than to have no log data at all. Also, for our use case, Athena query performance will be good enough.
Would it make sense to make the repartitioning step configurable, i.e. to be able to skip it?
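A rough sketch of what I have in mind (the function name and signature below are made up for illustration; the real repartition call lives in converter.py around line 66):

```python
# Hypothetical opt-out flag on the conversion step.
def convert_and_write(df, output_path, partition_keys, repartition=True):
    if repartition:
        # current behaviour: shuffle so each day-partition ends up
        # with few, large Parquet files
        df = df.repartition(*partition_keys)
    # with repartition=False the job writes whatever file count the
    # upstream partitioning produced: more objects, but no huge shuffle
    df.write.mode("append").partitionBy(*partition_keys).parquet(output_path)
```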
I can foresee that someone will suggest using `coalesce` instead of `repartition`. I have already tried that, and it failed as well.
Another option would be a (separate?) step that reduces the number of objects, but more efficiently.
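For example, something like a per-partition compaction pass that runs after the conversion job. The bucket, prefixes, and target file count below are purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative compaction pass: rewrite one already-converted
# day-partition with a small, fixed file count.
def compact_partition(src, dst, target_files=16):
    df = spark.read.parquet(src)
    # coalescing a single ~25 GB day-partition is a narrow, cheap
    # operation compared to repartitioning the full 30-day dataset
    df.coalesce(target_files).write.mode("overwrite").parquet(dst)

compact_partition(
    "s3://example-bucket/converted/year=2019/month=06/day=01/",
    "s3://example-bucket/compacted/year=2019/month=06/day=01/",
)
```

Because each day-partition can be processed independently, this pass would also be trivially retryable per partition instead of failing the whole job.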