awslabs / athena-glue-service-logs

Glue scripts for converting AWS Service Logs for use in Athena
Apache License 2.0

Error: Container is running beyond physical memory limits #19

Closed: clifff closed this issue 5 years ago

clifff commented 5 years ago

I'm trying to run some analysis on a collection of S3 Access Logs, and set up a Glue job using the steps in the README to do so. The set of logs is about 14 GB over 12.8 million files. Whenever I kick off the job, it runs for about 13 minutes and then fails with a Command failed with exit code 1 message. Looking at the logs, I see this line that seems important:

Diagnostics: Container [pid=11027,containerID=container_1569865532923_0001_01_000001] is running beyond physical memory limits. Current usage: 5.5 GB of 5.5 GB physical memory used; 7.7 GB of 27.5 GB virtual memory used. Killing container.

This is corroborated by CloudWatch metrics, which show the driver memory usage steadily climbing and the executor staying low.

Based on the athena_glue_service_logs blog post here, it seems like my volume of data is well within the expected limits. I retried the job after adding the --conf parameter set to spark.yarn.executor.memoryOverhead=1G, but it failed in the same way.

Any advice for getting this to work is appreciated - otherwise I'll follow the Glue documentation's suggestion of writing a script to do the conversion using DynamicFrames.

dacort commented 5 years ago

Hi @clifff - there are a couple of things you can try here.

1) Change the worker type in the Glue job to one with larger memory (G.1X - see screenshot below).
2) Try increasing spark.yarn.executor.memoryOverhead even more, but there is only so far you can go with that.
3) You can also try increasing the driver memory, since you mention that's what is climbing. Set a parameter with the key --conf and the value spark.driver.memory=10g.

I would recommend trying the first option as that will inherently give you more memory to work with.

[Screenshot: Glue job worker type setting]
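For reference, the same settings can also be applied when starting the job programmatically. A minimal boto3 sketch, assuming the job already exists - the job name and worker count are placeholders, and passing Spark options through the --conf argument mirrors the approach discussed in this thread rather than a documented Glue parameter:

```python
import boto3

glue = boto3.client("glue")

# Start a run with a larger worker type (option 1) and a bigger driver heap
# (option 3). "s3_access_log_converter" and the worker count are placeholders.
response = glue.start_job_run(
    JobName="s3_access_log_converter",
    WorkerType="G.1X",
    NumberOfWorkers=10,
    Arguments={
        # Same mechanism as setting --conf in the console job parameters.
        "--conf": "spark.driver.memory=10g",
    },
)
print("Started run:", response["JobRunId"])
```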

clifff commented 5 years ago

Thanks for the tip @dacort! Didn't realize worker type was configurable like that. I upped to G.1X and let the job run again - churned for about 100 minutes before crashing again. Found this in the logs:

Log Contents:
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.job:Recurring run, only looking for recent partitions on raw catalog.
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 10754"...
os::fork_and_exec failed: Cannot allocate memory (12)
End of LogType:stdout

Which matches what CloudWatch is showing:

[Screenshot: CloudWatch memory profile]

It seems promising that it didn't hit a memory usage of 1 and immediately crash, but it does make me think configuring the driver to a specific amount isn't necessary. I went ahead and raised spark.yarn.executor.memoryOverhead to 2G and will try that out.

clifff commented 5 years ago

Confirmed - it timed out in about the same amount of time with the 2G setting.

dacort commented 5 years ago

OK, thanks for trying that @clifff - it looks like building up the list of those 13M files is taking up quite a lot of resources. Give me a few days to see if I can reproduce this in my own environment and see what options there might be. There's definitely still some more testing needed for these scripts at that scale.

clifff commented 5 years ago

Sounds good - thanks for looking into this @dacort! Happy to tweak settings/code and retry whenever.

clifff commented 5 years ago

@dacort - sorry to bump, but any update on this? Totally understand if not - I may have a go at loading these onto an EC2 instance with lots of RAM and digging out what I want with Unix tools.

dacort commented 5 years ago

Hey @clifff - unfortunately I haven't been able to take a much deeper look yet. How high did you bump spark.driver.memory?

A couple other options:

There is some more detail on debugging OOM issues here as well: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html#monitor-profile-debug-oom-driver

Edit:

I think you can specify the file grouping as an additional_options parameter to the from_catalog function. For example:

additional_options={"groupFiles": "inPartition"}
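In context, that might look roughly like the following (the database and table names are placeholders, not the names this library actually creates, and groupSize is optional):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw table while grouping many small files into larger input splits.
# "s3_access_raw" / "access_logs" are placeholder names.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="s3_access_raw",
    table_name="access_logs",
    additional_options={
        "groupFiles": "inPartition",  # group small files within each partition
        "groupSize": "134217728",     # aim for ~128 MB per group (optional)
    },
)
```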
clifff commented 5 years ago

No worries! I was actually successful in loading the logs onto an EC2 instance. It turns out the bucket inventory size was way off and it was more like 60 GB of logs... but the good news is I was able to filter them down to ~100 MB of relevant lines using ripgrep, and got the info I needed from there.

Will go ahead and close this for now, but feel free to re-open if you want to track the issue further.

dacort commented 5 years ago

👍 Sounds good, thanks!

dacort commented 5 years ago

I didn't realize you were just trying to do a one-time query. For future reference, this library creates two tables - one for the "raw" unconverted data and another for the "optimized" parquet data. This appears to have been failing during the conversion process, but you still could have queried the raw data. But ripgrep for the win! One of my favorite tools.
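For example, the raw table can be queried directly through Athena. A rough boto3 sketch - the database, table, column names, and results location below are all placeholders, not what this library actually creates:

```python
import boto3

athena = boto3.client("athena")

# Query the raw (unconverted) table directly. All names below are placeholders
# and depend on how the job was configured.
query = """
    SELECT request_uri, count(*) AS hits
    FROM "s3_access_raw"."access_logs"
    GROUP BY request_uri
    ORDER BY hits DESC
    LIMIT 20
"""
response = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-query-results/athena/"},
)
print("Query execution id:", response["QueryExecutionId"])
```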

RickardCardell commented 4 years ago

Hi, I've got a similar issue: I couldn't get the job to run on a dataset that isn't all that large - 25 GB of S3 access logs per day, for 30 days.

I've tried a number of different configurations, all with many combinations of memory settings, to no avail.

I instead went into the code and skipped the repartition stage (https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66). I also had to add spark.hadoop.fs.s3.maxRetries=20, since the job now makes quite a lot of S3 calls, which caused throttling.
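For illustration, a rough sketch of what that change amounts to - this is not the library's actual code; the partition columns, path, and the way the retry setting is applied are placeholders/assumptions:

```python
from pyspark.sql import DataFrame

# Note: the S3 retry limit would be raised separately, e.g. via the job's
# --conf parameter (spark.hadoop.fs.s3.maxRetries=20), since the larger
# number of output objects caused throttling.

def write_without_repartition(df: DataFrame, output_path: str) -> None:
    """Write converted data straight to Parquet, skipping the repartition.

    Each Spark task writes its own output file, so every day-partition ends
    up with more (smaller) objects, but the job avoids the expensive shuffle.
    The partition column names below are placeholders, not the library's
    actual schema.
    """
    (
        df.write
        .mode("overwrite")
        .partitionBy("year", "month", "day")
        .parquet(output_path)
    )
```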

The job succeeded with 100 'standard' workers after only 4 hours. The drawback, of course, is that more objects were created: between 50 and 140 per day-partition.

But for me at least it is better to have the jobs succeed than to have no log data at all. Also, for our use case, the Athena query performance will be good enough.

Q: Would it make sense to make the repartitioning configurable? Another option is to have a (separate?) step that reduces the number of objects, but more efficiently.
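A rough sketch of what the first option could look like - purely illustrative, using a hypothetical job argument that is not part of this library today:

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import DataFrame

# Hypothetical job argument: number of output files per partition,
# where 0 means "skip the repartition entirely".
args = getResolvedOptions(sys.argv, ["JOB_NAME", "output_files_per_partition"])
n_files = int(args["output_files_per_partition"])

def maybe_repartition(df: DataFrame) -> DataFrame:
    """Repartition only when a positive file count was requested."""
    return df.repartition(n_files) if n_files > 0 else df
```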

EDIT: added this as a separate issue instead: https://github.com/awslabs/athena-glue-service-logs/issues/21