Hi @clifff - there are a couple of things you can try here.
1) Change the worker type in the Glue job to one with more memory (G.1X).
2) Try increasing spark.yarn.executor.memoryOverhead even more, but there is only so far you can go with that.
3) You can also try increasing the driver memory, since you mention that's what is climbing. Set a parameter with the key --conf and the value spark.driver.memory=10g.
I would recommend trying the first option, as that will inherently give you more memory to work with.
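For reference, a minimal sketch of passing that key/value pair programmatically for a single run, using boto3's start_job_run with a hypothetical job name (the same pair can also be set as a job parameter in the console):

import boto3

glue = boto3.client("glue")

# Override the Spark driver memory for this run only by passing the --conf
# job argument at start time. The job name is a placeholder.
glue.start_job_run(
    JobName="athena-glue-service-logs-s3",  # hypothetical job name
    Arguments={"--conf": "spark.driver.memory=10g"},
)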
Thanks for the tip @dacort! Didn't realize the worker type was configurable like that. I upped it to G.1X and let the job run again - it churned for about 100 minutes before crashing again. Found this in the logs:
Log Contents:
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.job:Recurring run, only looking for recent partitions on raw catalog.
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 10754"...
os::fork_and_exec failed: Cannot allocate memory (12)
End of LogType:stdout
Which matches what CloudWatch is showing.
It seems promising that it didn't hit a memory usage of 1 and immediately crash, but that does make me think it's not necessary to configure the driver to a specific amount. Went ahead and raised spark.yarn.executor.memoryOverhead to 2G and will try that out.
Confirmed - it timed out in about the same amount of time with the 2G setting.
OK, thanks for trying that @clifff - it looks like building up the list of those 13M files is taking up quite a lot of resources. Give me a few days to see if I can reproduce this in my own environment and see what options there might be. There's definitely still some more testing to do for these scripts at that scale.
Sounds good - thanks for looking into this @dacort! Happy to tweak settings/code and retry whenever.
@dacort - sorry to bump, but any update on this? Totally understand if not - I may take a go at loading these up on an EC2 instance with lots of RAM and attempting to dig at what I want w/ unix tools.
Hey @clifff - unfortunately I haven't been able to take a much deeper look. How high did you bump spark.driver.memory?
A couple other options:
1) Set --conf spark.yarn.executor.memory=1g and keep increasing it.
2) In converter.py, I'm not sure what the options are to read from the catalog, but you can test reading with an explicit S3 path to at least see if it works.
There is some more detail on debugging OOM issues here as well: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html#monitor-profile-debug-oom-driver
Edit: I think you can specify the file grouping as an additional_options parameter to the from_catalog function. For example: additional_options={"groupFiles": "inPartition"}
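As a minimal sketch of where that parameter would go (the database and table names here are placeholders, not the ones the scripts actually create):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Group many small S3 objects into fewer Spark tasks while reading.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="aws_service_logs",     # placeholder database name
    table_name="s3_access_raw",      # placeholder table name
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "134217728",    # optional: target roughly 128 MB per group
    },
)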
No worries! I actually was successful loading the logs onto an EC2 instance. Turns out the bucket inventory size was way off and it was more like 60 GB of logs... but the good news is I was able to filter it down to ~100 MB of relevant lines using ripgrep, and got the info I needed from there.
Will go ahead and close this for now, but feel free to re-open if you want to track the issue further.
👍 Sounds good, thanks!
I didn't realize you were just trying to do a one-time query. For future reference, this library creates two tables - one for the "raw" unconverted data and another for the "optimized" Parquet data. This appears to have been failing during the conversion process, but you still could have queried the raw data. But ripgrep for the win! One of my favorite tools.
Hi, I've got a similar issue: I couldn't get the job to run on a not-that-large dataset of 25 GB of S3 access logs per day, for 30 days.
I've tried many combinations of memory settings, all to no avail.
I instead went to the code and skipped the repartition stage: https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66
I also had to add spark.hadoop.fs.s3.maxRetries=20, since it now makes quite a lot of S3 calls, which caused throttling.
The job succeeded with 100 'standard' workers after only 4 hours. The drawback is of course that more objects were created: between 50 and 140 per day partition.
But for me at least it is better to have the jobs succeeding than to have no log data at all. Also, for our use case, the Athena query performance will be good enough.
Q: Would it make sense to make the repartitioning configurable? Another option is to have a (separate?) step that reduces the number of objects, but does so more efficiently.
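A rough sketch of what a configurable repartition could look like - the job flag and function names here are hypothetical, not how converter.py is actually structured:

import sys

# Hypothetical opt-out flag passed as a Glue job parameter, e.g. --skip_repartition.
SKIP_REPARTITION = "--skip_repartition" in sys.argv

def write_converted(data_frame, output_path, partition_columns, num_output_files=10):
    # Only shrink to a fixed number of output files when repartitioning is enabled;
    # otherwise keep Spark's natural partitioning and accept more (smaller) objects.
    if not SKIP_REPARTITION:
        data_frame = data_frame.repartition(num_output_files)
    (data_frame.write
        .mode("append")
        .partitionBy(*partition_columns)
        .parquet(output_path))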
EDIT: added this as a separate issue instead: https://github.com/awslabs/athena-glue-service-logs/issues/21
I'm trying to run some analysis on a collection of S3 Access Logs, and set up a Glue job using the steps in the README to do so. The set of logs is about 14 GB over 12.8 million files. Whenever I kick off the job, it runs for about 13 minutes and then fails with a Command failed with exit code 1 message. Looking at the logs, I see this line that seems important:
This is corroborated by CloudWatch metrics, which show the driver memory usage steadily climbing and the executor staying low.
Based on the athena_glue_service_logs blog post here, it seems like my volume of data is well within the expected limits. I retried the job after adding the --conf parameter set to spark.yarn.executor.memoryOverhead=1G, but it failed in the same way. Any advice for getting this to work is appreciated - otherwise I'll follow the Glue documentation suggestion of writing a script to do the conversion using DynamicFrames.
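For anyone landing here later, a minimal sketch of that DynamicFrame-based conversion, assuming placeholder database, table, and bucket names and reading the raw table these scripts register:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw access-log table with file grouping enabled so millions of tiny
# objects are batched into fewer Spark tasks.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="aws_service_logs",      # placeholder database name
    table_name="s3_access_raw",       # placeholder table name
    additional_options={"groupFiles": "inPartition"},
)

# Write the data back out as Parquet for querying with Athena.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-converted-logs-bucket/s3_access/"},
    format="parquet",
)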