aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Command failed with exit code 10 on Glue Job #1176

Closed · kev-dfs closed 2 years ago

kev-dfs commented 2 years ago

Hi everyone,

I wrote a data-processing job in a Jupyter notebook (SageMaker) using the awswrangler library. The code works perfectly in that environment, but when I run it on Glue, the job fails with: Command failed with exit code 10. According to the Knowledge Center, this error indicates an out-of-memory condition. I ran a memory profile to check how much memory the process uses and found that it reaches 25 GB inside a pandas.merge call, because the DataFrames are very large (more than 10 GB each). I then tried converting some columns to the category dtype to reduce memory usage, but after the merge ran, those categorical dtypes were lost. How can I improve this? Would it be better to rewrite everything as a Spark job? I imagine someone else has run into this problem and solved it.

Please, I need guidance. Thank you.
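The "lost" categories after the merge are a known pandas behavior rather than an awswrangler issue: when each frame's key column is cast to category independently, the two sides end up with different category sets, and pandas coerces the merge key back to a plain dtype in the result. A minimal sketch of the problem and a workaround (toy data made up for illustration, not from the job in question), building one shared `CategoricalDtype` over the union of both key columns before merging:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Toy stand-ins for the two large DataFrames (made-up data).
left = pd.DataFrame({"key": ["a", "b", "c"] * 4, "x": range(12)})
right = pd.DataFrame({"key": ["b", "c", "d"] * 4, "y": range(12)})

# Casting each side independently yields *different* category sets
# ({a, b, c} vs {b, c, d}), so pandas does not preserve the
# categorical dtype of the merge key in the result.
left["key"] = left["key"].astype("category")
right["key"] = right["key"].astype("category")
naive = left.merge(right, on="key")
print(naive["key"].dtype)  # not category: the dtype was lost

# Workaround: build one dtype over the union of both category sets
# and apply it to both sides *before* merging. With identical
# categorical dtypes on both keys, pandas keeps the dtype.
shared = union_categoricals([left["key"], right["key"]]).dtype
left["key"] = left["key"].astype(shared)
right["key"] = right["key"].astype(shared)
merged = left.merge(right, on="key")
print(merged["key"].dtype)  # category
```

This keeps the memory savings of categoricals through the merge itself, though with 10 GB+ inputs the join may still not fit on a single Glue Python shell worker.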

jaidisido commented 2 years ago

It does seem that you are running into memory issues. I assume you are already using 1 DPU instead of the default 0.0625 DPU for Glue Python shell jobs? If so, then you would probably need to use PySpark instead; the library is not distributed at the moment, I'm afraid.
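For reference, a sketch of what setting the 1 DPU capacity looks like when defining a Python shell job via the AWS CLI; the job name, role ARN, and script location below are placeholders, not values from this thread:

```shell
# Python shell jobs accept MaxCapacity of either 0.0625 (default) or 1.
# --role and ScriptLocation are placeholder values for illustration.
aws glue create-job \
  --name my-wrangler-job \
  --role arn:aws:iam::123456789012:role/MyGlueRole \
  --command Name=pythonshell,ScriptLocation=s3://my-bucket/job.py,PythonVersion=3.9 \
  --max-capacity 1
```

Note that 1 DPU provides 16 GB of memory, so a merge peaking at 25 GB would still not fit even at the maximum Python shell capacity, which is why the recommendation above is to move to PySpark for frames this large.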