awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Passing additional Python modules to the job - ModuleNotFoundError #173

Open mkangoor opened 1 year ago

mkangoor commented 1 year ago

I'm trying to run a Glue job (version 4) that performs simple batch data processing. I'm using additional Python libraries that the Glue environment doesn't provide: translate and langdetect. Additionally, although the Glue environment ships with the nltk package, importing it fails with errors about missing dependencies (e.g. regex._regex, _sqlite3).
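To narrow down which modules are actually visible to the job's interpreter, a small check can be placed at the top of the job script. This is only a minimal sketch: the helper name `find_missing_modules` is hypothetical, and the module names are the ones mentioned above.

```python
import importlib.util


def find_missing_modules(names):
    """Return the subset of `names` that the current interpreter cannot locate."""
    missing = []
    for name in names:
        try:
            # find_spec returns None when a top-level module is not found;
            # for dotted names it imports the parent first, which may raise.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Modules from this report; the output depends on the environment it runs in.
print(find_missing_modules(["translate", "langdetect", "nltk", "regex._regex", "_sqlite3"]))
```

Logging this list at job start shows whether the packages were never installed or were installed somewhere the interpreter does not search.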

I tried a few solutions to achieve my goal:

  1. using --extra-py-files, pointing to an S3 path where I uploaded one of:
    • a .zip file containing the translate and langdetect Python packages
    • a directory with the already unzipped packages
    • the packages themselves in .whl format (along with their dependencies)
  2. using --additional-python-modules, either:
    • pointing to an S3 path where I uploaded the packages in .whl format (along with their dependencies)
    • or just naming the packages to be installed inside the Glue environment via pip3
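For reference, the two job parameters from the list above both take comma-separated values. The sketch below assembles them into a dictionary; the helper `build_glue_arguments` and the S3 path are hypothetical and not taken from the actual job.

```python
def build_glue_arguments(modules=(), extra_py_files=()):
    """Build a Glue job-argument dict; both parameters take comma-separated values."""
    args = {}
    if modules:
        # Package names (or S3 paths to wheels) for Glue to install with pip
        args["--additional-python-modules"] = ",".join(modules)
    if extra_py_files:
        # S3 paths that Glue adds to the Python path before running the script
        args["--extra-py-files"] = ",".join(extra_py_files)
    return args


default_arguments = build_glue_arguments(
    modules=["translate", "langdetect"],
    extra_py_files=["s3://my-bucket/libs/helpers.zip"],  # hypothetical path
)
print(default_arguments)
```

A dictionary like this can be supplied as `DefaultArguments` when creating the job with boto3's `create_job`, or as `Arguments` to `start_job_run`.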

Additionally, I followed a few valuable sources to overcome the issue of ModuleNotFoundError:

I also tried switching between Glue versions 4 and 3, with no luck. It seems like a bug. The Glue role has been granted all permissions needed to read the S3 bucket. The job script runs on Python 3, the same Python version as the libraries I'm trying to install.