aws-samples / amazon-sagemaker-local-mode

Amazon SageMaker Local Mode Examples
MIT No Attribution

Missing resourceconfig.json #6

Closed danibendi-edgify closed 2 years ago

danibendi-edgify commented 3 years ago

Hi Eitan,

I'm having trouble running the PyTorch script locally. I'm getting the following error:

Training and evaluation datasets exist
Starting model training
Creating kachor9sfh-algo-1-uuf2f ...
Attaching to kachor9sfh-algo-1-uuf2f
kachor9sfh-algo-1-uuf2f | Reporting training FAILURE
kachor9sfh-algo-1-uuf2f | framework error:
kachor9sfh-algo-1-uuf2f | Traceback (most recent call last):
kachor9sfh-algo-1-uuf2f |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/trainer.py", line 66, in train
kachor9sfh-algo-1-uuf2f |     env = environment.Environment()
kachor9sfh-algo-1-uuf2f |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/environment.py", line 498, in __init__
kachor9sfh-algo-1-uuf2f |     resource_config = resource_config or read_resource_config()
kachor9sfh-algo-1-uuf2f |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/environment.py", line 239, in read_resource_config
kachor9sfh-algo-1-uuf2f |     return _read_json(resource_config_file_dir)
kachor9sfh-algo-1-uuf2f |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/environment.py", line 191, in _read_json
kachor9sfh-algo-1-uuf2f |     with open(path, "r") as f:
kachor9sfh-algo-1-uuf2f | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
kachor9sfh-algo-1-uuf2f |
kachor9sfh-algo-1-uuf2f | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
kachor9sfh-algo-1-uuf2f exited with code 2

I'm running on Ubuntu 18.04.

Any ideas?

eitansela commented 3 years ago

Tried it on a fresh Ubuntu 18.04 EC2 instance. Ran:

pip install boto3 sagemaker pandas matplotlib torch torchvision
pip install 'sagemaker[local]'

It worked flawlessly.

One possibility is that the SageMaker SDK is old and is therefore fetching an old Docker image. Can you please post the output of the pip3 freeze command?

Can you also post the output of the docker images command?
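A long pip3 freeze listing is easier to read once the packages that matter for local mode are pulled out. The following is a minimal sketch of that filtering; the function name pinned_versions and the default package list are illustrative choices and are not part of the SageMaker SDK.

```python
def pinned_versions(freeze_text, names=("sagemaker", "boto3", "docker", "docker-compose")):
    """Return {package: version} for the requested packages found in `pip freeze` text.

    `names` is a hypothetical default list chosen for this thread, not an official one.
    """
    wanted = {n.lower() for n in names}
    found = {}
    for line in freeze_text.splitlines():
        # Only "name==version" lines are handled; anything else is skipped.
        name, sep, version = line.strip().partition("==")
        if sep and name.lower() in wanted:
            found[name.lower()] = version
    return found


# Usage with a few lines in the same format as pip3 freeze output.
sample = "boto3==1.16.56\nsagemaker==2.23.5\ntorch==1.7.1\n"
print(pinned_versions(sample))
```

Lines that are not in name==version form (for example editable installs) are ignored by this sketch.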

danibendi-edgify commented 3 years ago

pip3 freeze
attrs==20.3.0
bcrypt==3.2.0
boto3==1.16.56
botocore==1.19.56
cached-property==1.5.2
certifi==2020.12.5
cffi==1.14.4
chardet==4.0.0
click==7.1.2
cryptography==3.3.1
cycler==0.10.0
distro==1.5.0
docker==4.4.1
docker-compose==1.27.4
dockerpty==0.4.1
docopt==0.6.2
Flask==1.1.1
gevent==21.1.1
google-pasta==0.2.0
greenlet==1.0.0
gunicorn==20.0.4
idna==2.10
importlib-metadata==3.4.0
inotify-simple==1.2.1
itsdangerous==1.1.0
Jinja2==2.11.2
jmespath==0.10.0
jsonschema==3.2.0
kiwisolver==1.3.1
MarkupSafe==1.1.1
matplotlib==3.3.3
numpy==1.19.5
packaging==20.8
pandas==1.2.0
paramiko==2.7.2
Pillow==8.1.0
protobuf==3.14.0
protobuf3-to-dict==0.1.5
psutil==5.8.0
pycparser==2.20
PyNaCl==1.4.0
pyparsing==2.4.7
pyrsistent==0.17.3
python-dateutil==2.8.1
python-dotenv==0.15.0
pytz==2020.5
PyYAML==5.3.1
requests==2.25.1
retrying==1.3.3
s3transfer==0.3.4
sagemaker==2.23.5
scipy==1.6.0
six==1.15.0
smdebug-rulesconfig==1.0.1
texttable==1.6.3
torch==1.7.1
torchvision==0.8.2
typing==3.7.4.3
typing-extensions==3.7.4.3
urllib3==1.26.2
websocket-client==0.57.0
Werkzeug==1.0.1
zipp==3.4.0
zope.event==4.5.0
zope.interface==5.2.0

docker images
REPOSITORY                                                      TAG             IMAGE ID       CREATED        SIZE
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training   1.4.0-cpu-py3   9aae29fca0da   7 months ago   3.68GB

eitansela commented 3 years ago

The problem seems to be the pytorch-training Docker image: it is version 1.4.0, which is quite old (7 months). This example sets framework_version='1.6.0' in the PyTorch Estimator, so the pytorch-training Docker image should be:

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training   1.6.0-cpu-py3   10bf6569f174   8 weeks ago   3.92GB

Did you modify the framework_version in your code to 1.4.0?
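To make the link between framework_version and the image tag concrete, here is a minimal sketch that builds the image reference a given version would be expected to use and checks it against docker images output. The tag pattern (version, then cpu, then py3) is inferred only from the two tags that appear in this thread, so treat it as an assumption; the helper names are hypothetical and this is not how the SageMaker SDK resolves images internally.

```python
# Registry host as it appears in the docker images output in this thread.
REGISTRY = "763104351884.dkr.ecr.us-east-1.amazonaws.com"


def expected_training_image(framework_version, device="cpu", py_version="py3"):
    """Build 'repository:tag' for a PyTorch training image.

    Assumption: tags follow '<version>-<device>-<py>' as in '1.6.0-cpu-py3'.
    """
    return f"{REGISTRY}/pytorch-training:{framework_version}-{device}-{py_version}"


def local_image_refs(docker_images_output):
    """Parse the text printed by `docker images` into a set of 'repository:tag' strings."""
    refs = set()
    for line in docker_images_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 2:
            refs.add(f"{parts[0]}:{parts[1]}")
    return refs


# Usage with the listing posted earlier in this thread.
listing = (
    "REPOSITORY TAG IMAGE ID CREATED SIZE\n"
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training "
    "1.4.0-cpu-py3 9aae29fca0da 7 months ago 3.68GB\n"
)
wanted = expected_training_image("1.6.0")
print(wanted, "present locally:", wanted in local_image_refs(listing))
```

With only the 1.4.0 image in the listing, the 1.6.0 reference is reported as not present locally, which matches the mismatch described above.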

danibendi-edgify commented 3 years ago

Hi Eitan,

You're right, I had old code because I used this repo: https://github.com/eitansela/eitans-sagemaker-examples, which is not as up to date as this one. I pulled this repo and ran the updated code. I'm still getting the error, and now others as well (a missing 'changehostname.o' file):

INFO:sagemaker.local.image:docker command: docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3
INFO:sagemaker.local.image:image pulled: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3
Creating 4j1p4rtc47-algo-1-pc7sm ...
Attaching to 4j1p4rtc47-algo-1-pc7sm
4j1p4rtc47-algo-1-pc7sm | jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
4j1p4rtc47-algo-1-pc7sm | changehostname.c: In function ‘gethostname’:
4j1p4rtc47-algo-1-pc7sm | changehostname.c:15:21: error: expected expression before ‘;’ token
4j1p4rtc47-algo-1-pc7sm |    const char *val = ;
4j1p4rtc47-algo-1-pc7sm |                      ^
4j1p4rtc47-algo-1-pc7sm | gcc: error: changehostname.o: No such file or directory
4j1p4rtc47-algo-1-pc7sm | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
4j1p4rtc47-algo-1-pc7sm | Reporting training FAILURE
4j1p4rtc47-algo-1-pc7sm | framework error:
4j1p4rtc47-algo-1-pc7sm | Traceback (most recent call last):
4j1p4rtc47-algo-1-pc7sm |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/trainer.py", line 66, in train
4j1p4rtc47-algo-1-pc7sm |     env = environment.Environment()
4j1p4rtc47-algo-1-pc7sm |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/environment.py", line 498, in __init__
4j1p4rtc47-algo-1-pc7sm |     resource_config = resource_config or read_resource_config()
4j1p4rtc47-algo-1-pc7sm |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/environment.py", line 239, in read_resource_config
4j1p4rtc47-algo-1-pc7sm |     return _read_json(resource_config_file_dir)
4j1p4rtc47-algo-1-pc7sm |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/environment.py", line 191, in _read_json
4j1p4rtc47-algo-1-pc7sm |     with open(path, "r") as f:
4j1p4rtc47-algo-1-pc7sm | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
4j1p4rtc47-algo-1-pc7sm |
4j1p4rtc47-algo-1-pc7sm | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
4j1p4rtc47-algo-1-pc7sm exited with code 2
Aborting on container exit...
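Both the jq error and the Python traceback in this log point at the same missing path, so a quick check of which configuration files are actually visible can narrow things down. Below is a minimal sketch, assuming the three file names SageMaker documents for the training container's input/config directory; the function name is hypothetical, and the thread does not say why the file is absent, so this only reports what is missing.

```python
from pathlib import Path

# File names documented for /opt/ml/input/config in SageMaker training containers.
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def missing_config_files(config_dir="/opt/ml/input/config"):
    """Return the expected config file names that are not present in config_dir."""
    base = Path(config_dir)
    return [name for name in EXPECTED_CONFIG_FILES if not (base / name).is_file()]


# Usage: the default path is the one from the traceback; pass another
# directory to inspect a copy of the config folder elsewhere.
print(missing_config_files())
```

An empty list means all three files are present; if the directory itself does not exist, all three names are returned.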

danibendi-edgify commented 3 years ago

docker images
REPOSITORY                                                      TAG             IMAGE ID       CREATED        SIZE
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training   1.6.0-cpu-py3   10bf6569f174   8 weeks ago    3.92GB
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training   1.4.0-cpu-py3   9aae29fca0da   7 months ago   3.68GB