aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

Another_scikit_bring_your_own: Failed Reason: AlgorithmError: Exit Code: 1 #415

Closed davidmhgregory closed 6 years ago

davidmhgregory commented 6 years ago

Hello, this issue may be the same as in #372. I was running the scikit_bring_your_own.ipynb notebook as a tutorial. I have not changed any code. When I get to the cell:

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/decision-trees-sample:latest'.format(account, region)

tree = sage.estimator.Estimator(image,
                                role, 1, 'ml.c4.2xlarge',
                                output_path="s3://{}/output".format(sess.default_bucket()),
                                sagemaker_session=sess)

tree.fit(data_location)

I get the error: Error training decision-trees-sample-2018-09-19-23-24-06-866: Failed Reason: AlgorithmError: Exit Code: 1.

I also see above it: ImportError: libopenblasp-r0-39a31c03.2.18.so: cannot open shared object file: No such file or directory

I was looking for a reason and found https://github.com/numpy/numpy/issues/8076. That link makes me think it may have something to do with the way numpy is linked together with other packages. Wondering if anyone else has seen this issue? Thanks, David
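One way to check whether the linkage really is the problem is to compare the OpenBLAS libraries scipy expects with what numpy actually ships inside the container. This is only a diagnostic sketch; it assumes the image was built locally under the decision-trees-sample name used in the notebook:

docker run --rm decision-trees-sample \
    ls /usr/local/lib/python2.7/dist-packages/numpy/.libs /usr/local/lib/python2.7/dist-packages/scipy/.libs

If scipy/.libs only contains links into numpy/.libs (the effect of the symlink line in the Dockerfile) while scipy's extension modules were built against a differently hashed libopenblasp-r0-*.so, the import fails with exactly this missing-shared-object error.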

davidmhgregory commented 6 years ago

Looks like I solved my own problem by deleting a line from the Dockerfile:

(cd /usr/local/lib/python2.7/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \

This might not be optimal, but the code is working for me now. Thanks.

mvsusp commented 6 years ago

Hi @davidmhgregory,

I tried to reproduce your issue without success. It seems that any numpy-related issue in the Dockerfile may have been fixed by now.

I am closing this ticket, feel free to reopen it if you have additional questions.

Thanks for using SageMaker!

mwessman commented 5 years ago

I've been receiving this error with the build-your-own scikit example when training the model:

ImportError: libopenblasp-r0-8dca6697.3.0.dev.so: cannot open shared object file: No such file or directory

I've been trying to find the solution to this issue for hours, and removing the line from the Dockerfile that @davidmhgregory mentioned solved it.
Hopefully someone can look into why I received that error when trying to train my model.

VivianMagri commented 5 years ago

I was having the same kind of error in my execution. Not with the example, which worked fine, but when I modified the files to adapt them to my own algorithm and tried to train my image. I needed to import a module from statsmodels and, even though I had installed the package when building the container, the import still failed with this error when the code ran. It was also solved by removing the mentioned line from the Dockerfile. I suppose it happens because that line interferes with modules that may be necessary for other implementations or for running under different conditions... (just guessing, I'm kind of new to this).
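For anyone hitting the same thing, a minimal sketch of the change, assuming the stock Dockerfile from the example with statsmodels added as the extra dependency (the exact package list will differ per project):

RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py && \
    pip install numpy scipy scikit-learn pandas statsmodels flask gevent gunicorn && \
    rm -rf /root/.cache

Dropping the scipy/.libs symlink line entirely means each package keeps the shared libraries bundled with its own wheel, so the ImportError above does not come up.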

davidmhgregory commented 5 years ago

I find that I get this error whenever the numpy version in the Dockerfile is not set to the most recent version of numpy available. So for instance:

RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py && \
    pip install numpy==1.14.5 scipy scikit-learn pandas flask gevent gunicorn && \
    (cd /usr/local/lib/python2.7/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \
    rm -rf /root/.cache

will result in an error. The best fix is to update numpy from 1.14.5 to 1.16.1 (the most recent version at the time of this message), but you will need to update the version number as needed:

RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py && \
    pip install numpy==1.16.1 scipy scikit-learn pandas flask gevent gunicorn && \
    (cd /usr/local/lib/python2.7/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \
    rm -rf /root/.cache

Alternately, one can still delete the line starting with "(cd"; however, I believe this defeats the linking optimization SageMaker included:

RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py && \
    pip install numpy==1.14.5 scipy scikit-learn pandas flask gevent gunicorn && \
    rm -rf /root/.cache

mvsusp commented 5 years ago

Hi @davidmhgregory, @VivianMagri, and @mwessman,

The code changes merged in https://github.com/awslabs/amazon-sagemaker-examples/pull/645 freeze numpy and scipy, which solves the issue.
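For anyone maintaining their own Dockerfile, "freezing" here just means pinning both packages to known-compatible versions in the install step, along these lines (illustrative pins only, not necessarily the exact ones in the PR):

pip install numpy==1.16.1 scipy==1.2.1 scikit-learn pandas flask gevent gunicorn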

Thanks for using SageMaker!

Dhiraj223 commented 1 month ago

Hey, I am also getting the same error, and it's because the training paths are not being set properly (see the sketch after the code below).

Here is the error:

To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Training data directory: None
Test data directory: None
Traceback (most recent call last):
  File "/opt/ml/code/train_tensorflow_cnn.py", line 107, in <module>
    main()
  File "/opt/ml/code/train_tensorflow_cnn.py", line 61, in main
    train_images, train_labels = load_mnist_data(train_data_dir)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ml/code/train_tensorflow_cnn.py", line 19, in load_mnist_data
    bucket_name = data_dir.split('/')[2]
                  ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'

This is my training file :

import argparse
import os
import json
import tensorflow as tf
from keras import layers, models
import boto3
import pandas as pd
import numpy as np
from PIL import Image
import io
from urllib.parse import urlparse

def load_mnist_data(data_dir):
    s3 = boto3.client('s3')
    images = []
    labels = []

    # Parse S3 URI
    bucket_name = data_dir.split('/')[2]
    prefix = '/'.join(data_dir.split('/')[3:])

    # Load labels
    response = s3.get_object(Bucket=bucket_name, Key=f"{prefix}/labels.csv")
    labels_df = pd.read_csv(io.BytesIO(response['Body'].read()))

    # Load images
    for _, row in labels_df.iterrows():
        image_key = f"{prefix}/images/{row['filename']}"
        response = s3.get_object(Bucket=bucket_name, Key=image_key)
        image = Image.open(io.BytesIO(response['Body'].read())).convert('L')
        images.append(np.array(image))
        labels.append(row['label'])

    return np.array(images), np.array(labels)

def main():
    parser = argparse.ArgumentParser()

    # SageMaker specific arguments
    parser.add_argument('--hosts', type=list, default=json.loads(os.environ.get('SM_HOSTS', '[]')))
    parser.add_argument('--current-host', type=str, default=os.environ.get('SM_CURRENT_HOST', ''))
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN', ' '))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST', ' '))
    parser.add_argument('--num-gpus', type=int, default=os.environ.get('SM_NUM_GPUS', 0))

    # Hyperparameters
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--learning-rate', type=float, default=0.001)

    args, _ = parser.parse_known_args()

    train_data_dir = os.environ.get('SM_CHANNEL_TRAIN')
    test_data_dir = os.environ.get('SM_CHANNEL_TEST')

    print(f"Training data directory: {train_data_dir}")  # This should print the S3 path or local path in SageMaker
    print(f"Test data directory: {test_data_dir}")

    train_images, train_labels = load_mnist_data(train_data_dir)
    test_images, test_labels = load_mnist_data(test_data_dir)

    # Normalize images
    train_images = train_images.astype('float32') / 255
    test_images = test_images.astype('float32') / 255

    # Reshape images for CNN input
    train_images = train_images.reshape((-1, 28, 28, 1))
    test_images = test_images.reshape((-1, 28, 28, 1))

    # Convert labels to categorical
    train_labels = tf.keras.utils.to_categorical(train_labels, 10)
    test_labels = tf.keras.utils.to_categorical(test_labels, 10)

    # Define the CNN model
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Train the model
    model.fit(train_images, train_labels,
              epochs=args.epochs,
              batch_size=args.batch_size,
              validation_split=0.1,
              verbose=1)

    # Evaluate the model
    test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
    print(f'\nTest accuracy: {test_acc}')

    # Save the model
    model.save(os.path.join(args.model_dir, '1'))  # SageMaker expects model artifacts in a numbered subdirectory

if __name__ == '__main__':
    main()

And here is the sagemaker training job file :

import sagemaker
from sagemaker.estimator import Estimator
import boto3
import os

# Set up SageMaker session and role
sagemaker_session = sagemaker.Session()
role = 'arn:aws:iam::my-id:role/SageMakerExecutionRole'

# Define the container image URI
container_uri = 'my-id.dkr.ecr.ap-south-1.amazonaws.com/my-sagemaker-cnn:latest'

# S3 bucket for output
bucket = 'xxxxxxxx' 
prefix = 'mnist'

# Set up the estimator
estimator = Estimator(
    image_uri=container_uri,
    role=role,
    instance_count=1,
    instance_type='ml.c5.2xlarge',  # CPU instance; swap in a GPU instance type if needed
    volume_size=30,
    output_path=f's3://{bucket}/{prefix}/output/',
    hyperparameters={
        'epochs': 100,
        'batch-size': 64,
        'learning-rate': 0.001
    },
    sagemaker_session=sagemaker_session
)

# Define input channels
train_data = sagemaker.inputs.TrainingInput(
    s3_data=f's3://{bucket}/train/',
    content_type='application/x-image'
)
test_data = sagemaker.inputs.TrainingInput(
    s3_data=f's3://{bucket}/test/',
    content_type='application/x-image'
)

# Fit the model
estimator.fit({'train': train_data, 'test': test_data})

# Print job name for reference
print(f"Training job name: {estimator.latest_training_job.job_name}")

Thanks