developmentseed / label-maker

Data Preparation for Satellite Machine Learning
http://devseed.com/label-maker/
MIT License

Sagemaker 'NoneType object' issue with data in 'walkthrough-classification-mxnet-sagemaker' example #90

Open joshwapiano opened 6 years ago

joshwapiano commented 6 years ago

I've been following the walkthrough found here (albeit with a smaller bounding box) and have initiated a SageMaker notebook instance. The data.npz file is sitting in the sagemaker folder, and I have no problem reading it when running the relevant sections of mx_lenet_sagemaker.py in a new notebook on the instance. However, when I run the second cell of SageMaker_mx-lenet I hit the following error:

ValueError: Error training sagemaker-mxnet-2018-07-08-18-12-13-217: Failed Reason: AlgorithmError: uncaught exception during training: 'NoneType' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/mxnet_container/train.py", line 191, in train
    model = user_module.train(**kwargs_to_pass)
  File "/opt/ml/code/mx_lenet_sagemaker.py", line 92, in train
    train_iter, val_iter = prep_data(data_path)
  File "/opt/ml/code/mx_lenet_sagemaker.py", line 14, in prep_data
    data = np.load(find_file(data_path, 'data.npz'))
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 402, in load
    magic = fid.read(N)
AttributeError: 'NoneType' object has no attribute 'read'

After several hours of trying different fixes I've had little to no luck debugging this. Could you check the example to confirm it still runs correctly when you attempt it?
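For reference, the kind of check that runs fine directly on the notebook instance is roughly the following (just a sketch; it assumes data.npz sits in the notebook's working directory):

import numpy as np

# open the archive produced by label-maker and peek at its contents
data = np.load('data.npz')
print(data.files)                 # expect x_train, y_train, x_test, y_test
print(data['x_train'].shape)      # sanity-check the image array dimensions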

Geoyi commented 6 years ago

@joshwapiano, my guess is that something happened with the data (it's None), and it could be caused by more than one problem.
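One quick way to narrow it down is to check whether data.npz actually made it onto the training container, since np.load ends up being handed None whenever find_file can't locate the file. A standalone sketch (the /opt/ml/input/data path is the usual SageMaker input location, but treat it as an assumption):

import os

def locate(data_dir, fname='data.npz'):
    """Return the full path to fname under data_dir, or None if it is missing."""
    for root, _, files in os.walk(data_dir):
        if fname in files:
            return os.path.join(root, fname)
    return None

print(locate('/opt/ml/input/data'))  # None here reproduces the traceback above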

Let me know if we can help further.

joshwapiano commented 6 years ago

@Geoyi Thanks for getting back to me. I had written a much longer response, but for some reason GitHub has not saved this comment.

Essentially I have investigated both problem 1 and problem 2 and both give the expected results.

I have a feeling the issue is with the S3 bucket, and I have tried multiple approaches to it, none of which have been successful. Would you be able to run the example on your own SageMaker notebook instance and, if it functions as expected, share the syntax/approach you used for the mxnet_estimator.fit argument, along with any other changes you made?
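For context, the sort of thing I have been attempting looks roughly like this (the bucket name is a placeholder, so treat it as a sketch rather than exactly what I ran):

from sagemaker import get_execution_role
from sagemaker.mxnet import MXNet

# placeholder S3 location where data.npz was uploaded
s3_data = 's3://my-label-maker-bucket/'

mxnet_estimator = MXNet('mx_lenet_sagemaker.py',
                        role=get_execution_role(),
                        train_instance_type='ml.p2.xlarge',
                        train_instance_count=1)

# the argument passed to fit() is the part I am unsure about
mxnet_estimator.fit(s3_data)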

Many thanks

mapmeld commented 6 years ago

I'm getting this same problem.

Geoyi commented 6 years ago

@joshwapiano and @mapmeld, I will spin up the SageMaker notebook and take a look next week. Let me know if you solve the problem before I get back to you. Sorry for the delay.

joshwapiano commented 6 years ago

@Geoyi thanks for getting back to us - still not having any luck producing the correct data format/feed for the sagemaker notebook. I think they have made changes to sagemaker/mxnet without providing adequate documentation. Good luck, looking forward to hearing from you.

Geoyi commented 6 years ago

Phewwww, I finally solved the problem, and it took me a whole morning today, @joshwapiano and @mapmeld. You're right about the S3 bucket and prep_data(find_file( ... )), @mapmeld; I deleted the find_file( ... ) function. And @joshwapiano, the SageMaker team doesn't do a good job of documenting their work.

Additional things I did are reflected in the scripts below.

Here are the scripts to replace the ones in this notebook:

%%file mx_lenet_sagemaker.py
### replace this to the first cell

import logging
from os import path as op
import os

import mxnet as mx
import numpy as np
import boto3

batch_size = 64
num_cpus = 0
num_gpus = 1

s3_url = "Your_s3_bucket_URL"
s3_client = boto3.client('s3')
s3_client.download_file('Your-bucket-name', "data.npz", "data.npz")  # pull data.npz from your S3 bucket into the working directory

def prep_data():
    """Load data.npz from the working directory and convert the numpy arrays to MXNet data iterators."""
    data_file = np.load(op.join(os.getcwd(), 'data.npz'))
    x_train = data_file['x_train']
    y_train = data_file['y_train'][:, :1]  ## keep only the first column of y_train
    x_test = data_file['x_test']
    y_test = data_file['y_test'][:, :1]
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    print(x_train.shape, x_train.mean())
    img_mean = np.mean(x_train, axis=(0, 1, 2))
    img_std = np.std(x_train, axis=(0, 1, 2))

    x_train -= img_mean
    x_train /= img_std
    x_test -= img_mean
    x_test /= img_std

    img_rows = 256
    img_cols = 256

    x_train = x_train.reshape(x_train.shape[0], 3, img_rows, img_cols)  ## images to channel-first (N, 3, 256, 256)
    x_test = x_test.reshape(x_test.shape[0], 3, img_rows, img_cols)
    y_train = y_train.reshape(y_train.shape[0], )  ## reshape labels to (448, ) instead of (448, 1)
    y_test = y_test.reshape(y_test.shape[0], )
    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

    train_iter = mx.io.NDArrayIter(x_train, y_train, batch_size, shuffle=True)
    val_iter = mx.io.NDArrayIter(x_test, y_test, batch_size)

    return train_iter, val_iter

def mx_lenet():
    """Building a three layer LeNet sytle Convolutional Neural Net using MXNet."""
    data = mx.sym.var('data')
    data_dp = mx.symbol.Dropout(data, p = 0.2) ## 20% of the input that gets dropped out during training time
    # first conv layer
    conv1 = mx.sym.Convolution(data=data_dp, kernel=(5, 5), num_filter=20)
    tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
    pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2, 2), stride=(2, 2))
    # second conv layer
    conv2 = mx.sym.Convolution(data=pool1, kernel=(5, 5), num_filter=50)
    tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
    pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2, 2), stride=(2, 2))

    # third conv layer
    conv3 = mx.sym.Convolution(data=pool2, kernel=(5, 5), num_filter=50)
    tanh3 = mx.sym.Activation(data=conv3, act_type="tanh")
    pool3 = mx.sym.Pooling(data=tanh3, pool_type="max", kernel=(2, 2), stride=(2, 2))

    # first fullc layer
    flatten = mx.sym.flatten(data=pool3)
    fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=500)
    tanh4 = mx.sym.Activation(data=fc1, act_type="tanh")
    # second fullc
    fc2 = mx.sym.FullyConnected(data=tanh4, num_hidden=2)
    # softmax loss
    return mx.sym.SoftmaxOutput(data=fc2, name='softmax')

def train(num_cpus, num_gpus, **kwargs):
    """
    Train the image classification neural net.
    Parameters
    ----------
    num_cpus: if training the model on an AWS GPU machine, set num_cpus = 0 and num_gpus = 1, and vice versa.
    num_gpus: the same rule as above applies.
    """
    train_iter, val_iter = prep_data()
    lenet = mx_lenet()
    lenet_model = mx.mod.Module(
        symbol=lenet,
        context=get_train_context(num_cpus, num_gpus))
    logging.getLogger().setLevel(logging.DEBUG)
    lenet_model.fit(train_iter,
                    eval_data=val_iter,
                    optimizer='sgd',
                    optimizer_params={'learning_rate': 0.1},
                    eval_metric='acc',
                    batch_end_callback=mx.callback.Speedometer(batch_size, 16),
                    num_epoch=100)
    return lenet_model

def get_train_context(num_cpus, num_gpus):
    """
    Define the training context (device).
    Parameters
    ----------
    num_cpus: if training the model on an AWS GPU machine, set num_cpus = 0 and num_gpus = 1, and vice versa.
    num_gpus: the same rule as above applies.
    """
    if num_gpus > 0:
        print("It's a {}-GPU instance".format(num_gpus))
        return mx.gpu()
    print("It's a {}-CPU instance".format(num_cpus))
    return mx.cpu()

and replace the second cell with this:

%%time
from sagemaker.mxnet import MXNet
from sagemaker import get_execution_role

s3_url = "Your_s3_bucket_URL"
mxnet_estimator = MXNet("mx_lenet_sagemaker.py",
                        role=get_execution_role(),
                        output_path=s3_url,
                        train_instance_type="ml.p2.xlarge",
                        train_instance_count=1)

mxnet_estimator.fit(s3_url)
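One note: mx_lenet_sagemaker.py above pulls data.npz out of the bucket with download_file, so the file needs to be uploaded to that bucket beforehand, for example with something like this (the bucket name is a placeholder):

import boto3

# upload the local data.npz to the same bucket the training script downloads from
boto3.client('s3').upload_file('data.npz', 'Your-bucket-name', 'data.npz')
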
mapmeld commented 6 years ago

@Geoyi this works for me - thank you so much for fixing this!

joshwapiano commented 6 years ago

@Geoyi Many thanks for providing this! Looking forward to trying it out; will let you know how I get on! I've also come across an issue with the labelling that label-maker is producing, which I will raise in a separate issue.