Open clianga opened 2 years ago
Also similar to #950
@clianga do you know what versions of gluonts, mxnet and CUDA you are running?
Hi @lostella, thank you for the quick feedback. I checked the versions by adding print(gluonts.__version__) and print(mxnet.__version__) to the .py file and reading the logs in CloudWatch; both printed 0.8.1. As for CUDA, I'm not familiar with GPU settings. Could you tell me how to print its version?
@clianga the name of the MXNet package that you have installed should tell you: for example, mxnet-cu92mkl means that the package was built for CUDA 9.2 with MKL-DNN enabled
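As a minimal sketch of that naming convention (not something posted in the thread), the snippet below reads the CUDA version out of an MXNet package name and lists the MXNet distributions visible to the current interpreter. The helper names `cuda_version_from_package` and `installed_mxnet_packages` are hypothetical, and the listing assumes Python 3.8+ for `importlib.metadata`; on older interpreters `pip list` gives the same information.

```python
import re
from typing import List, Optional


def cuda_version_from_package(name: str) -> Optional[str]:
    """Return e.g. '9.2' for 'mxnet-cu92mkl', or None for a CPU-only build."""
    match = re.search(r"-cu(\d{2,3})", name.lower())
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version: cu92 -> 9.2, cu101 -> 10.1.
    return f"{digits[:-1]}.{digits[-1]}"


def installed_mxnet_packages() -> List[str]:
    """List installed distributions whose name starts with 'mxnet'."""
    # Imported here because importlib.metadata only exists on Python 3.8+.
    from importlib.metadata import distributions

    names = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith("mxnet"):
            names.add(name)
    return sorted(names)


if __name__ == "__main__":
    for pkg in installed_mxnet_packages():
        print(pkg, "-> CUDA", cuda_version_from_package(pkg))
```

Running this inside the training script should show which MXNet build the SageMaker container ships with, even when the script itself never installs MXNet.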
Hi @lostella, I checked the .py file used to run this job. It doesn't seem to pip install MXNet, so maybe the SageMaker instance itself has MXNet installed? The parts of my .py file that involve MXNet are:
import os
os.system('pip install pandas')
os.system('pip install gluonts')
import pandas as pd
import pathlib
import gluonts
import numpy as np
import argparse
import json
import boto3
from mxnet.context import num_gpus, gpu, cpu
from gluonts.dataset.util import to_pandas
from gluonts.model.deepar import DeepAREstimator
from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator
from gluonts.model.lstnet import LSTNetEstimator
from gluonts.model.seq2seq import MQCNNEstimator
from gluonts.model.transformer import TransformerEstimator
from gluonts.evaluation.backtest import make_evaluation_predictions, backtest_metrics
from gluonts.evaluation import Evaluator
from gluonts.model.predictor import Predictor
from gluonts.dataset.common import ListDataset
from gluonts.dataset.field_names import FieldName
from gluonts.mx.trainer import Trainer
from gluonts.dataset.multivariate_grouper import MultivariateGrouper
from smdebug.mxnet import Hook
s3 = boto3.client("s3")
def uploadDirectory(model_dir,prefix,bucket):
    for root,dirs,files in os.walk(model_dir):
        for file in files:
            print(os.path.join(root,file))
            print(prefix+file)
            s3.upload_file(os.path.join(root,file),bucket,prefix+file)
In SageMaker Studio, I run:
import sagemaker
from sagemaker.mxnet import MXNet
mxnet_estimator = MXNet(entry_point='blog_train_algos.py',
                        role=sagemaker.get_execution_role(),
                        instance_type='ml.p3.2xlarge',
                        instance_count=1,
                        framework_version='1.7.0',
                        py_version='py3',
                        hyperparameters={'bucket': bucket,
                                         'seq': trial.trial_name,
                                         'algo': "seq2seq",
                                         'freq': "D",
                                         'prediction_length': 30,
                                         'epochs': 10,
                                         'learning_rate': 1e-3,
                                         'hybridize': False,
                                         'num_batches_per_epoch': 10,
                                         })
I'm using SageMaker Studio to train an MQCNN model. With the default layer settings, the model trains without any error on a CPU instance. But once I switched to an 'ml.p3.2xlarge' instance and changed ctx from 'cpu' to 'gpu', the loss became NaN in every epoch and the training process stopped. I saw a similar issue, #501, but it seems it was never solved: https://github.com/awslabs/gluon-ts/issues/501. Here's the log file from the CPU instance
and from the GPU instance:
The parameters of the two training jobs are exactly the same except for ctx = 'cpu' versus ctx = 'gpu'.
My code is based on this AWS blog post, with only a few parameter settings changed: https://aws.amazon.com/blogs/machine-learning/training-debugging-and-running-time-series-forecasting-models-with-the-gluonts-toolkit-on-amazon-sagemaker/
Please help, thank you!
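For readers following along, the ctx switch described above could look roughly like the sketch below. It is an illustration under assumptions, not the code from the training script: `resolve_ctx_name` and `make_mxnet_ctx` are hypothetical helpers, and only `num_gpus`, `gpu` and `cpu` come from the imports already shown. It does not address the NaN loss itself; it only makes sure that a requested GPU context is backed by a visible device.

```python
def resolve_ctx_name(requested: str, visible_gpus: int) -> str:
    """Return 'gpu' only when it was requested and a GPU is visible."""
    if requested not in ("cpu", "gpu"):
        raise ValueError(f"unknown ctx: {requested!r}")
    if requested == "gpu" and visible_gpus < 1:
        # Fall back to CPU rather than failing when no GPU is present.
        return "cpu"
    return requested


def make_mxnet_ctx(requested: str):
    """Build the MXNet context object for the resolved device name."""
    # Imported lazily so resolve_ctx_name works without MXNet installed.
    from mxnet.context import cpu, gpu, num_gpus

    name = resolve_ctx_name(requested, num_gpus())
    return gpu(0) if name == "gpu" else cpu()
```

The result would then be passed to the trainer, for example `Trainer(ctx=make_mxnet_ctx("gpu"), epochs=10, learning_rate=1e-3, hybridize=False, num_batches_per_epoch=10)`, reusing the hyperparameters listed in the estimator call.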