apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Issues with GluonTS library (MXNet Error) #20363

Closed: jfrank94 closed this issue 3 years ago

jfrank94 commented 3 years ago

I'm running into an error when training the DeepAREstimator from the GluonTS library. I have installed all of the necessary packages, and when I call mx.gpu(), MXNet recognizes that the GPU (NVIDIA, CUDA 11.2) exists on the system.
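For reference, this is roughly how I checked that the device is visible (a small sketch, assuming the `mxnet-cu112` wheel is installed, not the exact notebook cell):

```python
import mxnet as mx

# Sanity check: does MXNet see the GPU at all?
print(mx.context.num_gpus())  # expected to print 1 on this node
print(mx.gpu())               # gpu(0)
```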

Note that on Colab the code runs fine without extra steps like installing "libquadmath0" or the NCCL library (version 2.8.4 for CUDA 11.2); on the Docker image those installs are needed, and that is where the error occurs.

Here's the full error trace:

```
  0%|          | 0/1 [00:00<?, ?it/s]
learning rate from "lr_scheduler" has been overwritten by "learning_rate" in optimizer.
  0%|          | 0/1 [00:03<?, ?it/s]

MXNetError                                Traceback (most recent call last)
<ipython-input-...> in <module>
     29 print("\nPatient {} - Amount of Days (Train): {}\n | Amount of Days (Valid): {}\n".format(p_id, train_days, valid_days))
     30 
---> 31 train1_output = estimator.train(training_data=training_data, validation_data=validation_data)
     32 #print(agg_metrics)

/usr/local/lib/python3.6/dist-packages/gluonts/mx/model/estimator.py in train(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data, **kwargs)
    205             num_prefetch=num_prefetch,
    206             shuffle_buffer_length=shuffle_buffer_length,
--> 207             cache_data=cache_data,
    208         ).predictor

/usr/local/lib/python3.6/dist-packages/gluonts/mx/model/estimator.py in train_model(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data)
    177             net=training_network,
    178             train_iter=training_data_loader,
--> 179             validation_iter=validation_data_loader,
    180         )
    181 

/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py in __call__(self, net, train_iter, validation_iter)
    377                 epoch_no,
    378                 train_iter,
--> 379                 num_batches_to_use=self.num_batches_per_epoch,
    380             )
    381             if is_validation_available:

/usr/local/lib/python3.6/dist-packages/gluonts/mx/trainer/_base.py in loop(epoch_no, batch_iter, num_batches_to_use, is_training)
    308                 batch_size = loss.shape[0]
    309 
--> 310                 if not np.isfinite(ndarray.sum(loss).asscalar()):
    311                     logger.warning(
    312                         "Batch [%d] of Epoch[%d] gave NaN loss and it will be ignored",

/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py in asscalar(self)
   2583             raise ValueError("The current array is not a scalar")
   2584         if self.ndim == 1:
-> 2585             return self.asnumpy()[0]
   2586         else:
   2587             return self.asnumpy()[()]

/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   2564             self.handle,
   2565             data.ctypes.data_as(ctypes.c_void_p),
-> 2566             ctypes.c_size_t(data.size)))
   2567         return data
   2568 

/usr/local/lib/python3.6/dist-packages/mxnet/base.py in check_call(ret)
    244     """
    245     if ret != 0:
--> 246         raise get_last_ffi_error()
    247 
    248 

MXNetError: Traceback (most recent call last):
  File "../include/mshadow/././././cuda/tensor_gpu-inl.cuh", line 129
Name: Check failed: err == cudaSuccess (209 vs. 0) : MapPlanKernel ErrStr:no kernel image is available for execution on the device
```
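For what it's worth, the call that actually fails is the device-to-host copy inside `asnumpy()`, and CUDA error 209 is `cudaErrorNoKernelImageForDevice` ("no kernel image is available for execution on the device"). A minimal sketch that exercises the same path without GluonTS (my assumption of a reduced repro, not the exact code I ran):

```python
import mxnet as mx

# Allocate on the GPU, force a kernel launch, then copy the result back to host.
x = mx.nd.ones((2, 2), ctx=mx.gpu(0))
y = (x * 2).sum()
print(y.asscalar())  # on the affected node this should hit the same MXNetError if plain GPU ops are broken
```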
github-actions[bot] commented 3 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

TristonC commented 3 years ago

@jfrank94 Which docker image did you use? Could you share more details about how to reproduce the error?

jfrank94 commented 3 years ago

@TristonC The Docker version is 19.03.12, running through a Kubernetes cluster on Azure. The kernel version is 4.15.0-1096-azure and the kubelet version is 1.17.9. To reproduce this error, first install these packages in a Jupyter notebook:

```
!sudo apt install libquadmath0
!sudo apt install libnccl2=2.8.4-1+cuda11.2 --allow-downgrades -y
!pip install mxnet-cu112
!pip install --upgrade gluonts
!pip install scipy
!pip install xlrd==1.2.0
```

Then import the mxnet library as well as these libraries under GluonTS:

```python
import mxnet as mx

from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer
from gluonts.dataset.util import to_pandas
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.support.util import get_download_path
from gluonts.model.predictor import Predictor
```

Then you can load a dataset and convert it into a ListDataset like so:

```python
training_data = ListDataset(
    [{"start": train.index[0], "target": train.value}],
    freq="1H",
)
```
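For context, `train` here is the per-patient series loaded earlier. For illustration only, assuming it's a pandas DataFrame with a `value` column on an hourly DatetimeIndex, a stand-in could look like this (names and data are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real patient data (illustrative only).
index = pd.date_range("2021-01-01", periods=24 * 30, freq="1H")
train = pd.DataFrame({"value": np.random.randn(len(index))}, index=index)
```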

Finally, initialize the DeepAREstimator and train it like so (the train function is what produced the error in the first place):

```python
estimator = DeepAREstimator(
    freq="1D",
    prediction_length=28,
    trainer=Trainer(epochs=60, ctx=ctx, num_batches_per_epoch=1),
    num_parallel_samples=1,
)
train1_output = estimator.train(training_data=training_data, validation_data=validation_data)
```
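(`ctx` and `validation_data` are defined earlier in the notebook; roughly, and assuming a single GPU plus a second hold-out frame `valid`, they look like this sketch rather than the exact code:)

```python
# Sketch of the pieces not shown above (assumed, not the exact code).
ctx = mx.gpu()  # training context; mx.gpu() does detect the device
validation_data = ListDataset(
    [{"start": valid.index[0], "target": valid.value}],
    freq="1H",
)
```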

I hope this helps!

TristonC commented 3 years ago

Yeah, I could not reproduce the error in my local container (NGC container with MXNet 1.9) with GPU. I don't have access to Azure. @leezu Could you comment on this?

jfrank94 commented 3 years ago

@TristonC Maybe it might help to have the folks who worked on the GluonTS library help as well? Thanks for your help so far, btw.

TristonC commented 3 years ago

The trace does go down into MXNet. But sure. @szha, could any of our GluonTS friends help with this issue?

szha commented 3 years ago

cc @lostella

lostella commented 3 years ago

Looks like the same issue as https://github.com/awslabs/gluon-ts/issues/1571

jfrank94 commented 3 years ago

@TristonC @szha Are there any updates as of yet?

szha commented 3 years ago

GluonTS support should go in awslabs/gluon-ts#1571 as mxnet doesn't have knowledge of the downstream application. If there's any unexpected behavior in mxnet, please share how that's reproduced in a bug report here.

jfrank94 commented 3 years ago

@szha Alright, no problem. Thanks for your help.