Closed jfrank94 closed 3 years ago
Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.
@jfrank94 Which docker image did you use? Could you share more details about how to reproduce the error?
@TristonC The docker image version is 19.3.12, but it's running through a Kubernetes cluster on Azure. The kernel version is 4.15.0-1096-azure and the kubelet version is 1.17.9. To reproduce this error, first install these packages in a Jupyter notebook:
!sudo apt install libquadmath0
!sudo apt install libnccl2=2.8.4-1+cuda11.2 --allow-downgrades -y
!pip install mxnet-cu112
!pip install --upgrade gluonts
!pip install scipy
!pip install xlrd==1.2.0
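Since the exact package versions matter when comparing Colab against the Docker image, a small version report can be attached to the reproduction. This is a minimal sketch using only the Python standard library; the distribution names are copied from the pip commands above, and the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
PACKAGES = ["mxnet-cu112", "gluonts", "scipy", "xlrd"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both environments and posting the two outputs side by side makes it easier to spot a version difference.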
Then import the mxnet library as well as these libraries under GluonTS:
import mxnet as mx
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer
from gluonts.dataset.util import to_pandas
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.support.util import get_download_path
from gluonts.model.predictor import Predictor
Then, you can import a dataset and convert it into a ListDataset like so:
training_data = ListDataset(
    [{"start": train.index[0], "target": train.value}],
    freq="1H",
)
Finally, initialize the DeepAREstimator and train it like so (the train function is what produced the error in the first place):
estimator = DeepAREstimator(
    freq="1D", prediction_length=28, num_parallel_samples=1,
    trainer=Trainer(epochs=60, ctx=ctx, num_batches_per_epoch=1),
)
train1_output = estimator.train(training_data=training_data, validation_data=validation_data)
I hope this helps!
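The snippet above passes `ctx` to the Trainer without defining it. A minimal sketch of how it could be set, assuming MXNet 1.x and assuming the intent was `mx.gpu()` as described in the issue; the helper name `pick_context` is hypothetical, and the broad `except` is only there because this is a diagnostic:

```python
def pick_context():
    """Return mx.gpu(0) if MXNet sees a GPU, mx.cpu() if not, None if MXNet is unusable."""
    try:
        import mxnet as mx
        n_gpus = mx.context.num_gpus()
    except Exception as exc:
        # Importing MXNet can fail in several ways, for example when a
        # shared library such as libquadmath is missing from the image.
        print(f"MXNet is not usable here: {exc!r}")
        return None
    return mx.gpu(0) if n_gpus > 0 else mx.cpu()


ctx = pick_context()
print("Using context:", ctx)
```

If this prints a CPU context inside the container but a GPU context on Colab, the difference lies in the environment rather than in the training code.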
Yeah, I could not reproduce the error in my local container (NGC container with MXNet 1.9) with GPU. I don't have access to Azure. @leezu Could you comment on this?
@TristonC It might help to have the folks who worked on the GluonTS library weigh in as well. Thanks for your help so far, btw.
The trace goes down to MXNet. But sure. @szha, can anyone from GluonTS help with this issue?
cc @lostella
Looks like the same issue as https://github.com/awslabs/gluon-ts/issues/1571
@TristonC @szha Are there any updates yet?
GluonTS support should go in awslabs/gluon-ts#1571, as MXNet doesn't have knowledge of the downstream application. If there's any unexpected behavior in MXNet, please share how to reproduce it in a bug report here.
@szha Alright, no problem. Thanks for your help.
I'm running into an error when running the DeepAREstimator from the GluonTS library. I have installed all of the necessary packages, and mx.gpu() recognizes that the GPU (NVIDIA, CUDA 11.2) exists on the system.
Note that I'm able to run the code fine on Colab without extra steps such as installing "libquadmath0" or the NCCL library (version 2.8.4 for CUDA 11.2), but when running in the Docker image, those extra installs seem to be required.
Here's the full error trace:
`0%| | 0/1 [00:00<?, ?it/s] learning rate from "lr_scheduler" has been overwritten by "learning_rate" in optimizer. 0%| | 0/1 [00:03<?, ?it/s]
MXNetError Traceback (most recent call last)