dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0
2.56k stars, 538 forks

Gluon NLP Horovod Issue #1415

Open gauravrele87 opened 4 years ago

gauravrele87 commented 4 years ago

I was trying to use the BERT model given here with the SageMaker MXNet estimator and Horovod, and it's giving me errors.

Script: https://github.com/dmlc/gluon-nlp/tree/v0.10.x/scripts/bert. I was running finetune_squad.py with the following code:

from sagemaker.mxnet import MXNet
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()

role = sagemaker.get_execution_role()

hyperparameters = {
    'comm_backend':'horovod',
    }

num_instances = 2 # How many nodes you want to use.
instance_family = 'ml.p3.2xlarge' # Which instance type you want to use.
source_name = 'finetune_squad.py'

distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': 2, #Each instance has 8 gpus
        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO'
    }
}

estimator = MXNet(
                entry_point=source_name,         #Script entry point.
                source_dir='.',                #Script Location
                role=role, 
                train_instance_type=instance_family,
                train_instance_count=num_instances,
                framework_version='1.7.0',            #MXNet version.
                train_volume_size=10,                #Size for the dataset.
                py_version='py3',                     #Python version.
                hyperparameters=hyperparameters,
                distributions=distributions           #For use with Horovod.
)
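
Two details in the configuration above may be worth checking before digging further, though neither is confirmed as the cause of the error. First, ml.p3.2xlarge has a single GPU, so 'processes_per_host': 2 asks for more Horovod workers than there are devices. Second, Open MPI's mpirun passes environment variables with `-x NAME=value`, so `--NCCL_DEBUG=INFO` is unlikely to be accepted as written. The sketch below builds the `distributions` dictionary from the instance type; the helper name `build_mpi_distribution` and the lookup table are hypothetical and exist only for illustration, not as part of the SageMaker SDK.

```python
# GPU counts for the P3 instance family (illustrative lookup table,
# not something provided by the SageMaker SDK).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def build_mpi_distribution(instance_type, nccl_debug="INFO"):
    """Hypothetical helper: one Horovod process per GPU on each host."""
    gpus = GPUS_PER_INSTANCE[instance_type]
    return {
        "mpi": {
            "enabled": True,
            # Match the number of MPI processes to the GPUs on the host.
            "processes_per_host": gpus,
            # mpirun exports environment variables with -x NAME=value.
            "custom_mpi_options": f"-verbose -x NCCL_DEBUG={nccl_debug}",
        }
    }


distributions = build_mpi_distribution("ml.p3.2xlarge")
print(distributions)
```

The resulting dictionary can be passed to the estimator in place of the hand-written one. If more than one process per host is the goal, a multi-GPU instance type such as ml.p3.8xlarge would be needed.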

sxjscience commented 4 years ago

Would you attach more details on how we may reproduce the issue?

gauravrele87 commented 4 years ago

And here is the error: [image: image(4)]