GoogleCloudPlatform / training-data-analyst

Labs and demos for courses for GCP Training (http://cloud.google.com/training).
Apache License 2.0

Courses Deepdive Sequence Poetry have T2T version conflicts #500

Open jsnowacki opened 5 years ago

jsnowacki commented 5 years ago

The problem is related to courses/machine_learning/deepdive/09_sequence, which is used in Coursera's Sequence Models for Time Series and Natural Language Processing course, part of Advanced Machine Learning with TensorFlow on Google Cloud Platform. When one reaches the "Train model locally on subset of data" part, the command below:

%%bash
DATA_DIR=gs://${BUCKET}/poetry/subset
OUTDIR=./trained_model
rm -rf $OUTDIR
t2t-trainer \
  --data_dir=gs://${BUCKET}/poetry/subset \
  --t2t_usr_dir=./poetry/trainer \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_poetry \
  --output_dir=$OUTDIR --job-dir=$OUTDIR --train_steps=10

throws an error:

/usr/local/envs/py2env/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/usr/local/envs/py2env/bin/t2t-trainer", line 23, in <module>
    from tensor2tensor.bin import t2t_trainer
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensor2tensor/bin/t2t_trainer.py", line 25, in <module>
    from tensor2tensor import models  # pylint: disable=unused-import
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensor2tensor/models/__init__.py", line 25, in <module>
    from tensor2tensor.layers import modalities  # pylint: disable=g-import-not-at-top
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensor2tensor/layers/modalities.py", line 28, in <module>
    from tensor2tensor.layers import common_attention
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensor2tensor/layers/common_attention.py", line 31, in <module>
    from tensor2tensor.layers import common_layers
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensor2tensor/layers/common_layers.py", line 30, in <module>
    import tensorflow_probability as tfp
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow_probability/__init__.py", line 68, in <module>
    _ensure_tf_install()
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow_probability/__init__.py", line 65, in _ensure_tf_install
    present=tf.__version__))
ImportError: This version of TensorFlow Probability requires TensorFlow version >= 1.14; Detected an installation of version 1.13.1. Please upgrade TensorFlow to proceed.

On the other hand, when I try to upgrade TensorFlow in the top setup cells, as the exception suggests, another issue arises:

tensorflow 1.14.0 has requirement numpy<2.0,>=1.14.5, but you'll have numpy 1.14.0 which is incompatible.
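
Both conflicts above are plain version-range checks. A minimal sketch of those checks in Python, using only tuple comparison; the helper names `parse_version` and `satisfies` are made up for illustration, and they handle only simple numeric versions such as `1.14.0`, not the full specifier syntax that pip implements:

```python
def parse_version(text):
    """Turn a purely numeric version such as '1.14' into (1, 14, 0)."""
    parts = [int(part) for part in text.split(".")]
    # Pad to three components so that '1.14' and '1.14.0' compare as equal.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def satisfies(version, lower=None, upper=None):
    """Return True if lower <= version < upper (either bound may be omitted)."""
    parsed = parse_version(version)
    if lower is not None and parsed < parse_version(lower):
        return False
    if upper is not None and parsed >= parse_version(upper):
        return False
    return True


# TensorFlow Probability's check from the traceback: TensorFlow >= 1.14.
print(satisfies("1.13.1", lower="1.14"))  # False
print(satisfies("1.14.0", lower="1.14"))  # True

# tensorflow 1.14.0's requirement: numpy<2.0,>=1.14.5.
print(satisfies("1.14.0", lower="1.14.5", upper="2.0"))  # False
print(satisfies("1.14.6", lower="1.14.5", upper="2.0"))  # True
```

Seen this way, upgrading TensorFlow alone is not enough: numpy has to move to at least 1.14.5 in the same step.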

Also, while you're at it, it'd be good IMO to bump the notebook's Python version up to 3; currently it's 2.

jsnowacki commented 5 years ago

Changing the versions with the below values seems to fix the error:

pip install tensor2tensor==1.13.4 tensorflow==1.14 tensorflow-serving-api==1.14.0rc0 gutenberg numpy==1.14.6

jsnowacki commented 5 years ago

OK, correction: it fixed just the local run. If you try to run Cloud ML Engine training via the command:

%%bash
GPU="--train_steps=7500 --cloud_mlengine --worker_gpu=1 --hparams_set=transformer_poetry"

DATADIR=gs://${BUCKET}/poetry/data
OUTDIR=gs://${BUCKET}/poetry/model
JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
echo "'Y'" | t2t-trainer \
  --data_dir=gs://${BUCKET}/poetry/subset \
  --t2t_usr_dir=./poetry/trainer \
  --problem=$PROBLEM \
  --model=transformer \
  --output_dir=$OUTDIR \
  ${GPU}

the same error gets thrown by the worker.

lakshmanok commented 5 years ago

Unfortunately, project gutenberg doesn't support Python 3. To fix the ML Engine run, could you try modifying setup.py to pin the versions of the libraries as above? If that works, please submit a pull request with your changes.

jsnowacki commented 5 years ago

I've changed the setup.py section to:

%%writefile poetry/setup.py
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
    'tensor2tensor==1.13.4',
    'tensorflow==1.14', 
    'tensorflow-serving-api==1.14.0rc0',
    'numpy==1.14.6'
]

setup(
    name='poetry',
    version='0.1',
    author = 'Google',
    author_email = 'training-feedback@cloud.google.com',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Poetry Line Problem',
    requires=[]
)
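
Before submitting a job, it can help to confirm that the local environment actually matches these pins. A minimal sketch, assuming every requirement uses the exact `name==version` form; `find_mismatches` and the `installed` dictionary are hypothetical names, and in a real environment the installed versions would come from something like `pip freeze`:

```python
REQUIRED_PACKAGES = [
    'tensor2tensor==1.13.4',
    'tensorflow==1.14',
    'tensorflow-serving-api==1.14.0rc0',
    'numpy==1.14.6',
]


def find_mismatches(requirements, installed):
    """Return {name: (pinned, found)} for pins the environment does not meet.

    `found` is None when the package is missing from `installed`.
    The comparison is a strict string match, so '1.14' and '1.14.0'
    count as different here even though pip treats them as equal.
    """
    mismatches = {}
    for requirement in requirements:
        name, pinned = requirement.split('==')
        found = installed.get(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


# Hypothetical environment that still has the versions from the first errors.
installed = {
    'tensor2tensor': '1.13.4',
    'tensorflow': '1.13.1',
    'numpy': '1.14.0',
}
print(find_mismatches(REQUIRED_PACKAGES, installed))
```

An empty result means every pin is met; anything else lists the pinned version next to what was found.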

But I still get the following error on AI Platform:

2019-07-12 10:15:55.415651: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
WARNING: Logging before flag parsing goes to stderr.
W0712 10:15:57.498725 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/expert_utils.py:68: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0712 10:15:58.259337 139815414183680 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0712 10:16:00.313907 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/adafactor.py:27: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0712 10:16:00.314401 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/multistep_optimizer.py:32: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

W0712 10:16:00.326944 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/mesh_tensorflow/ops.py:4237: The name tf.train.CheckpointSaverListener is deprecated. Please use tf.estimator.CheckpointSaverListener instead.

W0712 10:16:00.327136 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/mesh_tensorflow/ops.py:4260: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.

W0712 10:16:00.361661 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/tensor2tensor/rl/gym_utils.py:219: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0712 10:16:00.398224 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/trainer_lib.py:109: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

W0712 10:16:01.014122 139815414183680 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:32: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0712 10:16:01.014338 139815414183680 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:32: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0712 10:16:01.014474 139815414183680 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:33: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

I0712 10:16:01.014980 139815414183680 usr_dir.py:43] Importing user module trainer from path /home/jupyter/training-data-analyst/courses/machine_learning/deepdive/09_sequence/poetry
W0712 10:16:01.015603 139815414183680 deprecation_wrapper.py:119] From /home/jupyter/training-data-analyst/courses/machine_learning/deepdive/09_sequence/poetry/trainer/problem.py:10: The name tf.summary.FileWriterCache is deprecated. Please use tf.compat.v1.summary.FileWriterCache instead.

W0712 10:16:01.016242 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/hparams_lib.py:49: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

W0712 10:16:01.199728 139815414183680 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/trainer_lib.py:780: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

I0712 10:16:01.452857 139815414183680 cloud_mlengine.py:337] Launching job transformer_poetry_line_problem_t2t_20190712_101601 with ML Engine spec:
{'jobId': 'transformer_poetry_line_problem_t2t_20190712_101601',
 'labels': {'hparams': 'transformer_poetry',
            'model': 'transformer',
            'problem': 'poetry_line_problem'},
 'trainingInput': {'args': ['--problem=poetry_line_problem',
                            '--log_step_count_steps=100',
                            '--worker_gpu=1',
                            '--dbgprofile=False',
                            '--worker_id=0',
                            '--registry_help=False',
                            '--xla_jit_level=-1',
                            '--decode_to_file=',
                            '--save_checkpoints_secs=0',
                            '--decode_hparams=',
                            '--wiki_revision_percent_identical_examples=0.04',
                            '--eval_early_stopping_metric_minimize=True',
                            '--use_cprofile_for_profiling=True',
                            '--log_dir=',
                            '--alsologtostderr=False',
                            '--logtostderr=False',
                            '--run_with_pdb=False',
                            '--tmp_dir=/tmp/t2t_datagen',
                            '--profile=False',
                            '--?=False',
                            '--run_with_profiling=False',
                            '--decode_from_file=',
                            '--output_dir=gs://sotrender-rd-cloud-training-demos-ml/poetry/model',
                            '--op_conversion_fallback_to_while_loop=False',
                            '--use_tpu_estimator=False',
                            '--worker_gpu_memory_fraction=0.95',
                            '--train_steps=7500',
                            '--ps_replicas=0',
                            '--pdb_post_mortem=False',
                            '--parsing_path=',
                            '--keep_checkpoint_every_n_hours=10000',
                            '--std_server_protocol=grpc',
                            '--v=0',
                            '--eval_early_stopping_metric=loss',
                            '--wiki_revision_num_train_shards=50',
                            '--wiki_revision_vocab_file=',
                            '--optionally_use_dist_strat=False',
                            '--intra_op_parallelism_threads=0',
                            '--gpu_order=',
                            '--showprefixforinfo=True',
                            '--tpu_num_shards=8',
                            '--test_srcdir=',
                            '--eval_throttle_seconds=600',
                            '--use_tpu=False',
                            '--eval_early_stopping_metric_delta=0.1',
                            '--worker_replicas=1',
                            '--eval_run_autoregressive=False',
                            '--ps_gpu=0',
                            '--hparams_set=transformer_poetry',
                            '--wiki_revision_introduce_errors=True',
                            '--local_eval_frequency=1000',
                            '--xla_compile=False',
                            '--only_check_args=False',
                            '--eval_timeout_mins=240',
                            '--inter_op_parallelism_threads=0',
                            '--generate_data=False',
                            '--wiki_revision_max_page_size_exp=26',
                            '--test_tmpdir=/tmp/absl_testing',
                            '--worker_job=/job:localhost',
                            '--wiki_revision_num_dev_shards=1',
                            '--eval_use_test_set=False',
                            '--iterations_per_loop=100',
                            '--test_random_seed=301',
                            '--model=transformer',
                            '--enable_graph_rewriter=False',
                            '--log_device_placement=False',
                            '--data_dir=gs://sotrender-rd-cloud-training-demos-ml/poetry/subset',
                            '--disable_ffmpeg=False',
                            '--sync=False',
                            '--keep_checkpoint_max=20',
                            '--xml_output_file=',
                            '--tfdbg=False',
                            '--wiki_revision_max_equal_to_diff_ratio=0.0',
                            '--master=',
                            '--wiki_revision_max_examples_per_shard=0',
                            '--wiki_revision_data_prefix=',
                            '--stderrthreshold=fatal',
                            '--verbosity=0',
                            '--timit_paths=',
                            '--eval_steps=100',
                            '--schedule=continuous_train_and_eval',
                            '--export_saved_model=False',
                            '--cloud_tpu_name=jupyter-tpu',
                            '--ps_job=/job:ps',
                            '--wiki_revision_revision_skip_factor=1.5',
                            '--helpxml=False',
                            '--decode_reference=',
                            '--hparams='],
                   'jobDir': 'gs://sotrender-rd-cloud-training-demos-ml/poetry/model',
                   'masterType': 'standard_p100',
                   'pythonModule': 'tensor2tensor.bin.t2t_trainer',
                   'pythonVersion': '3.5',
                   'region': 'us-central1',
                   'runtimeVersion': '1.13',
                   'scaleTier': 'CUSTOM'}}
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 33, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 388, in main
    cloud_mlengine.launch()
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/cloud_mlengine.py", line 338, in launch
    assert confirm()
AssertionError

I'm not sure what it is, to be honest, but it may have something to do with the option 'runtimeVersion': '1.13' in the job spec.

I've checked the T2T source, and the version is hard-coded there: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/cloud_mlengine.py
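
The suspected mismatch can be made explicit by comparing the runtimeVersion in the job spec against the TensorFlow pin from setup.py. A minimal sketch, assuming the spec is available as a Python dict shaped like the one logged above; `runtime_matches_pin` is a hypothetical helper, not part of T2T:

```python
def runtime_matches_pin(job_spec, requirements):
    """Return True if the spec's runtimeVersion equals the pinned TF major.minor."""
    runtime = job_spec['trainingInput']['runtimeVersion']
    for requirement in requirements:
        name, _, version = requirement.partition('==')
        if name == 'tensorflow':
            pinned_major_minor = '.'.join(version.split('.')[:2])
            return runtime == pinned_major_minor
    # No tensorflow pin found: nothing contradicts the runtime.
    return True


# Values copied from the logged spec and from the pinned requirements.
job_spec = {'trainingInput': {'pythonVersion': '3.5', 'runtimeVersion': '1.13'}}
requirements = ['tensor2tensor==1.13.4', 'tensorflow==1.14', 'numpy==1.14.6']
print(runtime_matches_pin(job_spec, requirements))  # False
```

Whether this mismatch is what actually breaks the job is not established here; the check only shows that the spec and the pins disagree.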

Local training works fine.
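
One more thing that may be worth ruling out, as an assumption that this thread does not confirm: the traceback ends at `assert confirm()`, and the command pipes the text `'Y'`, quotes included, into `t2t-trainer`. Python 3's `input()` returns that text unchanged, whereas Python 2's `input()` evaluates it to a bare `Y`. If the prompt compares the answer with a bare `Y`, the assertion would fail under Python 3. The sketch below uses a hypothetical `confirm` function, not T2T's actual code, to show the difference:

```python
import io


def confirm(stream):
    """Hypothetical prompt: accept only an answer that is exactly 'Y'."""
    answer = stream.readline().rstrip("\n")
    return answer == "Y"


# What `echo "'Y'"` writes to stdin: the single quotes are part of the text.
print(confirm(io.StringIO("'Y'\n")))  # False
# What `echo Y` writes to stdin.
print(confirm(io.StringIO("Y\n")))    # True
```

If that is the cause, piping a bare `Y` into the command would be the thing to try under Python 3.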