GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0

Flowers Sample: ValueError: string_input_producer requires a non-null input tensor #214

Closed: willbattel closed this issue 6 years ago

willbattel commented 6 years ago

System information

Describe the problem

When following the Flowers Cloud ML Sample, I am getting an unexpected error on this step https://cloud.google.com/ml-engine/docs/tensorflow/flowers-tutorial#run_model_training_in_the_cloud.

Source code / logs

All commands run to this point (from the cloudml-samples-master/flowers directory), with no errors prior to this one:

sudo pip install -r requirements.txt
virtualenv cmle-env
source cmle-env/bin/activate
export PATH="/Users/willbattel/Library/google-cloud-sdk/bin:$PATH"
export GOOGLE_APPLICATION_CREDENTIALS="/Users/willbattel/Desktop/myproject/google-application-credentials.json"
pip install --upgrade tensorflow
declare -r BUCKET_NAME="gs://somethingsomethingblahblahblah"
declare -r REGION="us-central1"
declare -r PROJECT_ID=$(gcloud config list project --format "value(core.project)")
declare -r JOB_NAME="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
declare -r GCS_PATH="${BUCKET_NAME}/${USER}/${JOB_NAME}"
declare -r DICT_FILE=gs://cloud-ml-data/img/flower_photos/dict.txt
declare -r MODEL_NAME=flowers2
declare -r VERSION_NAME=v1
set -v -e
python trainer/preprocess.py \
    --input_dict "$DICT_FILE" \
    --input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
    --output_path "${GCS_PATH}/preproc/eval" \
    --cloud
python trainer/preprocess.py \
    --input_dict "$DICT_FILE" \
    --input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" \
    --output_path "${GCS_PATH}/preproc/train" \
    --cloud
gcloud ml-engine jobs submit training "$JOB_NAME" \
    --stream-logs \
    --module-name trainer.task \
    --package-path trainer \
    --staging-bucket "$BUCKET_NAME" \
    --region "$REGION" \
    --runtime-version=1.4 \
    -- \
    --output_path "${GCS_PATH}/training" \
    --eval_data_paths "${GCS_PATH}/preproc/eval*" \
    --train_data_paths "${GCS_PATH}/preproc/train*"
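Note that JOB_NAME embeds a timestamp, so GCS_PATH changes every time these declare lines are re-run in a new shell session; preprocessing and training must see the identical value (as the final comment in this thread confirms). A minimal sketch of one way to keep the paths stable across sessions — the job_env.sh helper file is hypothetical, not part of the tutorial:

```shell
# Hypothetical helper (not part of the tutorial): derive the job paths once,
# save them to a file, and re-source that file in any later shell session so
# preprocessing and training share the identical GCS_PATH.
JOB_NAME="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
GCS_PATH="${BUCKET_NAME}/${USER}/${JOB_NAME}"

# Persist the derived values once...
printf 'export JOB_NAME="%s"\nexport GCS_PATH="%s"\n' "$JOB_NAME" "$GCS_PATH" > job_env.sh

# ...then in a later terminal, restore them instead of re-deriving:
source job_env.sh
echo "$GCS_PATH"
```

Re-sourcing the file reproduces the exact paths the preprocessing step wrote to, so the training job's globs match real objects.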

Full trace:

INFO    2018-06-25 22:28:15 -0500   service     Validating job requirements...
INFO    2018-06-25 22:28:15 -0500   service     Job creation request has been successfully validated.
INFO    2018-06-25 22:28:16 -0500   service     Waiting for job to be provisioned.
INFO    2018-06-25 22:28:16 -0500   service     Job flowers_willbattel_20180625_222734 is queued.
INFO    2018-06-25 22:28:18 -0500   service     Waiting for training program to start.
INFO    2018-06-25 22:28:53 -0500   master-replica-0        Running task with arguments: --cluster={"master": ["127.0.0.1:2222"]} --task={"type": "master", "index": 0} --job={  "package_uris": ["gs://robust-summit-184321-mlengine/flowers_willbattel_20180625_222734/bd11158966a2c08805fbd14869f2a197c52633ed9339b8d9eb391cac667d0e1a/trainer-0.1.tar.gz"],  "python_module": "trainer.task",  "args": ["--output_path", "gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/training", "--eval_data_paths", "gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/eval*", "--train_data_paths", "gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/train*"],  "region": "us-central1",  "runtime_version": "1.4",  "run_on_raw_vm": true}
INFO    2018-06-25 22:28:57 -0500   master-replica-0        Running module trainer.task.
INFO    2018-06-25 22:28:57 -0500   master-replica-0        Downloading the package: gs://robust-summit-184321-mlengine/flowers_willbattel_20180625_222734/bd11158966a2c08805fbd14869f2a197c52633ed9339b8d9eb391cac667d0e1a/trainer-0.1.tar.gz
INFO    2018-06-25 22:28:57 -0500   master-replica-0        Running command: gsutil -q cp gs://robust-summit-184321-mlengine/flowers_willbattel_20180625_222734/bd11158966a2c08805fbd14869f2a197c52633ed9339b8d9eb391cac667d0e1a/trainer-0.1.tar.gz trainer-0.1.tar.gz
INFO    2018-06-25 22:29:00 -0500   master-replica-0        Installing the package: gs://robust-summit-184321-mlengine/flowers_willbattel_20180625_222734/bd11158966a2c08805fbd14869f2a197c52633ed9339b8d9eb391cac667d0e1a/trainer-0.1.tar.gz
INFO    2018-06-25 22:29:00 -0500   master-replica-0        Running command: pip install --user --upgrade --force-reinstall --no-deps trainer-0.1.tar.gz
INFO    2018-06-25 22:29:01 -0500   master-replica-0        Processing ./trainer-0.1.tar.gz
INFO    2018-06-25 22:29:02 -0500   master-replica-0        Building wheels for collected packages: trainer
INFO    2018-06-25 22:29:02 -0500   master-replica-0          Running setup.py bdist_wheel for trainer: started
INFO    2018-06-25 22:29:02 -0500   master-replica-0        creating '/tmp/pip-wheel-_runqd/trainer-0.1-cp27-none-any.whl' and adding '.' to it
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer/model.py'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer/task.py'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer/util.py'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer/__init__.py'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer-0.1.dist-info/DESCRIPTION.rst'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer-0.1.dist-info/metadata.json'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer-0.1.dist-info/top_level.txt'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer-0.1.dist-info/WHEEL'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer-0.1.dist-info/METADATA'
INFO    2018-06-25 22:29:02 -0500   master-replica-0        adding 'trainer-0.1.dist-info/RECORD'
INFO    2018-06-25 22:29:02 -0500   master-replica-0          Running setup.py bdist_wheel for trainer: finished with status 'done'
INFO    2018-06-25 22:29:02 -0500   master-replica-0          Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e
INFO    2018-06-25 22:29:02 -0500   master-replica-0        Successfully built trainer
INFO    2018-06-25 22:29:02 -0500   master-replica-0        Installing collected packages: trainer
INFO    2018-06-25 22:29:02 -0500   master-replica-0        Successfully installed trainer-0.1
INFO    2018-06-25 22:29:02 -0500   master-replica-0        Running command: pip install --user trainer-0.1.tar.gz
INFO    2018-06-25 22:29:03 -0500   master-replica-0        Collecting tensorflow==1.0.1 (from trainer==0.1)
INFO    2018-06-25 22:29:03 -0500   master-replica-0          Downloading https://files.pythonhosted.org/packages/7e/7c/f398393beab1647be0a5e6974b8a34e4ea2d3cb7bd9e38bd43a657ed27d1/tensorflow-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl (44.1MB)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: wheel in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.0.1->trainer==0.1) (0.30.0a0)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: protobuf>=3.1.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.0.1->trainer==0.1) (3.6.0)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.0.1->trainer==0.1) (1.12.1)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.0.1->trainer==0.1) (1.10.0)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: mock>=2.0.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.0.1->trainer==0.1) (2.0.0)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: setuptools in /usr/lib/python2.7/dist-packages (from protobuf>=3.1.0->tensorflow==1.0.1->trainer==0.1) (20.7.0)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: pbr>=0.11 in /usr/local/lib/python2.7/dist-packages (from mock>=2.0.0->tensorflow==1.0.1->trainer==0.1) (4.0.4)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Requirement already satisfied: funcsigs>=1; python_version < "3.3" in /usr/local/lib/python2.7/dist-packages (from mock>=2.0.0->tensorflow==1.0.1->trainer==0.1) (1.0.2)
INFO    2018-06-25 22:29:08 -0500   master-replica-0        Building wheels for collected packages: trainer
INFO    2018-06-25 22:29:08 -0500   master-replica-0          Running setup.py bdist_wheel for trainer: started
INFO    2018-06-25 22:29:08 -0500   master-replica-0        creating '/tmp/pip-wheel-lrwwnO/trainer-0.1-cp27-none-any.whl' and adding '.' to it
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer/preprocess.py'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer/model.py'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer/task.py'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer/util.py'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer/__init__.py'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer-0.1.dist-info/DESCRIPTION.rst'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer-0.1.dist-info/metadata.json'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer-0.1.dist-info/top_level.txt'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer-0.1.dist-info/WHEEL'
INFO    2018-06-25 22:29:08 -0500   master-replica-0        adding 'trainer-0.1.dist-info/METADATA'
INFO    2018-06-25 22:29:09 -0500   master-replica-0          Running setup.py bdist_wheel for trainer: finished with status 'done'
INFO    2018-06-25 22:29:09 -0500   master-replica-0          Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e
INFO    2018-06-25 22:29:09 -0500   master-replica-0        Successfully built trainer
ERROR   2018-06-25 22:29:09 -0500   master-replica-0        gapic-google-cloud-logging-v2 0.91.3 has requirement google-gax<0.16dev,>=0.15.7, but you'll have google-gax 0.12.5 which is incompatible.
INFO    2018-06-25 22:29:09 -0500   master-replica-0        Installing collected packages: tensorflow, trainer
ERROR   2018-06-25 22:29:16 -0500   master-replica-0          The script tensorboard is installed in '/root/.local/bin' which is not on PATH.
ERROR   2018-06-25 22:29:16 -0500   master-replica-0          Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
INFO    2018-06-25 22:29:16 -0500   master-replica-0          Found existing installation: trainer 0.1
INFO    2018-06-25 22:29:16 -0500   master-replica-0            Uninstalling trainer-0.1:
INFO    2018-06-25 22:29:16 -0500   master-replica-0              Successfully uninstalled trainer-0.1
INFO    2018-06-25 22:29:16 -0500   master-replica-0        Successfully installed tensorflow-1.0.1 trainer-0.1
INFO    2018-06-25 22:29:17 -0500   master-replica-0        Running command: python -m trainer.task --output_path gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/training --eval_data_paths gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/eval* --train_data_paths gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/train*
INFO    2018-06-25 22:29:20 -0500   master-replica-0        Original job data: {u'python_module': u'trainer.task', u'region': u'us-central1', u'args': [u'--output_path', u'gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/training', u'--eval_data_paths', u'gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/eval*', u'--train_data_paths', u'gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/train*'], u'runtime_version': u'1.4', u'package_uris': [u'gs://robust-summit-184321-mlengine/flowers_willbattel_20180625_222734/bd11158966a2c08805fbd14869f2a197c52633ed9339b8d9eb391cac667d0e1a/trainer-0.1.tar.gz'], u'run_on_raw_vm': True}
INFO    2018-06-25 22:29:20 -0500   master-replica-0        setting eval batch size to 100
INFO    2018-06-25 22:29:20 -0500   master-replica-0        Starting master/0
WARNING 2018-06-25 22:29:20 -0500   master-replica-0        The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
WARNING 2018-06-25 22:29:20 -0500   master-replica-0        The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
WARNING 2018-06-25 22:29:20 -0500   master-replica-0        The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
WARNING 2018-06-25 22:29:20 -0500   master-replica-0        The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
INFO    2018-06-25 22:29:20 -0500   master-replica-0        Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
INFO    2018-06-25 22:29:20 -0500   master-replica-0        Started server with target: grpc://localhost:2222
ERROR   2018-06-25 22:29:20 -0500   master-replica-0        Traceback (most recent call last):
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            "__main__", fname, loader, pkg_name)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            exec code in run_globals
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            tf.app.run()
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            _sys.exit(main(_sys.argv[:1] + flags_passthrough))
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            run(model, argv)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            dispatch(args, model, cluster, task)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            Trainer(args, model, cluster, task).run_training()
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 186, in run_training
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            self.args.batch_size)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 297, in build_train_graph
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            return self.build_graph(data_paths, batch_size, GraphMod.TRAIN)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 222, in build_graph
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            num_epochs=None if is_training else 2)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 47, in read_examples
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0          File "/root/.local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 211, in string_input_producer
ERROR   2018-06-25 22:29:20 -0500   master-replica-0            raise ValueError(not_null_err)
ERROR   2018-06-25 22:29:20 -0500   master-replica-0        ValueError: string_input_producer requires a non-null input tensor
ERROR   2018-06-25 22:29:20 -0500   master-replica-0        Command '['python', '-m', u'trainer.task', u'--output_path', u'gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/training', u'--eval_data_paths', u'gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/eval*', u'--train_data_paths', u'gs://robust-summit-184321-mlengine/willbattel/flowers_willbattel_20180625_222734/preproc/train*']' returned non-zero exit status 1
INFO    2018-06-25 22:29:21 -0500   master-replica-0        Module completed; cleaning up.
INFO    2018-06-25 22:29:21 -0500   master-replica-0        Clean up finished.
ERROR   2018-06-25 22:29:47 -0500   service     The replica master 0 exited with a non-zero status of 1.
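The stack trace bottoms out in util.read_examples, which globs --train_data_paths and hands the resulting file list to tf.train.string_input_producer; if the glob matches no objects, the list is empty and the ValueError above is raised. A minimal Python sketch of that failure mode — glob_gcs and string_input_producer here are illustrative stand-ins, not TensorFlow's actual code — assuming the preprocessing output was written under a different job name than the one the training job globbed:

```python
# Sketch only: stand-ins for the glob + validation that the trainer performs.
import fnmatch

def glob_gcs(patterns, existing_objects):
    """Stand-in for the trainer's file glob: match patterns against object names."""
    return [obj for p in patterns for obj in existing_objects if fnmatch.fnmatch(obj, p)]

def string_input_producer(files):
    """Stand-in for tf.train.string_input_producer's emptiness check."""
    if not files:
        raise ValueError("string_input_producer requires a non-null input tensor")
    return files

# Preprocessing wrote under one job name (job_A), but training globbed another (job_B):
bucket = ["gs://bucket/user/job_A/preproc/train-00000"]
try:
    string_input_producer(glob_gcs(["gs://bucket/user/job_B/preproc/train*"], bucket))
except ValueError as e:
    print(e)  # prints the same message seen in the job log
```

Globbing with the job name that preprocessing actually used returns a non-empty list and the check passes, which matches the resolution reported at the end of this thread.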
dizcology commented 6 years ago

@wbattel4607 Could you confirm that the train_data_paths you pass in point to the correct paths?

willbattel commented 6 years ago

@dizcology It seems to be correct. I'm using the same value for $GCS_PATH as the preproc step, which succeeded. (However, the docs said the preproc step would take 60 minutes and it only took 8 when I ran it, so I'm not sure what's up with that.) Either way, the data is there, so I don't think that's the issue. I've been following the guide word-for-word because I'm totally new to ML.

willbattel commented 6 years ago

I figured it out using your suggestion. I had been using different job names between runs, when I was supposed to use the same one. All good now. Thank you!