PolyAI-LDN / conversational-datasets

Large datasets for conversational AI
Apache License 2.0

GCP Authentication Failure #42

Closed: tonyhqanguyen closed this issue 5 years ago

tonyhqanguyen commented 5 years ago

Hi, I was just wondering what the fix for this issue is. For the Reddit dataset, I have followed all the steps up to the point of executing:

python tools/tfrutil.py pp ${DATADIR?}/train-00999-of-01000.tfrecords

But when I do, I get this error:

2019-05-24 10:20:36.304120: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.753181 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
[attempts 2 through 10 repeat the same "Couldn't resolve host 'metadata'" error]
2019-05-24 10:20:44.186215: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".

Traceback (most recent call last):
  File "tools/tfrutil.py", line 118, in <module>
    _cli()
  File "/Library/Python/2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Library/Python/2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Python/2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Python/2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "tools/tfrutil.py", line 46, in _pretty_print
    for i, record in enumerate(tf.python_io.tf_record_iterator(path)):
  File "/Library/Python/2.7/site-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/Library/Python/2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 401 with body '{"error": {"errors": [{"domain": "global", "reason": "required", "message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords.", "locationType": "header", "location": "Authorization"}], "code": 401, "message": "Anonymous caller does not have storage.objects.get access to reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords."}}' when reading metadata of gs://reddit-conv-data/reddit/20190524/train-00999-of-01000.tfrecords

I suppose this is due to it not being able to access my credentials, so I followed the instructions here:

https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances

and downloaded a <project>-<code>.json credentials file of the form:

{
  "type": "service_account",
  "project_id": "xxxx",
  "private_key_id": "xxxxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\n xxxxxxx \n-----END PRIVATE KEY-----\n",
  "client_email": "xxxxx@developer.gserviceaccount.com",
  "client_id": "xxxxxxx",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "xxxxxxxxxxxxxx",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/xxxxxxx"
}

The error still persists. I would really appreciate any advice.
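For reference, TensorFlow's GCS reader only falls back to the GCE metadata server (the "Couldn't resolve host 'metadata'" retries above) when it cannot find local credentials; it can be pointed at a downloaded service-account key through the GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch, assuming the key was saved to ~/keys/sa-key.json (a hypothetical path):

```bash
# Make the downloaded service-account key visible to TensorFlow's GCS filesystem.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/sa-key.json"

# Alternatively, write application-default credentials for the local user:
# gcloud auth application-default login

# Then re-run the pretty-print step.
python tools/tfrutil.py pp ${DATADIR?}/train-00999-of-01000.tfrecords
```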

tonyhqanguyen commented 5 years ago

I think I got that part figured out. I thought the data had been successfully loaded into the bucket when I ran

python reddit/create_data.py \
  --output_dir ${DATADIR?} \
  --reddit_table ${PROJECT?}:${DATASET?}.${TABLE?} \
  --runner DataflowRunner \
  --temp_location ${DATADIR?}/temp \
  --staging_location ${DATADIR?}/staging \
  --project ${PROJECT?}

but apparently nothing was happening when I did that.
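A quick way to confirm whether the job has actually written any output shards is to list the output directory with gsutil (a sketch, assuming DATADIR is the gs:// output directory configured in the earlier setup steps):

```bash
# List everything the Dataflow job has written under the output directory so far.
gsutil ls ${DATADIR?}/

# Count the train shards written so far (1000 train-*-of-01000 shards are expected at completion).
gsutil ls ${DATADIR?}/ | grep -c train
```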

However, now that it's running, it's predicting that around 18 hours of runtime will be required. Is this normal?

matthen commented 5 years ago

Can you check how many workers the Dataflow job is using in the Dataflow console? You may need to increase your quota for it to parallelise over more machines.
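For checking from the command line instead of the console, the job's ID and state can be inspected with the gcloud CLI (a sketch; the region is an assumption, adjust to wherever the job is running):

```bash
# List active Dataflow jobs with their IDs and states.
gcloud dataflow jobs list --status=active --region=us-central1

# Show details for a specific job (replace JOB_ID with the ID from the list above).
gcloud dataflow jobs describe JOB_ID --region=us-central1
```

The current worker count and any autoscaling messages are shown on the job's page in the Dataflow console.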

From the readme:

Typical metrics for the Dataflow job:

Total vCPU time: 625.507 vCPU hr
Total memory time: 322.5 GB hr
Total persistent disk time: 156,376.805 GB hr
Elapsed time: 1h 38m (409 workers)
Estimated cost: 44 USD

tonyhqanguyen commented 5 years ago

@matthen Yeah, the quota I have is what's limiting the number of workers. Thanks!

matthen commented 5 years ago

cool, glad it's working! I added a note in #43