GoogleCloudPlatform / batch-samples


Script and Container jobs failing with: The resource 'projects/<ProjectNumber>/global/networks/default' was not found #27

Open rmharrison opened 1 year ago

rmharrison commented 1 year ago

I ran both busybox (Container Job) and transcoding (Script Job), using both the sample scripts and the web console. Both fail with the same error.

This error is not listed in the Troubleshooting documentation.

My best guess...

  1. Batch assumes the default network
  2. I don't have a default network set

==> How do I set my default network? I have an existing network interface that I've used without incident for a manually provisioned Compute Engine VM instance.

Busybox, Script

# gcloud beta batch jobs describe job-busybox-9172 --location=us-central1
...
status:
  runDuration: 0s
  state: FAILED
  statusEvents:
  - description: Job state is set from QUEUED to SCHEDULED for job projects/<ProjectNumber>/locations/us-central1/jobs/job-busybox-9172.
    eventTime: '2023-03-07T22:04:48.589892441Z'
    type: STATUS_CHANGED
  - description: "Job gets no longer retryable information Batch Error: code - CODE_GCE_RESOURCE_NOT_FOUND,\
      \ description - googleapi: Error 404: The resource 'projects/<ProjectNumber>/global/networks/default'\
      \ was not found, notFound, already retried 3 times, errors record CODE_GCE_RESOURCE_NOT_FOUND."
    eventTime: '2023-03-07T22:08:21.906612928Z'
    type: OPERATIONAL_INFO
  - description: Job state is set from SCHEDULED to SCHEDULED_PENDING_FAILED for job
      projects/<ProjectNumber>/locations/us-central1/jobs/job-busybox-9172.
    eventTime: '2023-03-07T22:08:22.671616487Z'
    type: STATUS_CHANGED
  - description: Job state is set from SCHEDULED_PENDING_FAILED to FAILED for job
      projects/<ProjectNumber>/locations/us-central1/jobs/job-busybox-9172.
    eventTime: '2023-03-07T22:08:23.845508708Z'
    type: STATUS_CHANGED

Transcoding, Web Console

# gcloud beta batch jobs describe transcode-manual --location=us-central1
...
status:
  runDuration: 0s
  state: FAILED
  statusEvents:
  - description: Job state is set from QUEUED to SCHEDULED for job projects/<ProjectNumber>/locations/us-central1/jobs/transcode-manual.
    eventTime: '2023-03-07T21:55:45.825619439Z'
    type: STATUS_CHANGED
  - description: "Job gets no longer retryable information Batch Error: code - CODE_GCE_RESOURCE_NOT_FOUND,\
      \ description - googleapi: Error 404: The resource 'projects/<ProjectNumber>/global/networks/default'\
      \ was not found, notFound, already retried 3 times, errors record CODE_GCE_RESOURCE_NOT_FOUND."
    eventTime: '2023-03-07T21:59:22.741203703Z'
    type: OPERATIONAL_INFO
  - description: Job state is set from SCHEDULED to SCHEDULED_PENDING_FAILED for job
      projects/<ProjectNumber>/locations/us-central1/jobs/transcode-manual.
    eventTime: '2023-03-07T21:59:23.388154925Z'
    type: STATUS_CHANGED
  - description: Job state is set from SCHEDULED_PENDING_FAILED to FAILED for job
      projects/<ProjectNumber>/locations/us-central1/jobs/transcode-manual.
    eventTime: '2023-03-07T21:59:24.291823883Z'
    type: STATUS_CHANGED
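
To check a job for this failure without reading the events by hand, something like the following could work. It is only a sketch: job_failed_on_default_network is a hypothetical helper name, not part of the Batch samples. Note that in the YAML output above the error description is wrapped across lines, so a grep for the full phrase would miss it; the sketch asks for JSON output instead, on the assumption that each description then stays on one line.

```shell
# Hypothetical helper, not part of the Batch samples.
# Usage: job_failed_on_default_network JOB_NAME LOCATION
# Returns 0 if the job's status events mention the missing default network.
job_failed_on_default_network() {
  # --format=json keeps each event description on a single line,
  # unlike the wrapped YAML, so a plain grep can match the phrase.
  gcloud beta batch jobs describe "$1" --location="$2" --format=json \
    | grep -q "networks/default' was not found"
}
```

For example: job_failed_on_default_network job-busybox-9172 us-central1 && echo "missing default network"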

rmharrison commented 1 year ago

Workaround using custom VM Instance Template (Transcoding example)

GCP Batch has instructions to use a custom VM instance template

I created an instance template via the web console, selecting my existing network interface under "Advanced options" > "Networking" > "Network interfaces".

I then modified job.json to use the instanceTemplate instead of the default policy:

...
  "allocationPolicy": {
    "instances": [
      {
        "instanceTemplate": "[instance-template-created-in-console]"
      }
    ]
  },
...
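
As an alternative to an instance template, the Batch allocation policy also appears to accept a network block that names the VPC and subnet directly. The sketch below prints such a fragment for pasting into job.json. The helper name print_network_policy and all four argument values are hypothetical placeholders, and the field names (network, networkInterfaces, subnetwork) should be checked against the current Batch API reference before use.

```shell
# Hypothetical helper that prints an allocationPolicy fragment for job.json.
# Usage: print_network_policy PROJECT_ID NETWORK_NAME REGION SUBNET_NAME
print_network_policy() {
  cat <<EOF
"allocationPolicy": {
  "network": {
    "networkInterfaces": [
      {
        "network": "projects/$1/global/networks/$2",
        "subnetwork": "projects/$1/regions/$3/subnetworks/$4"
      }
    ]
  }
}
EOF
}
```

For example: print_network_policy my-project my-vpc us-central1 my-subnet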

I also had to modify transcode.sh, adding quotes around the options:

vopts="-c:v libvpx-vp9 -b:v 1800k -minrate 1500 -maxrate 1610"

Without the quotes, all of my instances failed with "2023-03-07 17:54:35.356 EST /mnt/share/transcode.sh: line 26: libvpx-vp9: command not found"
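
The "command not found" message is consistent with how the shell parses an unquoted assignment. The sketch below assumes the original line simply lacked the quotes (the unquoted form is a guess and is shown commented out); count_args is a hypothetical helper used only to show how the variable expands.

```shell
# Unquoted (presumed original form, commented out so this sketch runs):
#   vopts=-c:v libvpx-vp9 -b:v 1800k -minrate 1500 -maxrate 1610
# The shell reads vopts=-c:v as a one-off environment assignment and then
# tries to execute libvpx-vp9 as a command, hence "command not found".

# Quoted: the whole option string is stored in the variable.
vopts="-c:v libvpx-vp9 -b:v 1800k -minrate 1500 -maxrate 1610"

# Hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion splits into separate arguments, as ffmpeg needs.
count_args $vopts     # prints 8
# Quoted expansion passes the whole string as one argument.
count_args "$vopts"   # prints 1
```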

rmharrison commented 1 year ago

Add to troubleshooting, because this was somewhat gnarly to resolve?

Root cause fix

If you see this error in the GCP Logs Explorer "Query results":

The resource 'projects/[PROJECT_NUMBER]/global/networks/default' was not found

The project [PROJECT_NUMBER] did not have a default VPC created at project creation. This can occur in centrally managed enterprise accounts where an enterprise administrator uses a global default for the organization instead of project-specific defaults.

There doesn't seem to be a way to restore the actual "default" network, as it is created only at project creation. See also: https://stackoverflow.com/questions/45789502/restore-google-cloud-default-network

However, you can resolve the error by manually creating a VPC named "default".

Briefly, from the GCP VPC Console:

  1. Click "Create VPC Network" near the top of the screen
  2. Set the "Name" to "default"
  3. Set "Subnets > Subnet creation mode" to "Automatic"
  4. Leave everything else at its default value

Complete instructions for creating a VPC: https://cloud.google.com/vpc/docs/create-modify-vpc-networks#create-auto-network
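
The console steps above can probably also be done from the command line. This is a sketch rather than a tested recipe: ensure_default_network is a hypothetical helper name, and the project ID is a placeholder. It creates an auto-mode VPC named "default" only when one does not already exist.

```shell
# Hypothetical helper, not part of the Batch samples.
# Usage: ensure_default_network PROJECT_ID
ensure_default_network() {
  project="$1"
  # describe exits non-zero when the network does not exist
  if gcloud compute networks describe default --project="$project" >/dev/null 2>&1; then
    echo "Network 'default' already exists in $project"
  else
    # --subnet-mode=auto corresponds to "Subnet creation mode: Automatic"
    gcloud compute networks create default --project="$project" --subnet-mode=auto
  fi
}
```

For example: ensure_default_network my-project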