GoogleCloudPlatform / compute-image-tools

Tools and scripts for Google Compute Engine images.
https://cloud.google.com/compute/docs/images
Apache License 2.0

Windows image import fails because of a timer in translate step #886

Open CristKa opened 5 years ago

CristKa commented 5 years ago

The gcloud compute images import command lets you change the overall timeout (2 hours by default). However, there is another timeout of 60 minutes within the translate step of the import workflow. When importing large .vhd files, this 60-minute timer can be exceeded and the whole import workflow fails:

Starting image translate...\"" 
[import-and-translate]: 2019-08-15T17:51:40Z Error running workflow: step "translate" run error: step "translate-disk" did not complete within the specified timeout of 1h0m0s
[import-and-translate]: 2019-08-15T17:51:40Z Workflow "import-and-translate" cleaning up (this may take up to 2 minutes).
2019/08/15 17:54:57 step "translate" run error: step "translate-disk" did not complete within the specified timeout of 1h0m0s
[import-and-translate]: 2019-08-15T17:54:57Z Workflow "import-and-translate" finished cleanup. 
ERROR 
ERROR: build step 0 "gcr.io/compute-image-tools/gce_vm_image_import:release" failed: exit status 1

https://github.com/GoogleCloudPlatform/compute-image-tools/blob/9d509270bfa832d53e046f75ae1254e43a1a9a45/daisy_workflows/image_import/windows/translate_windows_wf.json#L137
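For anyone triaging similar failures, here is a minimal Python sketch that pulls the failing step name and its timeout out of import log text. The regular expression is based only on the error line quoted above; the function name find_step_timeouts is made up for illustration and is not part of compute-image-tools or Daisy.

```python
import re

# Pattern derived from the error line quoted in this issue (an assumption:
# other Daisy versions may word the message differently).
TIMEOUT_RE = re.compile(
    r'step "(?P<step>[^"]+)" did not complete within the specified timeout of (?P<timeout>\S+)'
)


def find_step_timeouts(log_text):
    """Return unique (step, timeout) pairs for timeout errors found in log_text."""
    pairs = []
    for match in TIMEOUT_RE.finditer(log_text):
        pair = (match.group("step"), match.group("timeout"))
        if pair not in pairs:  # the same error is often logged more than once
            pairs.append(pair)
    return pairs


sample = (
    '[import-and-translate]: 2019-08-15T17:51:40Z Error running workflow: '
    'step "translate" run error: step "translate-disk" did not complete '
    'within the specified timeout of 1h0m0s\n'
    '2019/08/15 17:54:57 step "translate" run error: step "translate-disk" '
    'did not complete within the specified timeout of 1h0m0s\n'
)

print(find_step_timeouts(sample))
```

This only identifies which step hit its timeout. It says nothing about why the step stalled, which is what the rest of this thread tries to work out.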

adjackura commented 5 years ago

If that step is failing at 1 hour, I would expect the translation to have already failed. The size of the image being imported shouldn't affect the time to run that step much, and one hour should be more than sufficient. My guess is that something prevented either Windows from booting (drivers don't load) or the script runner from running (we've had problems with antivirus software in the past). If you view the serial logs, there probably isn't anything there after the firmware. We are working on more verbose error messages for this step that should hopefully make this type of failure easier to diagnose.

CristKa commented 5 years ago

All right, I agree. I somehow managed to start the translation instance manually and let it run for more than 10 hours, and it still has not finished.

The serial console says:

SeaBIOS (version 1.8.2-20190620_103534-google)
Total RAM Size = 0x00000003c0000000 = 15360 MiB
CPUs found: 4     Max CPUs supported: 4
found virtio-scsi at 0:3
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=1015021568 = 495616 MiB
drive 0x000f2940: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=1015021568
Booting from Hard Disk 0...
"Translate: Starting image translate..." 

The CPU is behaving strangely, as if stuck in a loop (see image below). It is a 4 vCPU instance, but it looks like only 1 vCPU is used:

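[image: Screenshot 2019-08-16 at 09 45 39] https://user-images.githubusercontent.com/8123374/63151852-b62b2e80-c00a-11e9-9156-415321ae7cc4.png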

Right, some additional logs would help to diagnose what is going on.
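As a side check on the serial console output above, the disk size can be worked out from the sector count and block size that the firmware reports. This is a small Python sketch of that arithmetic; the helper name disk_size is illustrative only.

```python
def disk_size(sectors, block_size=512):
    """Convert a sector count and block size into common size units."""
    total_bytes = sectors * block_size
    return {
        "bytes": total_bytes,
        "MiB": total_bytes / 2**20,  # binary mebibytes, as printed by SeaBIOS
        "GiB": total_bytes / 2**30,  # binary gibibytes
        "GB": total_bytes / 10**9,   # decimal gigabytes
    }


# Values taken from the "virtio-scsi blksize=512 sectors=1015021568" line.
size = disk_size(1015021568, 512)
for unit, value in size.items():
    print(unit, value)
```

The MiB figure matches the 495616 MiB printed by the firmware, and the same disk is 484 GiB, or roughly 520 GB in decimal units. That is close to the 520GB file mentioned later in this thread, although the thread does not establish whether the two are related.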

adjackura commented 5 years ago

That's actually really interesting output. Because you got that "Starting image translate" line, the image was able to boot successfully, but the process didn't kick off, as you didn't see any further output. This is exactly what we see if antivirus software or binary whitelisting prevents our software from running. Even if something else, like the network, had failed, I would expect at least an error on the serial console (the translate process should only take 10-20 minutes). What version of Windows is this? One other thing you could try, if you feel like troubleshooting, is waiting a bit and then resetting the instance. It should come back up and restart the translation process. If the issue is some software or driver oddity in your image, the reset is just a harsh way to force a reboot.

https://github.com/GoogleCloudPlatform/compute-image-tools/blob/master/daisy_workflows/image_import/windows/run_startup_scripts.cmd#L19
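The reasoning in the comments above can be sketched as a rough triage of serial console text. This is a minimal Python illustration under stated assumptions, not a tool from this repository: the function name, the category labels, and the marker string (taken from the serial output quoted in this issue) are all chosen for the example.

```python
# Marker string taken from the serial output quoted in this issue (assumption:
# other versions of the translate scripts may print something different).
TRANSLATE_MARKER = "Starting image translate"


def classify_serial_log(log_text):
    """Roughly classify where a translate run stalled, based on serial output."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    marker_idx = next(
        (i for i, ln in enumerate(lines) if TRANSLATE_MARKER in ln), None
    )
    if marker_idx is None:
        # Nothing from the translate script: Windows may not have booted,
        # or the script runner may have been blocked from running.
        return "nothing-after-firmware"
    if marker_idx == len(lines) - 1:
        # Booted and the script started, but the process never reported back
        # (the case described above as typical of antivirus or whitelisting).
        return "translate-started-but-silent"
    # Something was printed after the marker: read it for progress or errors.
    return "translate-produced-output"


sample = """SeaBIOS (version 1.8.2-20190620_103534-google)
Booting from Hard Disk 0...
"Translate: Starting image translate..."
"""
print(classify_serial_log(sample))
```

The labels only narrow down where to look next. They do not identify the software or driver responsible.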

adjackura commented 5 years ago

If you're stuck on diagnosing the issue and feel comfortable sharing the image with us for troubleshooting, I can share info over email on how to get it to us. Otherwise, we can continue to help you in this issue.

CristKa commented 5 years ago

Thanks for your help! This is a Windows 2008r2 machine. There are 3 partitions, and the .vhd file contains all 3 of them. All my first tries were done using this .vhd. I managed to mount the imported disk with the partitions on another Windows machine on GCE:

Screenshot 2019-08-16 at 23 21 17

After all of these attempts failed, I also tried importing only the boot disk partition (C:, 78.03GB, shown as E: in the screenshot) to GS and launching the daisy import. It looked like it worked but generated a 520GB file, and then the daisy translate didn't even start.

So I'm actually stuck and would like to share the image with you (I will need management approval on my side). Thanks a lot for the offer. You can email me the procedure.

leonardoantonio19 commented 9 months ago

I'm running into the same problem as CristKa. Were you able to fix it?