CCI-MOC / esi

Elastic Secure Infrastructure project

These unassigned ESI nodes fail the provision test #511

Closed tzumainn closed 2 months ago

tzumainn commented 4 months ago

I ran the provision test on the unassigned ESI nodes, and the nodes listed below failed. I'll put them into maintenance for now.

MOC-R4PAC21U05-S3, fail, Node bc0d30ef-4a8c-45d6-a040-38f020482dea reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node bc0d30ef-4a8c-45d6-a040-38f020482dea : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC21U07-S3, fail, Node cb63d0cb-2c8e-4d15-9b1c-7ec0db7dd30d reached failure state "clean failed"; the last error is None
MOC-R4PAC22U21-S1, fail, Node 347b0fd8-5d13-406d-b335-7979c7922005 reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 347b0fd8-5d13-406d-b335-7979c7922005 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC22U21-S3, fail, Node ab0834da-9319-4eea-9832-578d3dc94dc1 reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node ab0834da-9319-4eea-9832-578d3dc94dc1 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC22U23-S1, fail, Node 248f592f-5f86-455b-b279-32a86b5c9c9c reached failure state "clean failed"; the last error is None
MOC-R4PAC22U23-S3, fail, Node a6f5c97b-a77f-4bee-847b-a25d069acdc3 reached failure state "clean failed"; the last error is None
MOC-R4PAC22U17-S1, fail, Node 56fb2b27-0af7-4096-b90f-ff2ca01a51b3 reached failure state "clean failed"; the last error is None
MOC-R4PAC22U17-S3, fail, Node 06aa9cca-87b2-4c1c-a9aa-56c122c33316 reached failure state "clean failed"; the last error is None
MOC-R4PAC22U15-S1, fail, Node cb835840-6174-47b4-92f0-ada4bb641a7e reached failure state "clean failed"; the last error is None
MOC-R4PAC22U15-S3, fail, Node ed6d1370-50b3-43a6-a47a-d7ff9fc3d7c8 reached failure state "clean failed"; the last error is None
MOC-R4PAC22U09-S3, fail, Node 201b25a2-17dc-40fd-b36c-f8576fa0b727 reached failure state "clean failed"; the last error is None
MOC-R4PAC22U07-S1, fail, Node aff0d858-57b6-4ba1-930c-283c606012d7 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U27-S3, fail, Node c58dbe0f-9816-42cc-bf9f-df3a52d33d06 reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node c58dbe0f-9816-42cc-bf9f-df3a52d33d06 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC24U29-S3, fail, Node 2d32ca75-04d6-47f4-97c7-fa61bb4ab91f reached failure state "clean failed"; the last error is None
MOC-R4PAC24U25-S1, fail, Node 2afadaa5-1d77-4ad4-a995-029cf7024135 reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 2afadaa5-1d77-4ad4-a995-029cf7024135 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC24U23-S1, fail, Node 300cc62f-d51d-4e1d-acb1-b20cc766df1b reached failure state "clean failed"; the last error is None
MOC-R4PAC24U23-S3, fail, Node b385e960-810d-457f-96bf-deffaa12f49a reached failure state "clean failed"; the last error is None
MOC-R4PAC24U19-S1, fail, Node 0e9648cc-298a-4e1c-a4e4-7b4eb05b7f41 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U15-S1, fail, Node 152190b7-a761-4d5c-89f9-758986718975 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U13-S1, fail, Node 07b480f3-bcf2-4098-b80e-95ab4c75f0ab reached failure state "clean failed"; the last error is None
MOC-R4PAC24U09-S1, fail, Node d537a8c7-f7a9-4197-bd94-660765a2ba5f reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node d537a8c7-f7a9-4197-bd94-660765a2ba5f : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC24U17-S1, fail, Node 8028f494-31b7-4ac7-a49f-f6432022db20 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U11-S1, fail, Node 6e740347-d8ea-46d6-a04f-7f5960e9044c reached failure state "clean failed"; the last error is None
MOC-R4PAC24U07-S1, fail, Node fc2f65e4-80fd-446d-8b5d-e86609442696 reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node fc2f65e4-80fd-446d-8b5d-e86609442696 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC24U11-S3, fail, Node 5d320349-2c30-4301-bade-20dfe99c049d reached failure state "clean failed"; the last error is None
MOC-R4PAC24U07-S3, fail, Node 2c2f2098-2a31-4e4f-9464-161d38ddb403 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U37-S1C, fail, Node ee664a3b-3be7-4e06-9523-2a3cceea41dd reached failure state "clean failed"; the last error is None
MOC-R4PAC24U37-S1A, fail, Node eb932b4d-2d7c-4890-ac83-03240a1ae3ec reached failure state "clean failed"; the last error is None
MOC-R4PAC24U37-S1D, fail, Node b5a08bce-6157-4eca-aa90-c27cd991e281 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U35-S1A, fail, Node ecc54a0f-0355-4edd-952f-980242e22a0f reached failure state "clean failed"; the last error is None
MOC-R4PAC24U35-S1B, fail, Node 2fcb7ee5-5811-4396-8456-791d6ec134c2 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U35-S1C, fail, Node 14eb9473-730f-4bf9-9817-30728f595c95 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U35-S3A, fail, Node 43ceb9e9-dc47-4ad6-946c-fcab27509481 reached failure state "clean failed"; the last error is None
MOC-R4PAC24U35-S3B, fail, Node 99424991-d528-4335-b9f0-ba381011d6ca reached failure state "clean failed"; the last error is None
MOC-R4PAC24U35-S3C, fail, Node 3352e085-d737-4764-ae08-d4ab02b6f890 reached failure state "deploy failed"; the last error is Timeout reached while waiting for callback for node 3352e085-d737-4764-ae08-d4ab02b6f890
MOC-R4PAC08U37-S3B, fail, Node c5fcb4da-8fbc-496a-a7ea-27a695d7623e reached failure state "clean failed"; the last error is None
MOC-R4PAC08U37-S3C, fail, Node 499423c9-c2a9-45f6-ac12-5ba3e50844f2 reached failure state "clean failed"; the last error is None
MOC-R4PAC08U35-S3D, fail, Node fa2d5c0d-a857-4f9d-9df3-b704e8643bfa reached failure state "clean failed"; the last error is None
MOC-R4PAC08U31-S3D, fail, Node 5c02a05f-cd94-4527-9001-11758845d6dc reached failure state "deploy failed"; the last error is Timeout reached while waiting for callback for node 5c02a05f-cd94-4527-9001-11758845d6dc
MOC-R4PAC08U21-S3, fail, Node 6a1f0c7c-8565-4557-8fa7-5b345c9adc62 reached failure state "clean failed"; the last error is None
MOC-R4PAC08U19-S1, fail, Node de32f663-5841-49cd-bbfe-7fe88bdeaf74 reached failure state "clean failed"; the last error is None
MOC-R4PAC08U19-S3, fail, Node d29631cc-d2e5-41b7-9376-f1e2100d31df reached failure state "clean failed"; the last error is None
MOC-R4PAC08U13-S3, fail, Node bff10476-e35d-4d60-80ea-cd76675b102b reached failure state "clean failed"; the last error is None
MOC-R4PAC08U11-S1, fail, Node 19184704-3a21-4662-9831-f42cd80d7c51 reached failure state "clean failed"; the last error is None
MOC-R4PAC08U07-S1, fail, Node 80ea7175-0eb6-4070-b448-35bc94e037c8 reached failure state "clean failed"; the last error is None
MOC-R4PAC10U31-S3A, fail, Node 66fecde2-27fa-42e3-a953-18420824cfa9 reached failure state "deploy failed"; the last error is Timeout reached while waiting for callback for node 66fecde2-27fa-42e3-a953-18420824cfa9
MOC-R4PAC10U31-S3B, fail, Node a88d409c-e5ca-435b-a6c0-4347017a48ed reached failure state "deploy failed"; the last error is Timeout reached while waiting for callback for node a88d409c-e5ca-435b-a6c0-4347017a48ed
MOC-R4PAC10U31-S3D, fail, Node 5d318160-aa47-4df2-92cc-c97866a9608e reached failure state "clean failed"; the last error is None
MOC-R4PAC10U19-S1, fail, Node 656067ee-a873-4649-824d-b2ff5f6ded20 reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 656067ee-a873-4649-824d-b2ff5f6ded20 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC10U27-S3, fail, Node 4856811b-f5bc-4001-bf50-6a65a0df9ace reached failure state "clean failed"; the last error is None
MOC-R4PAC10U19-S3, fail, Node 40538451-45f5-4a1c-a250-68a61721e63e reached failure state "clean failed"; the last error is None
MOC-R4PAC10U05-S3, fail, Node b3492f07-0558-4e3c-9f98-ff18bdc59c24 reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node b3492f07-0558-4e3c-9f98-ff18bdc59c24 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC10U09-S3, fail, Node 3a6b3a21-06d5-4bc9-8c73-9775ae200033 reached failure state "clean failed"; the last error is None
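The failures in the log above fall into a few recurring categories, and a small script can tally them. The following is a minimal sketch, not the tooling used for the provision test: the function names (`classify`, `tally`) and the category labels are hypothetical, and the regular expressions assume every entry follows the `NAME, fail, Node UUID reached failure state "..."; the last error is ...` shape seen above. The sample input is three entries copied from the log.

```python
import re
from collections import Counter

# Three entries copied from the log above, one per failure category.
SAMPLE_LOG = """\
MOC-R4PAC21U05-S3, fail, Node bc0d30ef-4a8c-45d6-a040-38f020482dea reached failure state "deploy failed"; the last error is Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node bc0d30ef-4a8c-45d6-a040-38f020482dea : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: . stderr: Error: partition length of 6243221504 sectors exceeds the msdos-partition-table-imposed maximum of 4294967295
MOC-R4PAC21U07-S3, fail, Node cb63d0cb-2c8e-4d15-9b1c-7ec0db7dd30d reached failure state "clean failed"; the last error is None
MOC-R4PAC24U35-S3C, fail, Node 3352e085-d737-4764-ae08-d4ab02b6f890 reached failure state "deploy failed"; the last error is Timeout reached while waiting for callback for node 3352e085-d737-4764-ae08-d4ab02b6f890
"""

# Zero-width split point in front of each "NAME, fail, " prefix, so the parser
# works whether entries are separated by newlines or run together with spaces.
ENTRY_START = re.compile(r"(?=MOC-R\d+PAC\d+U\d+-S\w+, fail, )")

ENTRY = re.compile(
    r'(?P<name>MOC-\S+), fail, Node (?P<uuid>[0-9a-f-]{36}) reached failure state '
    r'"(?P<state>[^"]+)"; the last error is (?P<error>.*)',
    re.DOTALL,
)


def classify(state, error):
    """Map one entry to a coarse category (labels are hypothetical)."""
    if state == "clean failed":
        return "clean failed"
    if "msdos-partition-table-imposed maximum" in error:
        return "deploy failed: partition table limit"
    if "Timeout reached while waiting for callback" in error:
        return "deploy failed: callback timeout"
    return "deploy failed: other"


def tally(log_text):
    """Count entries per category in the pasted log text."""
    text = " ".join(log_text.split())  # collapse newlines and repeated spaces
    counts = Counter()
    for chunk in ENTRY_START.split(text):
        match = ENTRY.match(chunk.strip())
        if match:
            counts[classify(match["state"], match["error"])] += 1
    return counts


if __name__ == "__main__":
    for category, count in tally(SAMPLE_LOG).most_common():
        print(f"{count:3d}  {category}")
```

Counting the full log by hand gives 40 "clean failed" entries, 9 partition-table errors during `write_image`, and 4 callback timeouts, 53 in total.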

tzumainn commented 4 months ago

@hakasapl How urgent would it be to try and fix these nodes? There are 53 failures out of 187 free nodes, so 134 nodes are still good.

msdisme commented 4 months ago

@hakasapl For the drives that fail the image write, is this something techsquare can check and/or swap from the batch of drives that flax has? @naved and @hakasapl to discuss in a 2:30 call.
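For context on the `write_image` errors reported earlier in this thread: the maximum quoted in the error, 4294967295, is 2^32 - 1, the largest sector count an msdos (MBR) partition table entry can hold. The sketch below works through the arithmetic for the reported partition length. The 512-byte sector size is an assumption; the error message does not state it, and the drives may use a different size.

```python
# Values quoted in the error message from the provision test log.
REPORTED_SECTORS = 6243221504
MBR_MAX_SECTORS = 2**32 - 1  # 4294967295, a 32-bit sector count

# Assumption: 512-byte logical sectors (not stated in the error message).
SECTOR_BYTES = 512


def sectors_to_tib(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to tebibytes (1 TiB = 2**40 bytes)."""
    return sectors * sector_bytes / 2**40


if __name__ == "__main__":
    print("exceeds msdos limit:", REPORTED_SECTORS > MBR_MAX_SECTORS)
    print("requested partition: %.2f TiB" % sectors_to_tib(REPORTED_SECTORS))
    print("msdos maximum:       %.2f TiB" % sectors_to_tib(MBR_MAX_SECTORS))
```

Under that assumption the requested partition is about 2.9 TiB, while an msdos partition table tops out at about 2 TiB per partition, which is consistent with the error appearing only at the `write_image` step.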

hakasapl commented 4 months ago

@msdisme Let's discuss in the 2:30 and summarize here after. I have some questions for @tzumainn as well about this.

hakasapl commented 3 months ago

These nodes should be ready now (they need to be inspected and cleaned first):

tzumainn commented 3 months ago

These nodes should be ready now (they need to be inspected and cleaned first):

  • MOC-R4PAC21U05-S3
  • MOC-R4PAC22U21-S1
  • MOC-R4PAC22U21-S3
  • MOC-R4PAC24U27-S3
  • MOC-R4PAC24U25-S1
  • MOC-R4PAC24U09-S1
  • MOC-R4PAC24U07-S1
  • MOC-R4PAC10U19-S1
  • MOC-R4PAC10U05-S3

Confirmed! They've been inspected and provisioned successfully, and are now available and out of maintenance mode.

hakasapl commented 3 months ago

Thanks! On to the next nodes...

hakasapl commented 2 months ago

Rebuilding neutron with the new commits in network-runner should fix a lot of these issues. This will happen tomorrow, 5/7.

tzumainn commented 2 months ago

Most of these now work thanks to hakan's update to the switch playbooks! A few still have issues; I've created two separate issues for those: