NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/
Apache License 2.0

Failed jobs due to missing DW environment within flux allocation #194

Open mcfadden8 opened 1 month ago

mcfadden8 commented 1 month ago

General Problem: When sequentially submitting 532 single-node jobs to 532 nodes on the El Cap iotesting queue, I ran into two problems. The good news is that 523 jobs ran successfully. However, 6 jobs reported an error and 3 jobs were killed by a signal. This thread pertains to the 6 jobs that reported an error. (This is reproducible.)

bdevcich commented 3 weeks ago

Marty, can you provide more detail here or can this be closed?

mcfadden8 commented 3 weeks ago

https://llnl.slack.com/archives/C020U81E05U/p1723151132603649

mcfadden8 commented 3 weeks ago

Focusing in on 1 of the 3 jobs that were killed, I see the following from flux:

flux job info f2BVr7EH4GB9 eventlog
{"timestamp":1723142411.257493,"name":"submit","context":{"userid":54987,"urgency":16,"flags":0,"version":1}}
{"timestamp":1723142411.4714682,"name":"validate"}
{"timestamp":1723142411.7227607,"name":"dependency-add","context":{"description":"dws-create"}}
{"timestamp":1723142469.0961509,"name":"memo","context":{"rabbit_workflow":"fluxjob-508774988934663168"}}
{"timestamp":1723142484.6505346,"name":"dependency-remove","context":{"description":"dws-create"}}
{"timestamp":1723142484.6505933,"name":"depend"}
{"timestamp":1723142484.6506846,"name":"priority","context":{"priority":16}}
{"timestamp":1723142484.890485,"name":"alloc","context":{"annotations":{"user":{"rabbit_workflow":"fluxjob-508774988934663168"}}}}
{"timestamp":1723142484.8907204,"name":"prolog-start","context":{"description":"job-manager.prolog"}}
{"timestamp":1723142484.8907382,"name":"prolog-start","context":{"description":"cray-pals-port-distributor"}}
{"timestamp":1723142484.8907464,"name":"prolog-start","context":{"description":"dws-setup"}}
{"timestamp":1723142484.9701443,"name":"prolog-finish","context":{"description":"cray-pals-port-distributor","status":0}}
{"timestamp":1723142531.349112,"name":"memo","context":{"rabbits":"elcap438"}}
{"timestamp":1723142657.4949949,"name":"exception","context":{"type":"exception","severity":0,"note":"DWS/Rabbit interactions failed: workflow in 'TransientCondition' state too long: None","userid":765}}
{"timestamp":1723142657.4951036,"name":"prolog-finish","context":{"description":"dws-setup","status":1}}
{"timestamp":1723142657.4951787,"name":"epilog-start","context":{"description":"dws-epilog"}}
{"timestamp":1723142658.4516304,"name":"exception","context":{"type":"prolog","severity":0,"note":"prolog killed by signal 15 (timeout or job canceled)","userid":765}}
{"timestamp":1723142658.4516723,"name":"prolog-finish","context":{"description":"job-manager.prolog","status":36608}}
{"timestamp":1723142710.1166995,"name":"epilog-finish","context":{"description":"dws-epilog","status":0}}
{"timestamp":1723142710.1171415,"name":"free"}
{"timestamp":1723142710.1171782,"name":"clean"}

Flux created no output file for stdout or stderr.
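
For anyone trying to read the timing out of that eventlog, here is a minimal Python sketch that pairs each prolog-start with its prolog-finish and prints how long each prolog ran. The event lines are copied from the eventlog above; the helper names (`parse_eventlog`, `prolog_durations`, `exit_code`) are made up for illustration and are not part of flux. The decoding of `status` assumes it is a POSIX wait status, which I have not confirmed against the flux documentation.

```python
import json

# Prolog start/finish events copied verbatim from the eventlog above.
EVENTLOG = """\
{"timestamp":1723142484.8907204,"name":"prolog-start","context":{"description":"job-manager.prolog"}}
{"timestamp":1723142484.8907382,"name":"prolog-start","context":{"description":"cray-pals-port-distributor"}}
{"timestamp":1723142484.8907464,"name":"prolog-start","context":{"description":"dws-setup"}}
{"timestamp":1723142484.9701443,"name":"prolog-finish","context":{"description":"cray-pals-port-distributor","status":0}}
{"timestamp":1723142657.4951036,"name":"prolog-finish","context":{"description":"dws-setup","status":1}}
{"timestamp":1723142658.4516723,"name":"prolog-finish","context":{"description":"job-manager.prolog","status":36608}}
"""


def parse_eventlog(text):
    """Parse one JSON event per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def prolog_durations(events):
    """Map prolog description -> (elapsed seconds, status)."""
    starts = {}
    result = {}
    for ev in events:
        desc = ev.get("context", {}).get("description")
        if ev["name"] == "prolog-start":
            starts[desc] = ev["timestamp"]
        elif ev["name"] == "prolog-finish" and desc in starts:
            elapsed = ev["timestamp"] - starts[desc]
            result[desc] = (elapsed, ev["context"]["status"])
    return result


def exit_code(status):
    """ASSUMPTION: status is a POSIX wait status, so the exit code is the high byte."""
    return (status >> 8) & 0xFF


if __name__ == "__main__":
    for desc, (secs, status) in prolog_durations(parse_eventlog(EVENTLOG)).items():
        print(f"{desc}: {secs:.1f}s, status={status}, exit_code={exit_code(status)}")
```

On these events, dws-setup runs for about 172.6 seconds before finishing with status 1. If the wait-status assumption holds, 36608 decodes to exit code 143, that is 128 + 15, which would match the "prolog killed by signal 15" exception.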

mcfadden8 commented 3 weeks ago

Looking at the logs associated with this job, I see:

grep 508774988934663168 * | grep -i ERROR
compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: 2024-08-08T11:43:32-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: 2024-08-08T11:43:32-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: 2024-08-08T11:43:32-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "d567d7e2-6da0-4f80-bbc9-dc3c7413f422", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:43:52 elcap4795 clientmountd[47577]: 2024-08-08T11:43:52-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:43:52 elcap4795 clientmountd[47577]: 2024-08-08T11:43:52-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:43:52 elcap4795 clientmountd[47577]: 2024-08-08T11:43:52-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "6439d0d0-a995-4f5a-9d0c-d64213aa7dd7", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:02 elcap4795 clientmountd[47577]: 2024-08-08T11:44:02-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:02 elcap4795 clientmountd[47577]: 2024-08-08T11:44:02-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:02 elcap4795 clientmountd[47577]: 2024-08-08T11:44:02-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "58c7cc0c-076b-4a3a-9391-dd14c857ef9c", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:12 elcap4795 clientmountd[47577]: 2024-08-08T11:44:12-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:12 elcap4795 clientmountd[47577]: 2024-08-08T11:44:12-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:12 elcap4795 clientmountd[47577]: 2024-08-08T11:44:12-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "e71e1049-de3e-4eb4-b36d-37f63851a7ba", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:23 elcap4795 clientmountd[47577]: 2024-08-08T11:44:23-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:23 elcap4795 clientmountd[47577]: 2024-08-08T11:44:23-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:23 elcap4795 clientmountd[47577]: 2024-08-08T11:44:23-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "4f72d7a3-7926-4117-b99f-7527431b6d89", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:33 elcap4795 clientmountd[47577]: 2024-08-08T11:44:33-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:33 elcap4795 clientmountd[47577]: 2024-08-08T11:44:33-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:33 elcap4795 clientmountd[47577]: 2024-08-08T11:44:33-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "dbebe05a-1509-45f3-9822-51b1e721f4c8", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:43 elcap4795 clientmountd[47577]: 2024-08-08T11:44:43-07:00        INFO        controllers.ClientMount        internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}}
compute.journalctl.elcap4795:Aug 08 11:44:43 elcap4795 clientmountd[47577]: 2024-08-08T11:44:43-07:00        INFO        controllers.ClientMount        Recoverable Error        {"ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "Severity": "Major", "Message": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:44:43 elcap4795 clientmountd[47577]: 2024-08-08T11:44:43-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "d13fe73a-7bf0-40cc-9d55-158026440999", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
grep: rabbit.pods.elcap438: Is a directory

Note: I had to snip out some messages from the middle of the grep output above to make it fit in this comment.
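
Since the same error repeats on every reconcile attempt, a summary is easier to paste than the raw grep output. Below is a rough Python sketch that collapses the `ERROR  Reconciler error` lines into one entry per (ClientMount name, error message), with a count and the first and last timestamps. The two sample lines are copied from the output above; the regular expression and the `summarize` helper are my own assumptions about the log layout, not anything provided by clientmountd or the NNF tooling.

```python
import json
import re
from collections import defaultdict
from datetime import datetime

# Two ERROR lines copied verbatim from the grep output above.
SAMPLE = """\
compute.journalctl.elcap4795:Aug 08 11:43:32 elcap4795 clientmountd[47577]: 2024-08-08T11:43:32-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "d567d7e2-6da0-4f80-bbc9-dc3c7413f422", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
compute.journalctl.elcap4795:Aug 08 11:43:52 elcap4795 clientmountd[47577]: 2024-08-08T11:43:52-07:00        ERROR        Reconciler error        {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "ClientMount": {"name":"default-fluxjob-508774988934663168-0-computes","namespace":"elcap4795"}, "namespace": "elcap4795", "name": "default-fluxjob-508774988934663168-0-computes", "reconcileID": "6439d0d0-a995-4f5a-9d0c-d64213aa7dd7", "error": "internal error: mount/unmount failed: unable to activate block device: timeout waiting for device stat /dev/mapper/f6cd4565--7c06--4712--a31c--7f2db293bf57_0-lv--0: no such file or directory"}
"""

# ASSUMPTION: each ERROR line has an ISO-8601 timestamp, the level, the
# literal text "Reconciler error", and a trailing JSON object of fields.
ERROR_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})\s+"
    r"ERROR\s+Reconciler error\s+(?P<fields>\{.*\})\s*$"
)


def summarize(lines):
    """Group ERROR lines by (name, error) -> (count, first seen, last seen)."""
    groups = defaultdict(list)
    for line in lines:
        match = ERROR_RE.search(line)
        if match is None:
            continue  # INFO lines and grep diagnostics are skipped
        fields = json.loads(match.group("fields"))
        key = (fields["name"], fields["error"])
        groups[key].append(datetime.fromisoformat(match.group("ts")))
    return {key: (len(ts), min(ts), max(ts)) for key, ts in groups.items()}


if __name__ == "__main__":
    for (name, error), (count, first, last) in summarize(SAMPLE.splitlines()).items():
        print(f"{name}: {count} errors from {first.isoformat()} to {last.isoformat()}")
        print(f"  {error}")
```

To run it against the real logs, the same function could be fed the lines of each log file instead of `SAMPLE`. Lines that do not match the pattern are ignored, so the INFO duplicates drop out of the summary.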