NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/
Apache License 2.0

Passing negative-1 `MaxWaitTime` hangs `DataMovementStatusRequest` indefinitely #190

Open mcfadden8 opened 3 months ago

mcfadden8 commented 3 months ago

The documentation says: "", but the data movement status request call never returns.

2024-08-01 13:19:49:780 AXL rzadams1075: @ nnfdm_start:177 nnfdm::CreateRequest(src=/mnt/nnf/3c1bc64d-4355-48fa-898f-4af6c60d04b1-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0000-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/1/xxxx00000/xxxx-0000-00000.silo)
2024-08-01 13:19:49:804 AXL rzadams1075: @ nnfdm_start:177 nnfdm::CreateRequest(src=/mnt/nnf/3c1bc64d-4355-48fa-898f-4af6c60d04b1-0/martymcf/scr.defjobid/scr.dataset.1/xxxx00000.root, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/1/xxxx00000.root)
2024-08-01 13:19:49:820 AXL rzadams1075: @ nnfdm_wait:352 0
2024-08-01 13:19:49:820 AXL rzadams1075: @ nnfdm_stat:65 /mnt/nnf/3c1bc64d-4355-48fa-898f-4af6c60d04b1-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0000-00000.silo

The same call works if I pass a `MaxWaitTime` of 1 second and keep polling; the transfer completes after 5 to 10 seconds of polling.

bdevcich commented 3 months ago

Does it hang even when the NnfDataMovement resource in Kubernetes shows that it's finished? Can you check that once you make it hang?

This part of the API has always bothered me because I think a good API should always respond as quickly as possible to the client to minimize wait time and also confirm that nothing is wrong. It's like asking someone a question and they never respond.

Is this something that you use a lot?

mcfadden8 commented 2 months ago

How do I check that? Do you happen to have a test for this? Under what circumstances does it work?

I was only attempting to use it because the documentation said that I could. I reverted back to polling with a one-second timer. But we have use cases where users just want to wait until the copy is done before proceeding.

bdevcich commented 2 months ago

> How do I check that? Do you happen to have a test for this? Under what circumstances does it work?

As it's running (and presumably hanging), you can query the NnfDataMovement resource in k8s. You won't be able to do this in your application unless the compute nodes have k8s access, but you could do it from somewhere that does. This is basically what the DataMovementStatusRequest is doing for you:

kubectl get -n <rabbit-hostname> nnfdatamovements <request UID>

So if compute-node-1 was attached to rabbit-node-1 and the DataMovementCreateRequest returned a UID of nnf-dm-node-5vghx, you can do this to query it:

$ kubectl get nnfdatamovement -n rabbit-node-1 nnf-dm-node-5vghx
NAME                STATE      STATUS    ERROR   AGE
nnf-dm-node-5vghx   Finished   Success           4m54s
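
For anyone who wants to script that check from a machine with k8s access, here is a minimal sketch in Python. It shells out to the same `kubectl get` command and parses the fixed-width table by header offsets, so the empty ERROR column does not shift the other fields. The function names are made up for illustration, and it assumes the output columns look like the example above.

```python
import subprocess


def parse_kubectl_table(text):
    """Parse fixed-width `kubectl get` output into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0]
    names = header.split()
    # Each column starts where its header word starts.
    starts = []
    pos = 0
    for name in names:
        idx = header.index(name, pos)
        starts.append(idx)
        pos = idx + len(name)
    rows = []
    for ln in lines[1:]:
        row = {}
        for i, name in enumerate(names):
            end = starts[i + 1] if i + 1 < len(names) else len(ln)
            row[name] = ln[starts[i]:end].strip()
        rows.append(row)
    return rows


def get_data_movement(namespace, uid):
    """Hypothetical helper: run kubectl and return the first row, or None."""
    out = subprocess.run(
        ["kubectl", "get", "nnfdatamovement", "-n", namespace, uid],
        check=True, capture_output=True, text=True,
    ).stdout
    rows = parse_kubectl_table(out)
    return rows[0] if rows else None


def is_finished(row):
    """True when the STATE column reads Finished, as in the example output."""
    return row is not None and row.get("STATE") == "Finished"
```

With the example above, `is_finished(get_data_movement("rabbit-node-1", "nnf-dm-node-5vghx"))` would tell you whether the resource has reached the Finished state while the client call is still blocked.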

A MaxWaitTime of -1 is not going to respond until that nnfdatamovement is done. So if it's a large request, it's going to appear to hang since the response won't come until it's finished. I'm hoping that's what's happening here. If the nnfdatamovement resource is showing Finished and the request still isn't responding, then we have an issue.
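
To make the wait semantics concrete, here is a toy model in Python. It is not the server implementation, just an illustration of the behavior described above: a negative wait time blocks until completion, and a non-negative one returns after at most that many seconds. The class and method names are invented for this sketch.

```python
import threading


class ToyDataMovement:
    """Toy stand-in for a data movement whose completion can be waited on."""

    def __init__(self):
        self._done = threading.Event()

    def finish(self):
        """Mark the (simulated) copy as complete."""
        self._done.set()

    def status(self, max_wait_time):
        """Return "Finished" or "Running".

        A negative max_wait_time blocks until the copy completes;
        otherwise wait at most max_wait_time seconds before answering.
        """
        timeout = None if max_wait_time < 0 else max_wait_time
        finished = self._done.wait(timeout)
        return "Finished" if finished else "Running"
```

In this model, `status(-1)` on a large copy looks exactly like a hang from the caller's side, because nothing comes back until `finish()` is called.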

> I reverted back to polling with a one-second timer. But we have use cases where users just want to wait until the copy is done before proceeding.

I think this is the best way to do this. It ensures that the server is responding and isn't hung.
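
For the "just wait until the copy is done" use case, a small wrapper around the polling loop gives the same effect without an unbounded server-side wait. The following is a rough sketch in Python, not the nnf-dm client API: `get_status` is a hypothetical callable that issues one status request with the given maximum wait (in seconds) and returns a state string, and "Finished" is taken from the STATE column shown earlier.

```python
import time


def wait_for_completion(get_status, poll_wait=1.0, overall_timeout=None,
                        clock=time.monotonic):
    """Poll until get_status reports "Finished" or overall_timeout elapses.

    get_status(max_wait) is a hypothetical callable: it asks the server for
    status, letting the server wait at most max_wait seconds, and returns a
    state string. Each call is bounded, so every reply confirms the server
    is still responding.
    """
    deadline = None if overall_timeout is None else clock() + overall_timeout
    while True:
        state = get_status(poll_wait)
        if state == "Finished":
            return state
        if deadline is not None and clock() >= deadline:
            raise TimeoutError("data movement did not finish in time")
```

The loop has no explicit sleep because each request is assumed to block on the server for up to `poll_wait` seconds; if the status call returns immediately, add a short sleep between iterations. Leaving `overall_timeout` as `None` waits indefinitely, but in one-second steps.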