Open jsteel44 opened 4 years ago
Hello,
I appear to be hitting this issue with a larger stage-in.
dacd: Time up, waited more than 5 mins to complete.
dacd: Error in remote ssh run: 'bash -c "export DW_JOB_STRIPED='/mnt/dac/206543_job/global' && sudo -g '#1476600005' -u '#1476600005' rsync -r -ospgu --stats /path/to/app/stagein/ \$DW_JOB_STRIPED/"' error: signal: killed
In my Slurm config I increased the StageInTimeout and StageOutTimeout values, but I assume these have no impact?
scontrol show burst
Name=datawarp DefaultPool=default Granularity=1500GiB TotalSpace=180000GiB FreeSpace=180000GiB UsedSpace=0
Flags=EnablePersistent,PrivateData
StageInTimeout=3600 StageOutTimeout=3600 ValidateTimeout=1200 OtherTimeout=1200
GetSysState=/usr/local/bin/dacctl
GetSysStatus=/usr/local/bin/dacctl
A 5-minute timeout doesn't fit my use case, making the DAC unfit for purpose. Is there any way to set the timeout values in the dacd or Slurm burst buffer config?
Regards, Matt.
Edit: I installed from the data-acc-v2.6.tgz release where the timeout should be 10 minutes - has there been a regression on this commit?
Thanks for your feedback.
Totally makes sense to make this configurable. I am more than happy to review patches to help with that. We don't have anyone funding further development of this right now, otherwise I would look into that patch myself.
The Slurm config you have there sounds correct. Certainly Slurm can decide to give up waiting for the dacctl call independently of the DAC timing out, which is currently hardcoded.
In v2.6 the ansible timeout has increased to 10 mins, but the SSH command timeout is still 5 mins: https://github.com/RSE-Cambridge/data-acc/blob/4e890f41c8df12bfc4949bc093ffc32877934208/internal/pkg/filesystem_impl/mount.go#L278
I agree the best approach is to make the above configurable. Moreover, I think only the copy command will want to increase the timeout, as the other commands using this code really want a shorter timeout (with a separate configuration).
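For illustration, here is a minimal sketch of the shape such a patch could take: split the single hardcoded value into a longer copy timeout and a shorter timeout for the other ssh commands, and have callers pass the right one in. The names and values below are made up for the example (and would ultimately come from dacd configuration), not current data-acc code:

```go
// Sketch only: split timeouts for the data copy vs. the other remote commands.
// copyTimeout / commandTimeout are illustrative names, not existing dacd config.
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"time"
)

const (
	// Longer limit for the rsync data copy during stage-in/stage-out.
	copyTimeout = 60 * time.Minute
	// Shorter limit for mount/format/ansible style commands.
	commandTimeout = 5 * time.Minute
)

// runWithTimeout runs a command and kills it once the given timeout expires,
// producing an error like the "signal: killed" seen in the dacd log above.
func runWithTimeout(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Placeholder paths standing in for the real ssh + rsync stage-in command.
	err := runWithTimeout(copyTimeout, "rsync", "-r", "-ospgu", "--stats",
		"/tmp/stagein/", "/tmp/buffer/")
	if err != nil {
		log.Fatalf("stage-in copy failed: %v", err)
	}
}
```

Something along those lines, with the two values plumbed through from config rather than hardcoded, is exactly the kind of patch I'd be glad to review.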
The test users we worked with made almost no use of the copy functionality; I believe they generally wanted more control, so did the copy work inside their job scripts instead. It is nice to hear about people using the copy feature. It is worth knowing this uses only a basic single-node "rsync" copy, and doesn't attempt a more aggressive parallel copy, the idea being that the DAC shouldn't apply too much pressure on the typically slower filesystem it will be copying from.
Hello,
Thanks for the quick reply.
It sounds like a tactical workaround is to create a buffer pool with a small amount of data and then have the first step of the job copy the data in, rather than doing all the data transfer through the burst buffer API. I will suggest my client use that as a quick fix; however, it is a little clunky given the capability of the burst buffer.
The NFS storage is indeed far slower than a parallel file system, having been purposely under-specced with the expectation that the DAC will run all the high-speed parallel transactions on the compute nodes during job runtime.
I'll be keen to see how this develops into a longer-term strategic fix with configurable variables in dacd.conf.
Regards, Matt.
There is a hardcoded timeout of 5 minutes: https://github.com/RSE-Cambridge/data-acc/blob/fcf9efecb6ebb57c060fa7c65e6029119bf5eb92/internal/pkg/filesystem_impl/ansible.go#L243
We hit this timeout occasionally, so it would be nice to be able to give it a bit more time.
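For example, a minimal sketch like the following would keep today's 5-minute default but let operators extend it via the environment; DAC_ANSIBLE_TIMEOUT is a hypothetical name used for illustration, not an existing dacd setting:

```go
// Sketch only: read the ansible/ssh command timeout from a hypothetical
// environment variable, falling back to the current hardcoded 5 minutes.
package main

import (
	"fmt"
	"os"
	"time"
)

// ansibleTimeout keeps the 5 minute default unless a valid Go duration
// string such as "15m" or "1h" is supplied in DAC_ANSIBLE_TIMEOUT.
func ansibleTimeout() time.Duration {
	if d, err := time.ParseDuration(os.Getenv("DAC_ANSIBLE_TIMEOUT")); err == nil && d > 0 {
		return d
	}
	return 5 * time.Minute
}

func main() {
	fmt.Println("remote commands would be killed after:", ansibleTimeout())
}
```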
Thanks