RSE-Cambridge / data-acc

Data Accelerator: Creates a burst buffer from generic hardware and integrates it with Slurm https://www.hpc.cam.ac.uk/research/data-acc http://www.stackhpc.com
https://rse-cambridge.github.io/data-acc
Apache License 2.0

Make the 5 minute timeout configurable #121

Open jsteel44 opened 4 years ago

jsteel44 commented 4 years ago

There is a hardcoded timeout of 5 minutes: https://github.com/RSE-Cambridge/data-acc/blob/fcf9efecb6ebb57c060fa7c65e6029119bf5eb92/internal/pkg/filesystem_impl/ansible.go#L243

We hit this timeout occasionally so it would be nice to give it a bit more time.

Thanks

ocfmatt commented 3 years ago

Hello,

I appear to be hitting this issue on a larger stage-in.

dacd: Time up, waited more than 5 mins to complete.
dacd: Error in remote ssh run: 'bash -c "export DW_JOB_STRIPED='/mnt/dac/206543_job/global' && sudo -g '#1476600005' -u '#1476600005' rsync -r -ospgu --stats /path/to/app/stagein/ \$DW_JOB_STRIPED/"' error: signal: killed

In my Slurm config I increased the StageInTimeout and StageOutTimeout values, but I am assuming these have no impact?

scontrol show burst
Name=datawarp DefaultPool=default Granularity=1500GiB TotalSpace=180000GiB FreeSpace=180000GiB UsedSpace=0
  Flags=EnablePersistent,PrivateData
  StageInTimeout=3600 StageOutTimeout=3600 ValidateTimeout=1200 OtherTimeout=1200
  GetSysState=/usr/local/bin/dacctl
  GetSysStatus=/usr/local/bin/dacctl

A 5-minute timeout doesn't fit my use case, making the DAC unfit for purpose. Is there any way to set the timeout values in the dacd or Slurm burst buffer config?

Regards, Matt.

Edit: I installed from the data-acc-v2.6.tgz release, where the timeout should be 10 minutes. Has there been a regression on this commit?

JohnGarbutt commented 3 years ago

Thanks for your feedback.

Totally makes sense to make this configurable. I am more than happy to review patches to help with that. We don't have anyone funding further development of this right now, otherwise I would look into that patch myself.

The Slurm config you have there sounds correct. Certainly Slurm can decide to give up waiting for the dacctl call independently of the DAC timing out, which is currently hardcoded.

In v2.6 the ansible timeout has increased to 10 mins, but the SSH command timeout is still 5 mins: https://github.com/RSE-Cambridge/data-acc/blob/4e890f41c8df12bfc4949bc093ffc32877934208/internal/pkg/filesystem_impl/mount.go#L278

I agree the best approach is to make the above configurable. Moreover, I think only the copy command will want a longer timeout; the other commands using this code really want a shorter timeout (with a separate configuration).

The test users we worked with made almost no use of the copy functionality; I believe they generally wanted more control, so they did the copy work inside their job scripts instead. It is nice to hear about people using the copy feature. It is worth knowing that it uses only a basic single-node "rsync" copy and doesn't attempt a more aggressive parallel copy, the idea being that the DAC shouldn't apply too much pressure on the typically slower filesystem it will be copying from.

ocfmatt commented 3 years ago

Hello,

Thanks for the quick reply.

It sounds like a tactical workaround is to create a buffer pool with a small amount of data and then copy the data in as the first step of the job, rather than doing all data transfer through the buffer API. I will suggest my client use that as a quick fix; however, it is a little clunky given the capability of the burst buffer.

The NFS storage is indeed far slower than a parallel file system, having been purposely under-specced with the expectation that the DAC will run all the high-speed parallel transactions on compute nodes during job runtime.

I'll be keen to see how this develops for a longer term strategic fix with configurable variables in dacd.conf.

Regards, Matt.