Update: I was able to work around this with the following:

```bash
# Generate a random suffix so that each worker gets its own
# --local-directory instead of sharing one.
LDIR=/gpfs/flash/users/$USER/dask
UUID=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1)
...
--local-directory $LDIR/$UUID
```
I think the problem is that dask does not expect any other worker processes to be using the same `--local-directory`.
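For what it's worth, `mktemp -d` gets the same effect in one step (a sketch, not the verbatim jobscript; `SCHEDULER_ADDRESS` is a placeholder):

```bash
# mktemp -d atomically creates a unique directory, replacing the manual
# UUID pipeline above. SCHEDULER_ADDRESS stands in for the real
# scheduler address used in the jobscript.
mkdir -p "/gpfs/flash/users/$USER/dask"
WORKDIR=$(mktemp -d "/gpfs/flash/users/$USER/dask/worker-XXXXXXXX")
dask-worker "$SCHEDULER_ADDRESS" --local-directory "$WORKDIR"
```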
In general I recommend avoiding pointing `--local-directory` at network storage. Many small distributed writes are a good way to crash a network file system; I've personally seen this happen at a few large facilities.
You might want to read http://dask.pydata.org/en/latest/setup/hpc.html#no-local-storage
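Where nodes do have local disk, the simplest way to follow that advice is to point workers at a node-local path (a sketch; it assumes `/tmp` is genuinely node-local on the machine in question, and `SCHEDULER_ADDRESS` is a placeholder):

```bash
# Spill to node-local disk rather than the shared file system.
# On many HPC systems the better choice is a per-job scratch
# directory provided by the batch system, if one exists.
dask-worker "$SCHEDULER_ADDRESS" --local-directory /tmp
```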
In general, though, it sounds like the file-locking system may have issues on network file systems. It would be nice to fix this if/when people have the time.
Keep in mind that Wrangler is highly specialized for big data analysis and is not a typical HPC system. I don't know what exactly the /flash storage system is, but it is amazingly performant and resilient.
I asked the sysadmins what partition to use for temporary data in the following way:
> My application (dask) occasionally needs to spill temporary data to disk. The potential size of this temporary data is ~100 GB per node. This location does not have to be globally readable. The ideal solution would be a fast local drive, such as an SSD. But I can't find any info about local storage in the wrangler docs. There are also several global filesystems (/work, /data, flash, etc.), but I don't know if they are meant for this sort of thing. What filesystem do you suggest I use?
They responded as follows:
> Keep in mind also that the /gpfs/flash file system, while global, is effectively local to each node on Wrangler due to the DSSD architecture. So if the latency of your IO is an issue here /gpfs/flash will be your fastest option. Either /data or /gpfs/flash will work, it's just a matter of how performance-sensitive these operations are.
So I think this issue is really about multiple workers using the same directory. Once I worked around that, it worked great.
Have you ever encountered such a problem with multiple workers using the same directory on a local file system?
By the way, it looks like if you add the following line to your ~/.dask/config.yaml file, it will avoid file locking altogether:

```yaml
use-file-locking: False
```
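If you prefer to do it from the shell, something like this would append that line (a sketch; it assumes the file can simply be appended to):

```bash
# Append the setting to dask's per-user config file. If the file
# already defines use-file-locking, edit it by hand instead.
mkdir -p ~/.dask
echo "use-file-locking: False" >> ~/.dask/config.yaml
```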
If you have an opportunity to try your situation with the fix in https://github.com/dask/distributed/pull/1714, I would appreciate it.
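One way to try a not-yet-merged pull request is to install from the PR branch (a sketch; the PR number comes from the link above, everything else is generic git/pip usage):

```bash
# Fetch and install the code from the pull request referenced above.
git clone https://github.com/dask/distributed.git
cd distributed
git fetch origin pull/1714/head:pr-1714
git checkout pr-1714
pip install -e .
```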
I am trying to run dask on TACC Wrangler, and I am getting errors related to local disk storage.
Here is the SLURM jobscript I'm using to launch the workers: https://github.com/pangeo-data/pangeo/blob/master/utilities/wrangler/launch-dask-worker.sh
Note that Wrangler has a fast global storage system, which I use for dask's local directory. Specifically, in my jobscript (slightly updated from the one linked above), I say:
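A sketch of the relevant line only (not the verbatim jobscript; `SCHEDULER_ADDRESS` stands in for the real scheduler address, and the actual script linked above has more flags):

```bash
# All workers point --local-directory at the same shared GPFS path.
LDIR=/gpfs/flash/users/$USER/dask
dask-worker "$SCHEDULER_ADDRESS" --local-directory "$LDIR"
```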
I create a cluster with approximately 10 such workers. The total size of the final cluster is
As a test, say I want to `persist` a dataset bigger than the amount of memory in my cluster. This should be possible with local storage. When I do, I start seeing the following errors in the worker logs:
There are indeed some worker directories in the local directory. But `worker-apt4oj1w` is missing. Why would this happen? What is going on here?
This problem with local storage is infecting everything else I try to do on Wrangler.