flatironinstitute / mountainlab-js

MountainLab is data processing, sharing and visualization software for scientists. It is built around MountainSort, spike sorting software, but is designed to be more generally applicable.

unable to hard link, causes file copying #76

Open droumis opened 5 years ago

droumis commented 5 years ago

https://github.com/flatironinstitute/mountainlab-js/blob/e4bd5c26b7d604703ed4019f02cb7fe78412f816/mlproc/run_process.js#L1504

We (Frank lab) are having trouble creating hard links on our server, which causes the whole file to be copied and a significant slowdown.

Changing run_process.js to make symlinks instead of hard links fixes this.

Not sure if this is a good idea because naturally a symlink, even a 'fast' symlink, will always be slower than a hard link, but maybe not enough to matter in this situation. Are hard links required for some reason?

droumis commented 5 years ago

an alternative would be to just trigger symlink creation if hard links fail

magland commented 5 years ago

Symlinks are problematic because they depend on the temporary file not being deleted, whereas hardlinks do not have this problem.

alexmorley commented 5 years ago

I feel like there should be a nicer solution here but I need to think about it.

wysota commented 5 years ago

Both symlinks and hardlinks have their issues. One possible alternative would be to rely on filesystems such as btrfs that can make a copy of a file without actually copying the data (cp --reflink=always). So maybe the system can be made more robust by trying to hardlink first and, when that fails, making a reflink copy (with cp --reflink=auto, which falls back to a regular copy if reflink is not supported; --reflink=always fails instead). Symlinking is not really an option despite having the advantage of being able to link across devices (which the other solutions can't do).
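
For illustration only, the "hardlink first, then reflink copy" idea could be sketched in Node.js roughly as below. The helper name linkOrCloneSync is hypothetical and is not taken from run_process.js. The sketch relies on fs.constants.COPYFILE_FICLONE, which asks Node for a copy-on-write clone and falls back to an ordinary copy when the platform or filesystem does not support one.

```javascript
const fs = require('fs');

// Hypothetical helper, not part of mountainlab-js.
// Try a hard link first; if that fails (for example with EXDEV because
// src and dst live on different filesystems), fall back to a copy that
// requests a reflink where the filesystem supports it.
function linkOrCloneSync(src, dst) {
  try {
    fs.linkSync(src, dst);
    return 'hardlink';
  } catch (err) {
    // Do not silently overwrite an existing destination.
    if (err.code === 'EEXIST') throw err;
    // COPYFILE_FICLONE: copy-on-write clone if possible, otherwise a
    // regular byte-for-byte copy.
    fs.copyFileSync(src, dst, fs.constants.COPYFILE_FICLONE);
    return 'copy';
  }
}
```

The return value only says which branch ran; it cannot tell whether the copy was a true reflink or a full copy, so a caller that cares about the time cost would still need to measure it.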

@droumis : What is the reason that you can't do hardlinks on your server?

tjd2002 commented 5 years ago

I can answer that for @droumis: The data we want to sort is stored on a large storage server mounted over the network (NFS), but we have access to fast, local SSD storage that we hope to use for temp files. Hardlinks only work within a single filesystem.

One solution for us would be to have our data and temp dirs on the same filesystem, either by copying our data over to the local device before sorting, or by using a temporary directory on the networked server. We will test both these options.
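
A quick way to check whether a data file and a temp dir are candidates for hard linking is to compare the device ids that stat reports, as in the sketch below. Both helper names are hypothetical. Treat the comparison as a heuristic: some setups (bind mounts, for example) can still refuse a hard link, so the authoritative signal remains an EXDEV error from the link call itself.

```javascript
const fs = require('fs');

// Hypothetical helper: compares the device ids reported by stat().
// Different ids mean a hard link between the two locations will not work.
function onSameDevice(pathA, pathB) {
  return fs.statSync(pathA).dev === fs.statSync(pathB).dev;
}

// Hypothetical helper: true when an error thrown by fs.link / fs.linkSync
// indicates that source and destination are on different filesystems.
function isCrossDeviceError(err) {
  return Boolean(err) && err.code === 'EXDEV';
}
```

For example, onSameDevice('/nfs/data/raw.mda', '/ssd/tmp') (both paths made up here) would be expected to return false in the setup described above.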

alexmorley commented 5 years ago

This is worth thinking about as this is a very common set-up in labs I've been to...

wysota commented 5 years ago

Falling back to NFS for the temporary cache is not an option; it would be hard to call that a cache then. If your data fits on the fast SSD, this would be your best choice. Symlinks will not do you any good since if you symlink to NFS storage, you get no benefit from the fast SSD.

tjd2002 commented 5 years ago

Understood about the symlinks. I think that’s a red herring—we just need to decide when it makes sense to copy and be deliberate about that.

The right answer is likely to be system-dependent. For instance, our file server connection is very fast (300 MB/s read and write; yes, megabytes). Our ‘local’ SSD is actually only 2x as fast (BeeGFS). So, depending on the read pattern, it may not actually be worth it for us to copy the files from NFS rather than just reading them directly.
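
As a rough back-of-the-envelope sketch of that trade-off: the model below assumes purely sequential reads, no page caching, and a one-time copy limited by the slower of the two devices. The 600 MB/s figure is just "2x as fast" applied to 300 MB/s, and the 60000 MB file size is made up for the example.

```javascript
// Estimated seconds to read `sizeMB` of data `passes` times directly
// from network storage.
function directReadSeconds(sizeMB, passes, netMBps) {
  return (passes * sizeMB) / netMBps;
}

// Estimated seconds to copy the data once to local storage and then read
// it `passes` times locally. The copy is assumed to run at the speed of
// the slower device.
function copyThenReadSeconds(sizeMB, passes, netMBps, localMBps) {
  const copySeconds = sizeMB / Math.min(netMBps, localMBps);
  return copySeconds + (passes * sizeMB) / localMBps;
}

// Hypothetical 60000 MB recording, 300 MB/s network, 600 MB/s local.
for (const passes of [1, 2, 3]) {
  console.log(
    passes,
    directReadSeconds(60000, passes, 300),
    copyThenReadSeconds(60000, passes, 300, 600)
  );
}
```

Under these assumptions a single pass takes 200 s directly versus 300 s with a copy, two passes break even at 400 s, and only from the third pass on does copying win (600 s versus 500 s). Random-access read patterns or caching could shift that break-even point in either direction.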

wysota commented 5 years ago

Maybe it should just be a config option for the user with the default of hardlink-or-copy. The user could switch to hardlink-or-symlink or something like that.

tjd2002 commented 5 years ago

> Maybe it should just be a config option for the user with the default of hardlink-or-copy. The user could switch to hardlink-or-symlink or something like that.

I'm concerned that if we allow mountainlab to create its output files by symlinking back to the temp dir, we'll end up with much worse bug reports (... "Help, my spike sorting worked 6 months ago, but now half the output files are broken links to a non-existent filepath--where's my data??"). I guess a savvy user could select this option, then make a copy of the data while dereferencing symlinks (cp -L) before the cache is cleared, but this approach feels brittle/dangerous.

Currently, in run_process.js, we are using hardlinks both to link input files into the temp dir (with the function link_inputs, that calls move_file), and to copy the requested output files out of the temp dir (in the function move_outputs, which calls make_hard_link_or_copy).

I think it would be safe, and possibly appropriate, to use symlinks in the case of moving input files to the temp dir. But I think we definitely don't want to allow symlinks in the output case.
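
To make the asymmetry concrete, here is a minimal sketch of the two policies. The function names stageInputSync and publishOutputSync are hypothetical stand-ins and do not reproduce link_inputs or move_outputs from run_process.js. Note also that creating symlinks may require extra privileges on Windows.

```javascript
const fs = require('fs');

// Inputs: hard link when possible, otherwise symlink to the original.
// A symlink is tolerable here because the temp dir is the short-lived
// side; the original input is expected to outlive it.
function stageInputSync(src, tmpPath) {
  try {
    fs.linkSync(src, tmpPath);
    return 'hardlink';
  } catch (err) {
    // Use the resolved absolute path so the link does not depend on cwd.
    fs.symlinkSync(fs.realpathSync(src), tmpPath);
    return 'symlink';
  }
}

// Outputs: hard link when possible, otherwise a real copy. Never a
// symlink, so the result survives the temp dir being cleared.
function publishOutputSync(tmpPath, dst) {
  try {
    fs.linkSync(tmpPath, dst);
    return 'hardlink';
  } catch (err) {
    fs.copyFileSync(tmpPath, dst);
    return 'copy';
  }
}
```

With this split, the only files that can ever become dangling links live inside the temp dir itself, which is discarded anyway.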

(It could also be that in some pipelines, we are implicitly relying on fast hardlinking of intermediate outputs without considering that they may be incurring the time penalty of a full file copy. There could be room to optimize there, too.)