cole-trapnell-lab / cufflinks

Boost Software License 1.0
310 stars 116 forks source link

Output directory creation can fail on lustre filesystems #53

Open dtrudg opened 8 years ago

dtrudg commented 8 years ago

We are using cufflinks on an HPC system with the main project space on a lustre filesystem. We noticed cufflinks routinely failing for users when they specified an absolute path for an output directory with -o

On further investigation it appears that the mkpath funtion in src/common.cpp is recursive and calling mkdir at each level of the requested path, from root to tip. If a level exists it expects, and handles, an EEXIST error from mkdir.

Unfortunately lustre can return EPERM for mkdir against an existing directory:

https://jira.hpdd.intel.com/browse/LU-4185

This is a long-standing issue with lustre, which doesn't appear to be going away. We don't hit it in other software that does recursive directory creation as they generally call stat to find how much of the structure exists, rather than relying on EEXIST being returned from mkdir.

I will patch cufflinks locally for this issue, to use stat in this manner. Would this be considered as a PR?

jrdemasi commented 8 years ago

Hey, @ctrapnell! Just curious how open you would be if I submitted a PR and tried to tackle this one to actually accepting a patch? Some people like their code to always be their code.

ctrapnell commented 8 years ago

That would be great! Thanks for offering to help.

Cole

Cole Trapnell

Sent from https://polymail.io/?utm_source=polymail&utm_medium=referral&utm_campaign=signature

On Tue, Aug 16, 2016 at 11:28 AM jrdemasi

< mailto:jrdemasi notifications@github.com

wrote:

a, pre, code, a:link, body { word-wrap: break-word !important; }

Hey, https://github.com/ctrapnell ! Just curious how open you would be if I submitted a PR and tried to tackle this one to actually accepting a patch? Some people like their code to always be their code.

You are receiving this because you were mentioned.

Reply to this email directly, https://github.com/cole-trapnell-lab/cufflinks/issues/53#issuecomment-240193529 , or https://github.com/notifications/unsubscribe-auth/AAR_GTFMYYd6SElKNYZjwic6fdsihIM5ks5qggE5gaJpZM4HarXG .

jrdemasi commented 8 years ago

@dctrud Didn't actually read the entirety of your post, but am not going to double up on a PR if you've already patched it. Can you let me know your progress on this?

Thanks!

dtrudg commented 8 years ago

@jrdemasi - sorry for very slow repy. Didn't actually patch locally in the end. We have a workaround for our workflows to just create directories first, and our vendor is going to supply a patched lustre as we have similar issues with other software, particularly MATLAB.

thiell commented 7 years ago

cufflinks should get mkpath fixed (particularly this: https://github.com/cole-trapnell-lab/cufflinks/blob/master/src/common.cpp#L283-L289) and call stat() because POSIX doesn't dictate that EEXIST should be returned first by mkdir in case of multiple errors, like EPERM and EEXIST, when for example directories were pre-created by another user and thus not writable. This behavior can happen on NFS too and should be fixed by the application.

jrdemasi commented 7 years ago

@thiell Is that issue directly related to the lustre bug described in this? I actually have a path for this but have been extremely lazy in making a PR (sorry, @ctrapnell!)

thiell commented 7 years ago

@jrdemasi yes.. it is related to this issue, but what I say is that it is first a cufflinks bug and not a lustre bug, as it is POSIX (weird) behavior that can also occur on NFS. I just noticed that another person proposed a PR (#80) for that issue too, it seems. We have several reports of users having this problem on our site so I ended up here. Recent versions of Lustre will include a workaround for this, as it is recognized that this POSIX behavior is inconsistent, but cufflinks should be fixed too IMHO.

jrdemasi commented 7 years ago

Oh, looks like I don't even need to submit a PR then unless @ctrapnell doesn't want a solution based in boost, as mine was just C++ I wrote without any new libraries.

Thanks for pointing that out

On Mar 4, 2017 12:08, "thiell" notifications@github.com wrote:

@jrdemasi https://github.com/jrdemasi yes.. it is related to this issue, but what I say is that it is first a cufflinks bug and not a lustre bug, as it is POSIX (weird) behavior that can also occur on NFS. I just noticed that another person proposed a PR (#80 https://github.com/cole-trapnell-lab/cufflinks/pull/80) for that issue too, it seems. We have several reports of users having this problems on our site so I ended up here. Recent versions of Lustre will include a workaround for this, as it is recognized that this POSIX behavior is a it weird, but cufflinks should be fixed too IMHO.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cole-trapnell-lab/cufflinks/issues/53#issuecomment-284173602, or mute the thread https://github.com/notifications/unsubscribe-auth/ARqQJ3WEfgX5Q3jnLhlEeLlTTGCR3wFLks5ribaTgaJpZM4HarXG .