LLNL / Silo

Mesh and Field I/O Library and Scientific Database
https://silo.llnl.gov

Failure of db_hdf5_MkDir #339

Closed nDimensionalSpace closed 11 months ago

nDimensionalSpace commented 11 months ago

db_hdf5_MkDir returns the error text "Low-level function call failed". This occurs on both rzwhippet and rztopaz (toss_4), but not on rzgenie (toss_3) (to the extent that I can determine). Occurrences seem stochastic and non-repeatable, but I have seen the error occur between 1 and 6 hours into a run, and in particular, not on the first plot state. Screenshot attached.

[Screenshot: error dialog showing "Low-level function call failed"]
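For context, a minimal sketch of checking the failing call and surfacing Silo's error text (the wrapper function and directory name are illustrative, not part of the reporter's code; `DBMkDir` and `DBErrString` are the public Silo API behind `db_hdf5_MkDir`):

```c
#include <stdio.h>
#include <silo.h>

/* Sketch: check DBMkDir's return and report Silo's last error text
 * instead of letting the failure pass silently. */
void make_cycle_dir(DBfile *dbfile, const char *dirname)
{
    if (DBMkDir(dbfile, dirname) < 0) {
        /* DBErrString() returns Silo's most recent error message,
         * e.g. "Low-level function call failed" */
        fprintf(stderr, "DBMkDir(\"%s\") failed: %s\n",
                dirname, DBErrString());
    }
}
```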

markcmiller86 commented 11 months ago

Are you using the toss4 builds on the rz for the cases that are failing?

nDimensionalSpace commented 11 months ago

No, we aren't. On toss3, for some reason, we were doing our own build. Maybe you made some changes for us that weren't publicly available, I can't recall . . . That being said, the source we are using is labelled 4.10.2.

markcmiller86 commented 11 months ago

Ok, does that mean you are using toss3 builds on the toss4 systems though?

If so, I worry that using a toss3 build on a toss4 system could lead to indeterminate behavior.

There are toss4 builds available there on rz for Silo 4.11.0 I think.

nDimensionalSpace commented 11 months ago

Ah right. No, we rebuilt all of our libs for toss4. I am looking at the settings file right now, and it has the requisite references to toss4 throughout. I was just wondering about whether you all did any kind of build tuning associated with changes in the fs . . .

That being said, I am happy to try the system builds, if you think it might make a difference.

markcmiller86 commented 11 months ago

> I was just wondering about whether you all did any kind of build tuning associated with changes in the fs . . .

In the jump from 4.10.3 to 4.11, a lot changed and yes, I think some performance issues were tuned. That said, none of it was file system specific. Are all of your runs on Lustre or some on Lustre and some on IBM Spectrum Scale (GPFS)?

nDimensionalSpace commented 11 months ago

The ones that are erroring out are all on Lustre.

markcmiller86 commented 11 months ago

@nDimensionalSpace apologies, but what version of Silo? I should adjust Silo's error messages to include that.
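As an aside, the library version can be queried at run time with Silo's `DBVersion()`; a minimal sketch (the `printf` wrapper is just illustrative):

```c
#include <stdio.h>
#include <silo.h>

int main(void)
{
    /* DBVersion() reports the version of the linked Silo library;
     * DBFileVersion() similarly reports the version that wrote an
     * already-open file. */
    printf("Silo library version: %s\n", DBVersion());
    return 0;
}
```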

nDimensionalSpace commented 11 months ago

4.10.2

nDimensionalSpace commented 11 months ago

@markcmiller86: Any further thoughts on this, or is "try the system builds" the primary tactic at this point? Thx

markcmiller86 commented 11 months ago

Sorry... very puzzling. I was expecting you to tell me you are seeing this on 4.11, not 4.10.2 (or 4.10.3). That said, I am still not clear on one point: did you build 4.10.2 for toss4? If so, did you also build HDF5 for toss4?

nDimensionalSpace commented 11 months ago

Yes, we built HDF5 for toss4, and used that as the dependency for Silo 4.10.2 to be built on toss4 as well.

markcmiller86 commented 11 months ago

> Yes, we built HDF5 for toss4, and used that as the dependency for Silo 4.10.2 to be built on toss4 as well.

Ok, I am slammed with meetings today. I expect to take a look tomorrow PM. I am not sure I'll be able to reproduce this without more info.

nDimensionalSpace commented 11 months ago

Hmmm . . . You likely won't be able to reproduce. Now that I think about it, I think your best response to me is really something like "come back when you can reproduce this effect". In that vein, let me go and do some more testing, and see what I can see.

nDimensionalSpace commented 11 months ago

Grrr . . . I just spent the last two days running all possible permutations of the affected problem, and no luck. 😐

I am going to close this issue now. But if it comes up again and I recall this discussion, I will reopen this guy, and pass you the files.

markcmiller86 commented 11 months ago

I wonder if it was intermittent file system behavior. That is not unheard of. Thanks for taking the time to reproduce. If you have a moment and can outline here in words the basic paradigm you think was triggering it, I might be able to think through a possible reproducer.

nDimensionalSpace commented 11 months ago

@markcmiller86: Hmmm . . . Ok, everything I imagine about this:

1. The version of Silo I was using is based on our HDF5 build, which, AFAIR, has a few special build flags to improve performance.
2. The problem I was running was for debugging, so I was writing out Silo states every time step. But the time steps were pretty fast, so if there was lag in the file system, maybe the previous plot state was not closing fast enough.
3. It is also the case that certain processes had no or almost no mesh on them, while some processes were not huge, but large enough. So, same thing: if the previous process was not done writing a file before the next process started trying to do stuff, maybe that was causing issues.
4. Maybe (2) and (3) were combining in some especially pernicious way.
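If transient file system lag like (2)/(3) is the culprit, one common mitigation is a retry wrapper around the failing operation. A self-contained sketch (everything here is hypothetical: `flaky_mkdir` is a stand-in for a call like `DBMkDir` that fails transiently, not Silo API):

```c
#include <stdio.h>

/* Hypothetical stand-in for an I/O call that can fail transiently
 * under file system lag: fails the first two attempts, then succeeds. */
static int flaky_calls = 0;
static int flaky_mkdir(const char *name)
{
    (void)name;
    return (++flaky_calls < 3) ? -1 : 0;
}

/* Retry wrapper: reattempt a transient failure a few times before
 * giving up, a typical mitigation for intermittent Lustre hiccups. */
static int mkdir_with_retry(int (*op)(const char *), const char *name,
                            int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (op(name) == 0)
            return 0;
        fprintf(stderr, "attempt %d for \"%s\" failed, retrying\n",
                i + 1, name);
        /* a real version might sleep or back off here */
    }
    return -1;
}
```

A real version would also want to distinguish permanent errors (bad path, permissions) from transient ones before retrying.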

All of those things seem vaguely like grasping at straws, but that is all I have.