TinkerTools / tinker9

Tinker9: Next Generation of Tinker with GPU Support

Odd Spatial2 requested way too many elements #206

Closed BJWiley233 closed 1 year ago

BJWiley233 commented 2 years ago

Hi Zhi,

I am running bar9 on two trajectories, each with 5000 frames. BAR finishes with "Completed 5000 Coordinate Frames", but then I get this error, which I think we talked about before:

Terminating with uncaught exception :  An internal array in Spatial2 requested 9034673 elements, but only 204720 (48*4265) were allocated. Please increase Spatial::LSTCAP (current value 48) so as to make Spatial::LSTCAP*4265 >= 9034673.
  at /home/tinker9/src/cu/spatial.cu:693

I am using tinker9 from this image: tinkertools/tinker9:cuda10.1-20220606-09e1bfcc
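
For reference, the arithmetic in the error message can be checked with a short sketch. It only reuses the numbers printed in the message; the function name `min_lstcap` is hypothetical, and the thread does not say what the multiplier 4265 stands for, so it is treated here as an opaque count.

```python
def min_lstcap(requested: int, multiplier: int) -> int:
    """Smallest integer cap such that cap * multiplier >= requested.

    `multiplier` is the 4265 from the error message; what it counts is
    not stated in the thread, so it is treated as an opaque number here.
    """
    return (requested + multiplier - 1) // multiplier  # integer ceiling division


requested = 9034673   # elements requested, from the error message
multiplier = 4265     # from the error message
current_cap = 48      # current Spatial::LSTCAP, from the error message

allocated = current_cap * multiplier
needed_cap = min_lstcap(requested, multiplier)

print(f"allocated: {allocated}")                   # 204720, matches the message
print(f"LSTCAP needed: {needed_cap}")              # 2119
print(f"ratio: {needed_cap / current_cap:.1f}x")   # roughly 44x the current value
```

A required value about 44 times the current one suggests that raising Spatial::LSTCAP is not a practical workaround here, and that the input handed to the neighbor-list code is what needs checking.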

zhi-wang commented 2 years ago

We are looking into this. This usually happens when a frame of the trajectory is broken. For the problems we saw recently, we believe the simulation itself was fine, but on rare occasions the frame saved to the trajectory was broken for some reason. Until we catch this bug, I don't have a better suggestion than ignoring or removing the broken frame from the arc file.
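
One way to look for a broken frame is to scan the arc file and flag frames with unreadable or implausible coordinates. The following is a minimal sketch, not part of Tinker9. It assumes the usual Tinker text layout (a header line starting with the atom count, an optional box line of six numbers, then one line per atom with x, y, z in columns 3 to 5). The function name `find_suspicious_frames` and the `max_abs_coord` threshold are hypothetical choices made for illustration.

```python
import math


def _is_box_line(line):
    """Guess whether a line is a periodic-box line: exactly six numeric fields."""
    tokens = line.split()
    if len(tokens) != 6:
        return False
    try:
        [float(t) for t in tokens]
    except ValueError:
        return False
    return True


def find_suspicious_frames(lines, max_abs_coord=1.0e4):
    """Return a list of (frame_number, reason) for frames that look broken.

    Frame numbers start at 1. `max_abs_coord` is an arbitrary sanity bound
    on |x|, |y|, |z| and should be adjusted to the system being studied.
    """
    suspicious = []
    i, frame = 0, 0
    while i < len(lines):
        if not lines[i].strip():        # skip blank lines between frames
            i += 1
            continue
        frame += 1
        try:
            n_atoms = int(lines[i].split()[0])
        except ValueError:
            # Without a readable atom count the next frame cannot be located.
            suspicious.append((frame, "unreadable header; scan stopped"))
            break
        i += 1
        if i < len(lines) and _is_box_line(lines[i]):
            i += 1                      # skip the optional box line
        atom_lines = lines[i:i + n_atoms]
        i += n_atoms

        reason = None
        if len(atom_lines) < n_atoms:
            reason = "truncated frame"
        for atom in atom_lines:
            if reason:
                break
            fields = atom.split()
            try:
                xyz = [float(v) for v in fields[2:5]]
            except ValueError:
                reason = "unparseable coordinate"
                continue
            if len(xyz) < 3:
                reason = "missing coordinate"
            elif not all(math.isfinite(v) and abs(v) <= max_abs_coord for v in xyz):
                reason = "non-finite or out-of-range coordinate"
        if reason:
            suspicious.append((frame, reason))
    return suspicious
```

To use it, read the file into a list of lines, for example `find_suspicious_frames(open("traj.arc").read().splitlines())`, where `traj.arc` is a placeholder file name. A frame that passes these checks may still be broken in other ways, so an empty result does not prove the trajectory is clean.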

BJWiley233 commented 2 years ago

OK. BAR reports progress 100 frames at a time, so if I have 5000 frames and I get this after "Completed 5000 Coordinate Frames", does that mean it's one of the frames in the last 100?

zhi-wang commented 2 years ago

I think that when bar runs, it processes two trajectory files. It looks like the first trajectory is fine, but something is broken in the second one.

zhi-wang commented 2 years ago

This is my first attempt to fix the problem: docker pull tinkertools/tinker9:cuda10.1-20220822-e2585f00 and docker pull tinkertools/tinker9:cuda11.2.2-20220822-e2585f00

As you may have noticed, the corrupted frames don't happen very often. With the limited tests so far, I can't say this issue is definitely fixed, but I did catch something in the code that looked suspicious to me. We will also do more tests in our lab. Thanks again.

BJWiley233 commented 2 years ago

Yes, I think I may have been getting read/write issues on my HPC with respect to the GPUs, so corrupted frames could be my issue. I will test with the new image, but I think I have to rerun the end of some simulations.