TinkerTools / tinker

Tinker: Software Tools for Molecular Design
https://dasher.wustl.edu/tinker/

Bug in grid_mpole openmp loop #79

Closed: JoshRackers closed this issue 3 years ago

JoshRackers commented 3 years ago

This is a weird one, but it only shows up when running on more than 1 thread. Here's an example to reproduce.

xyzfile: watersmall.xyz

keyfile:

parameters       ./water03.prm
openmp-threads   20
ewald
pme-grid         6 6 6
neighbor-list
a-axis           18.643
cutoff           7
taper            6

If you print out the grid, qgrid, in grid_mpole, the elements will be different every time. If you set openmp-threads 1, the problem goes away.

This is probably a pretty inconsequential bug. If you set the grid size to something reasonable, like 20x20x20, the error goes away.
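To make the failure mode concrete, here is a minimal standalone sketch of that kind of race; it is not Tinker code, and the program name, index arithmetic, and counts are made up purely for illustration. Many threads scatter increments into a small shared grid with no reduction or atomic protection, so concurrent read-modify-write updates to the same element can be lost:

```fortran
! Minimal sketch (not Tinker's actual code) of the race being described:
! threads scatter contributions into a small shared grid with no
! reduction or atomic protection, so updates to the same element collide.
program qgrid_race_demo
   implicit none
   integer, parameter :: n = 6, natoms = 216, nreps = 2000
   real(8) :: qgrid(n,n,n)
   integer :: i, j, k, m, r

   qgrid = 0.0d0
!$omp parallel do private(i,j,k,r)
   do m = 1, natoms
      do r = 1, nreps
         ! made-up indices; the point is only that many iterations hit
         ! the same grid element from different threads
         i = mod(m+r,n) + 1
         j = mod(2*m+r,n) + 1
         k = mod(3*m+r,n) + 1
         ! unprotected read-modify-write of a shared array element
         qgrid(i,j,k) = qgrid(i,j,k) + 1.0d0
      end do
   end do
!$omp end parallel do

   ! the exact answer is natoms*nreps; with many threads and a 6x6x6
   ! grid the printed sum usually comes out short and varies per run
   write (*,*) 'sum of qgrid =', sum(qgrid), ' expected =', natoms*nreps
end program qgrid_race_demo
```

Running this a few times with OMP_NUM_THREADS set to 20 and then to 1 shows the same pattern as the report: the single-thread total is exact and stable, while the multi-thread total changes from run to run.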

pren commented 3 years ago

Yes, Tinker should exit when the grid size is this much smaller than the box size.


JoshRackers commented 3 years ago

Forgot to mention: I turned the minfft variable down to 4 to disable that check. I also see issues when using a lot of threads for larger box sizes.

jayponder commented 3 years ago

First, Pengyu is correct that you should never use very small grid dimensions, even if the physical size of the periodic box is very small. That's the purpose of the minfft mechanism, which does not allow grid dimensions below 16. But that's not the issue here, I think...

I suspect the problem is in the setup routines in pmestuf.f. That was the very first OpenMP code ever written in Tinker, and was done by Dave Gohara (now at SLU) and me when I was just trying to learn to think "parallel". It contains what I might describe as a "poor man's" spatial decomposition. The mechanism for dividing work between threads is very crude. I've always known it might not stand up to very large numbers of threads, especially if the grid size is small.
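For readers who have not looked at that file, here is a purely illustrative sketch of what such a hand-rolled slab decomposition can look like; it is not a transcription of pmestuf.f, and the program, routine, and variable names are invented. The point is only that when the grid dimension is much smaller than the thread count, most threads end up with empty or single-plane slabs, so any boundary bookkeeping error becomes easy to expose:

```fortran
! Illustrative only, not Tinker's actual decomposition: split the nfft3
! z-planes of the PME grid into contiguous slabs, one slab per thread.
program slab_demo
   implicit none
   integer, parameter :: nfft3 = 6, nthreads = 20
   integer :: kstart(nthreads), kend(nthreads), t

   call slab_ranges (nfft3, nthreads, kstart, kend)
   do t = 1, nthreads
      ! with nfft3 = 6 and 20 threads, threads 7..20 get an empty range
      ! (kend < kstart) and threads 1..6 get a single plane each, so
      ! almost everything rides on the slab-boundary bookkeeping
      write (*,*) 'thread', t, 'planes', kstart(t), 'to', kend(t)
   end do

contains

   subroutine slab_ranges (nz, nth, ks, ke)
      integer, intent(in)  :: nz, nth
      integer, intent(out) :: ks(nth), ke(nth)
      integer :: i, chunk, extra
      chunk = nz / nth                 ! whole planes per thread
      extra = mod(nz, nth)             ! leftovers go to the first threads
      do i = 1, nth
         ks(i) = (i-1)*chunk + min(i-1,extra) + 1
         ke(i) = ks(i) + chunk - 1
         if (i <= extra)  ke(i) = ke(i) + 1
      end do
   end subroutine slab_ranges

end program slab_demo
```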

Thanks, Josh, for coming up with a test case. I'll try to find time to take a look and see if I can fix this. But I suspect it's not a serious real-world problem for typical numbers of threads and grid sizes. And of course, most of Tinker certainly doesn't parallelize well at all across anything like 20 threads.

jayponder commented 3 years ago

So, this is indeed kind of weird. As Josh notes, it seems to only occur for relatively large numbers of threads with relatively small grid dimensions. And the issue occurs for putting partial charges, multipoles or induced dipoles onto the grid. The results are "correct" if you set the #threads to 1 or disable OpenMP at compile time. Of course those results are not accurate for very small grids, but at least they are reproducible and make some sense. Another bizarre observation is that if you put any write statement (it can just write a blank line) immediately following where qgrid is incremented in the innermost loop, then the answers are again "correct".

This certainly looks like an issue with the OpenMP setup of the code (a mistake in private vs. shared, etc.), or perhaps a compiler bug (the GNU compiler's support for OpenMP has never really been very good). I've been through the OpenMP specs around the loops at issue, and I don't see anything wrong. I've also compiled with different optimization levels and with bounds checking, and neither of those has any effect.

I'll keep looking. As above, I think the good news is that it's not a "serious" error for real-world cases.

P.S. I have also fixed, but not yet pushed to GitHub, the "bug" that effectively kept the code from using a grid dimension less than 6. So soon you will be able to use a PME grid of 2x2x2 if you really want to do so, provided you have "minfft" disabled :)

jayponder commented 3 years ago

OK, this is fixed and has been pushed to the master repo. My bad. To be filed under "It's a REDUCTION, stupid!"

Actually, we are probably lucky it only failed for the small grid/large thread count case. But OpenMP reductions seem to be unpredictable: often they "work" (and are probably faster) when you don't declare them, but in some situations they clearly fail without the explicit declaration.
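For what it's worth, here is a hedged sketch of the shape of that kind of fix, using the same toy scatter loop as the sketch earlier in this thread rather than the actual pmestuf.f diff: declaring the accumulated array in a reduction clause gives every thread a private copy, which OpenMP combines once at the end, so the result no longer depends on thread count or grid size.

```fortran
! Hedged sketch, not the actual pmestuf.f change: the toy scatter loop
! again, but with the grid declared in an OpenMP reduction clause.  Each
! thread accumulates into its own private copy of qgrid, and the copies
! are summed once at the end, so the result is deterministic.
program qgrid_reduction_demo
   implicit none
   integer, parameter :: n = 6, natoms = 216, nreps = 2000
   real(8) :: qgrid(n,n,n)
   integer :: i, j, k, m, r

   qgrid = 0.0d0
!$omp parallel do private(i,j,k,r) reduction(+:qgrid)
   do m = 1, natoms
      do r = 1, nreps
         i = mod(m+r,n) + 1
         j = mod(2*m+r,n) + 1
         k = mod(3*m+r,n) + 1
         qgrid(i,j,k) = qgrid(i,j,k) + 1.0d0
      end do
   end do
!$omp end parallel do

   ! now always prints exactly natoms*nreps, regardless of thread count
   write (*,*) 'sum of qgrid =', sum(qgrid), ' expected =', natoms*nreps
end program qgrid_reduction_demo
```

The trade-off is that an array reduction replicates the whole grid in every thread and adds a combine step at the end, which is presumably why omitting it can look faster when the race happens not to bite.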

JoshRackers commented 3 years ago

Fantastic! I hadn't had time to go digging for this yet, but I'm glad you did. I figured it must have been a private/shared OpenMP problem. Thanks for fixing this!