E3SM-Project / transport_se

Atmosphere transport mini app


lim8 produces diff warnings when using COLUMN_OMP and -cc numa_node #12

Closed halldm2000 closed 9 years ago

halldm2000 commented 9 years ago

Lots of difference values are printed in prim_advec after lim8 when COLUMN_OMP=true, the number of threads is > 1, and the -cc numa_node flag is used on Edison. I have examined the DCMIP solutions and found them to be damaged in this case, so the errors are real, not spurious. These errors might be related to issue #11.

mt5555 commented 9 years ago

The code printing these differences is testing mass conservation, so yes, this issue is identical to issue #11.

I'm not sure who put this check and these print statements in the code - we should be sure to remove them once this issue is resolved.

halldm2000 commented 9 years ago

I don't think they are identical. They are both dealing with mass conservation, but unlike the general conservation problems, these errors arise only with -cc numa_node. The #11 errors arise even without numa_node.

mt5555 commented 9 years ago

Sorry about my confusion! I haven't been able to keep up. Would you mind documenting exactly what you mean by these two errors?

#12: are these the print statements triggered by the code that checks the mass before and after limiter8?

#11: I was thinking it was the same error, but we could also interpret #11 as addressing the error fixed by the HOMME commits posted there. I don't actually know what that error is.

halldm2000 commented 9 years ago

To clarify, issue #12 was created to solve the problem that the mass before and after lim8 has changed, where diff = mass2 - mass1. It produces endless output like this:

ie,k= 5 1 diff= 1.144409179687500E-005 sums= 19838327949.5933 19838327949.5932
ie,k= 5 3 diff= 1.144409179687500E-005

These difference messages only appear in the case described above, when -cc numa_node is used.

mt5555 commented 9 years ago

In this context, a diff of order 1e-5 means a change in about the 15th significant digit. Look how well the two sums agree. So I suspect this is actually OK. If the errors are all of this level, we should remove this check, as well as the associated tmp1 and tmp2 arrays.
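The size of the printed diff relative to the sums can be checked directly. The following is a minimal Python sketch, not part of the model code, that uses only the two numbers quoted in the output above; the variable names are made up for illustration, and `math.ulp` requires Python 3.9 or later.

```python
import math

# Values copied from the output quoted in this thread.
diff = float("1.144409179687500E-005")
mass = 19838327949.5933

ulp = math.ulp(mass)      # spacing between adjacent doubles near `mass`
relative = diff / mass    # size of the reported change relative to the sum

print(f"ulp near mass : {ulp:.6e}")
print(f"diff in ulps  : {diff / ulp:.1f}")
print(f"relative diff : {relative:.2e}")
```

With these inputs the diff comes out as a few units in the last place of a double-precision sum of this magnitude. That only characterises the size of one printed diff; it does not by itself say whether the cause is a change in summation order or something else.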

(conservation errors, if still present, will be detected in our other diagnostics)

mt5555 commented 9 years ago

By the way, this check, and the tmp1 and tmp2 arrays, are not supposed to be part of the model. At some point there must have been a bug, which someone tracked down to the limiter, and added this check during the debug phase. Perhaps the mini-app inherited this from standalone HOMME?

halldm2000 commented 9 years ago

After fixing issue #11, I still see non-conservation when using numa_node + COLUMN_OMP. By commenting out COLUMN_OMP loops in batches, I traced the problem to edge_mod.F90. By commenting those out in batches, I traced the problem to the first OMP loop in edgeVunpack. I replaced the parallel loop over i with a loop over k (to avoid threads working on the same data).
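Why a loop over i can put several threads on the same data, while a loop over k does not, can be sketched with a toy model. The Python below is not taken from edge_mod.F90. It assumes a simplified unpack in which iteration (i, k) accumulates into one point on each of the four edges of an np x np element at level k, and it only records which destination indices each thread would write to. `NP`, `NLEV`, `writes`, `write_sets` and `conflicts` are hypothetical names.

```python
from itertools import combinations

NP = 4      # points per element edge (hypothetical value)
NLEV = 3    # number of vertical levels (hypothetical value)

def writes(i, k):
    """Destination indices touched by iteration (i, k) of the assumed unpack."""
    last = NP - 1
    return {
        (i, 0, k),      # south edge
        (last, i, k),   # east edge
        (i, last, k),   # north edge
        (0, i, k),      # west edge
    }

def write_sets(parallel_over):
    """Group destination indices by the thread that owns each iteration."""
    sets = {}
    for i in range(NP):
        for k in range(NLEV):
            owner = i if parallel_over == "i" else k
            sets.setdefault(owner, set()).update(writes(i, k))
    return sets

def conflicts(sets):
    """Indices written by more than one thread."""
    shared = set()
    for a, b in combinations(sets.values(), 2):
        shared |= a & b
    return shared

over_i = conflicts(write_sets("i"))   # one thread per value of i
over_k = conflicts(write_sets("k"))   # one thread per value of k

print("shared writes when threading over i:", sorted(over_i))
print("shared writes when threading over k:", sorted(over_k))
```

In this toy model the element corners are reached from two different values of i, so threading over i leaves unsynchronised accumulations into the same location, whereas each value of k owns a disjoint slab. Whether the real loop has exactly this index pattern is an assumption here; the thread only states that the change avoids threads working on the same data.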

halldm2000 commented 9 years ago

Tested the fix with run_ne_tests. Mass conservation is restored. Error norms look good. Plots look good.