SCOREC / core

parallel finite element unstructured meshes

Solution renumbering issue #207

Closed KennethEJansen closed 5 years ago

KennethEJansen commented 5 years ago

The A1-allJets case processed with Chef on Theta has somehow developed a disconnect between the coordinates written to the restart and the solution. ParaView of the solution that is attached to be transferred (input to Chef) and of the geombc.dat written during the previous chef run (2k-16k) looks fine, but even with a splitFactor of 1 (no partitioning), the output of Chef into the 16k-procs_case directory appears to be renumbered on each part. Neither Chef nor ParaView complains about non-matching files, so the theory is that some renumbering is happening either:
1) between the time the mds file is loaded and the solution is read to be attached, or
2) between the time the geombc.dat and restart files are written, or
3) on the previous run, between the time the geombc.dat file is written and the mds file is written.

Currently we cannot see this problem until PHASTA has been run, in cases where there is no solution transfer (e.g., adapted cases like this). To better detect it and find the source, CWS and KEJ agreed it was time to have PHASTA be capable of writing its coordinates into the restart field (though I suppose this still requires a PHASTA run). It probably makes sense to have Chef write them into the restart as well, so that we can detect the problem using ParaView before doing PHASTA runs.

KEJ Task 1) Create a version of PHASTA that writes coordinates (DONE).
KEJ Task 2) Test on a small problem to provide input for CWS chef developments (DONE).
CWS: Ball is in your court to make a version of chef that checks whether the field "coordinates" is in the restart file. If it is, after it is attached, check whether that field agrees with the mds view of coordinates (a rough sketch of such a check follows below).
CWS: If we get past that point, we will probably want to put assert checks at any point that we think might catch the 3 failure modes listed above.
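A minimal sketch of the kind of check described above, using the SCOREC/core apf API. The attached field name "coordinates", the tolerance, and the function name are assumptions for illustration, not the actual chef change.

// Sketch: after restart fields are attached to the apf mesh, compare an
// attached "coordinates" field against the mds view of the vertex positions.
// Field name and tolerance are assumptions for illustration.
#include <apf.h>
#include <apfMesh2.h>
#include <PCU.h>
#include <cmath>
#include <cstdio>
#include <cassert>

static void checkAttachedCoordinates(apf::Mesh2* m, double tol = 1e-12) {
  apf::Field* f = m->findField("coordinates");
  if (!f) return; // older restarts won't carry the field
  apf::MeshIterator* it = m->begin(0);
  apf::MeshEntity* v;
  long bad = 0;
  while ((v = m->iterate(it))) {
    apf::Vector3 x;
    m->getPoint(v, 0, x);           // mds view of the vertex position
    double c[3];
    apf::getComponents(f, v, 0, c); // value carried in the restart field
    for (int i = 0; i < 3; ++i)
      if (std::fabs(x[i] - c[i]) > tol)
        ++bad;
  }
  m->end(it);
  long total = PCU_Add_Long(bad);
  if (!PCU_Comm_Self() && total)
    fprintf(stderr, "coordinate mismatch on %ld vertex components\n", total);
  assert(total == 0);
}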

Details of KEJ devs for Task 1) and 2).

I just made a new branch of PHASTA (https://github.com/PHASTA/phasta-next.git) called CoordsInRestrart. As the name implies, it writes the coordinates into the restart file in our usual phastaIO way, viz. coordinates : < 30025 > 1251 3 4 is the header in posix and the field data follow. I have tested this with phastaChefTests/incompressible, which is our small 4-part case that I think Cameron has in git somewhere. That would probably be a good case for debugging the development in Chef.
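For reference, a small sketch of reading one such posix header line. Interpreting the trailing integers as (nshg, ndof, timestep) and the bracketed value as the byte count of the binary block that follows is an inference from this example (1251 * 3 doubles = 30024 bytes, and the header says 30025, suggesting one extra byte for a trailing newline); it is not a formal spec of phastaIO.

// Sketch of parsing a posix phastaIO header line such as
//   coordinates : < 30025 > 1251 3 4
// The meaning of the integers is inferred from the example above.
#include <cstdio>

int main() {
  const char* line = "coordinates : < 30025 > 1251 3 4";
  char name[64];
  long bytes, nshg, ndof, step;
  if (sscanf(line, "%63s : < %ld > %ld %ld %ld",
             name, &bytes, &nshg, &ndof, &step) == 5) {
    // 1251 * 3 * sizeof(double) = 30024; the header says 30025, so the
    // count appears to include one extra (newline) byte.
    long expected = nshg * ndof * (long)sizeof(double) + 1;
    printf("%s: header bytes %ld, expected %ld\n", name, bytes, expected);
  }
  return 0;
}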

cwsmith commented 5 years ago

@KennethEJansen Can you point me at the test restart files, and corresponding mesh, that contains the coordinates field?

KennethEJansen commented 5 years ago

Colorado Viz nodes: /projects/tools/git-phasta/phastaChefTests/incompressible/4-procs_case

I think this is the case you made as a regression test for the incompressible code.

cwsmith commented 5 years ago

Thank you. It looks like time step 4 has the field:

/projects/tools/git-phasta/phastaChefTests/incompressible/4-procs_case $ grep -a ": < " restart.*.1 | grep coord
restart.4.1:coordinates : < 30025 > 1251 3 4

cwsmith commented 5 years ago

@KennethEJansen I want to confirm my understanding of the sequence of chef and solver steps involved:

Is that correct?

If so, this indicates that solv_16Ki_t128_restart + chef_16Ki_t0_mds are not consistent with each other.
The 'good' solution seen in ParaView using solv_16Ki_t128_restart + chef_16Ki_t0_geombc only indicates that the part boundary connectivity information is correct, and boundary conditions are applied as specified. Visualizing time step zero, chef_16Ki_t0_[restart,geombc], in ParaView will simply show a constant valued field and give no indication of an underlying reordering problem. So, we need to determine if chef_16Ki_t0_[geombc,restart,mds] are consistent with each other.

To sanity check, and reproduce, the reordering problem seen after the 16Ki to 128Ki split, the following was done:

With the coordinate comparison modification to chef we could confirm that chef_16Ki_t0_[restart,geombc,mds] are consistent with each other.

If all the parts in the mesh are reordered we can run the comparison tool on the geombc, restart, and mds files of a single part and avoid queueing time.
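To illustrate the single-part idea, here is a hypothetical serial helper that pulls the "coordinates" block out of one restart.<step>.<part> file so it can be diffed against the mds coordinates of that part without queueing a parallel job. It only covers the restart side, and the binary layout assumed here (doubles stored column-major, all x then all y then all z, matching PHASTA's Fortran arrays) is an assumption, not confirmed against phastaIO.

// Hypothetical single-part helper: scan a posix restart file for the
// "coordinates" header and read the binary block that follows.
// File layout and data ordering are assumptions for illustration.
#include <cstdio>
#include <vector>

static bool readCoordinates(const char* path, std::vector<double>& xyz,
                            long& nshg, long& ndof) {
  FILE* f = fopen(path, "rb");
  if (!f) return false;
  char line[256];
  while (fgets(line, sizeof line, f)) {
    long bytes, step;
    if (sscanf(line, "coordinates : < %ld > %ld %ld %ld",
               &bytes, &nshg, &ndof, &step) == 4) {
      xyz.resize(nshg * ndof);
      bool ok = fread(xyz.data(), sizeof(double), xyz.size(), f) == xyz.size();
      fclose(f);
      return ok;
    }
  }
  fclose(f);
  return false;
}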

KennethEJansen commented 5 years ago

I agree with all of the above up to the last line. AFAIK, all parts are reordered but I don't know what is meant by "run the comparison tool...". I know you had referred to a load part tool but it doesn't load geombc (or even restarts).

That said, it just occurred to me that we missed an obvious, easy check process: 1) create a spatially varying IC in the smd file, 2) run Chef to produce a spatially varying t0 solution, 3) run Chef AGAIN with the t0 solution and solution migration turned on. This should load the solution and scramble it in the same way as a PHASTA t_N solution would.

KennethEJansen commented 5 years ago

On the A1 case, I made the initial pressure = 0.001*$x (from geom.smd):

40640 16 initial pressure
40641 0
40642 0 0 1
40643 0
40644 0
40645 0 3 0
40646 2 -1
40647 0
40648 8 0.001*$x

1) I ran Chef using these attributes to go from A1 512->2k and visualized the linear pressure in ParaView.
2) Chef 2k->16k from the attached 2k solution (message indicates attributes are ignored) shows linear pressure in PV.
3) Chef 16k->16k from the attached 16k solution (same message) is garbled.
Logically this means that the mds mesh produced in step 2) does not match the numbering used at the time the solution from step 2) was written, since, when that solution is attached to it in step 3) and the output structs are then created and written to the procs_case directory, they are garbled according to PV.

KennethEJansen commented 5 years ago

Cameron had hoped that this could be repeated on a smaller case. I tested it on our TwoCube model. I made it as close to the failing case as I could: e.g., I went back to Parasolid and made a new smd model and mesh, and gave it the same linear pressure, at /projects/tools/git-phasta/phastaChefTests/incompressible2-4-8/chef. I parted it to 2 and got a solution with smd atts that had a linear pressure, parted from 2 to 4 with the restart from atts on 2 parts attached (and solution transfer requested), then parted from 4 to 8 with 4's solution attached.

PV shows the same linear field for all three. Confirmed in the log of the second and third chef runs: All attribute-based initial conditions, if any, are ignored due to request for SolutionMigration

To be triple sure, I also modified the attributes in the 8 part case to make pressure initialized to 0 (just in case it was still getting it from the attributes). It produced the same linear pressure field. So, we cannot repeat the problem on this small case.

I am fairly sure that computing pressure from the mds coordinate (e.g., p_mds = 0.001*coord[0]) will not agree with the attached field for pressure right after loading of the 16k part case, which will then confirm that what is written into the mds file for 16k parts is numbered differently than at the time the phasta data structures were created (when they are written is not relevant).
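A rough sketch of that check, again with the apf API: recompute p_mds = 0.001 * x from the mds coordinates right after the restart is attached and count disagreements with the attached pressure. The attached field name "solution" and pressure being component 0 of that field are assumptions for illustration.

// Sketch: compare attached pressure against 0.001 * x computed from the
// mds coordinates.  Field name and component layout are assumptions.
#include <apf.h>
#include <apfMesh2.h>
#include <cmath>
#include <vector>

static long countPressureMismatches(apf::Mesh2* m, double tol = 1e-10) {
  apf::Field* sol = m->findField("solution");
  if (!sol) return -1;
  std::vector<double> vals(apf::countComponents(sol));
  apf::MeshIterator* it = m->begin(0);
  apf::MeshEntity* v;
  long bad = 0;
  while ((v = m->iterate(it))) {
    apf::Vector3 x;
    m->getPoint(v, 0, x);
    apf::getComponents(sol, v, 0, vals.data());
    double p_mds = 0.001 * x[0];      // what the IC attribute prescribed
    if (std::fabs(vals[0] - p_mds) > tol)
      ++bad;                          // attached value disagrees with mds x
  }
  m->end(it);
  return bad;
}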

I don't know enough about the guts of chef to know how to determine what is triggering a renumbering of the mds database between those two operations, OR whether something is scrambling the phasta data structures at the time of creation or sometime before they are written. Renumbering seems the most likely of the two since the maximums and minimums on a part seem to be preserved. I further have no clue why it does not happen until we reach 16k AND apparently only for this A1 case on Theta.

Jun, PLEASE REVIEW AND CORRECT THIS. I dug a bit and only found A0KEJ went up to 16k processors?

Another difference in this case is the fact that it has SimModeler modified geometry. Just thinking out loud about how a tool chain that has been used for many problems has broken.

KennethEJansen commented 5 years ago

Cameron and I found the problem. Key lessons:
1) ALWAYS read your error file. It shows when renumbering happens. Better yet, ALWAYS pipe -e and -o into the same file so that you can see in the timeline when events like renumbering happen.
2) Renumbering is USUALLY triggered after a partitioning and BEFORE the generation of the PHASTA data structures. However, it was not in this case. The unique factor about this case was that, because PARMA was crashing, we set preAdaptBalanceMethod none in adapt.inp, and this caused renumbering to be delayed until AFTER the structs for PHASTA were created. This happened in the 2k-16k case and not the 512-2k case (which reordered because of tetrahedronization), making the mds mesh for 16k numbered differently than the procs_case directory. The problem repeated for all subsequent partitions for the same reason. Cameron fixed it (and committed the fix) by forcing a renumbering ANY time splitFactor > 1.
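A minimal sketch of the idea behind that fix, not the actual commit: ensure the mds mesh is renumbered before the PHASTA structs are built whenever a split occurred, even if the pre-adapt balancer was disabled. The function and variable names are assumptions; whether the committed change uses apf::reorderMdsMesh exactly this way was not confirmed here.

// Sketch: force a renumbering any time splitFactor > 1, so the mds mesh
// and the output structs agree.  Names are assumptions for illustration.
#include <apfMDS.h>
#include <apfMesh2.h>

apf::Mesh2* ensureReordered(apf::Mesh2* m, int splitFactor) {
  if (splitFactor > 1)
    m = apf::reorderMdsMesh(m);  // renumber the mds mesh in place of the old one
  return m;
}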

Going forward for the A1 case, we will have to create our own map to get the numbering back to the mds ordering, or back to a different part count's numbering.

Going forward for other cases, we need to be vigilant about point 1) above and probably check our results better BEFORE running:
1) visualize solutions that are transferred before running them (though this won't always catch the problem, since geombc and restarts written by chef into a procs_case directory will be consistent no matter what, and you will only find the problem if the solution is read, attached, and transferred by a chef that made the mistake in the previous partitioning);
2) check the map back to the prior partitioning (Jun's good suggestion);
3) introduce a spatially varying Temperature field (it can be very small, like T_const + 0.001*$x), use solution transfer rather than re-evaluating it each time, and check this in PV.

cwsmith commented 5 years ago

I'm closing this issue. Please reopen if needed.

KennethEJansen commented 5 years ago

This is fine. I have verified that 128k is not scrambled.

