Force gradient current constraint

abaillod commented 4 years ago

Issue #103 has been solved and I am now confident that the implementation of the force gradient in the case of Lconstraint=3 is correct. We can thus now start testing the branch ForceGradient_CurrentConstraint on other machines, and hopefully merge soon!

I designed a few tests:

First, try to compile and run the file _/ci/G3V08L3FrFG/G3V08L3Fr.sp. This is a multi-volume, free-boundary rotating ellipse. It should compute the force and the force gradient and then stop. You can compare that you get the same force gradient than the one I obtain on my machine, using matlab:

d1 = read_spec_hessian('compare.sp.h5');
d2 = read_spec_hessian('G3V08L3Fr.sp.h5');
delta = max(max(abs(d1-d2)))

You should obtain a delta close to machine precision, whether you run on a single or multiple CPUs.

Second, check that the force gradient in the case of Lconstraint=1 remain unchanged. I placed in the directory _/ci/G2V32L1FiFG/ a file compare.sp.h5 and its force gradient .compare.GF.ma, obtained with the master branch. Run the case G2V32L1Fi.sp and check that you get the same force gradient, following the same steps as above. On my machine, I get delta~1E-11
Last, check the convergence (if you have the time, this is not necessary). First, compile SPEC in debug mode. Then, into the directory InputFiles/Verification/currentconstraint/ForceGradient/, you can find three directories:
- _SlabFreeBoundComparison, which is the case of a free boundary slab
- _ScrewPinchComparison, which is the case of a fixed boundary screw pinch
- _TorusComparison, which is again the case of a free boundary rotating ellipse

in any of those directories, you can run the command

matlab -nodesktop < ConvergenceStudy.m

and a convergence plot should be generated. In slab, you observe a convergence till machine precision, in the screw pinch case a convergence up to 1E-14 and in the toroidal case up to 1E-12.

I think that if the two first points work without problem at ANU, IPP, SPC and PPPL, we are safe for merging. Please report here any bug, I am happy to solve it!

smiet commented 4 years ago

I have just pushed a change to the branch because the code was not compiling with GCC 10.

The error was a mismatch between calls to dgesvx (LAPACK routine) in line 679 and line 693. berr and ferr were passed as integers in the first call, whereas the type specification requests a variable sized array.

Previously GCC allowed this loosey-goosey approach, but in 10 they started strictly enforcing and not allowing an integer to substitute for a length 1 array.

The change is in line 117 which now reads: REAL :: Rdgesvx(1:NN), Cdgesvx(1:NN), work4(1:4*NN), rcond, ferr(1), berr(1), ferr2(1:2), berr2(1:2)

i.e. ferr and berr are lenght 1 arrays. This should not break anything else, and adheres better to the FORTRAN specification.

Just out of curiosity, in that line ferr2 and berr2 are initialized as lenght 2 arrays, but passed as ferr2(1:MM). Is MM equal to 2, or are we passing memory beyond that allocated for the variable, and possibly overwriting other memory with this forward error array?

abaillod commented 4 years ago

Thank you @smiet for catching that. However the automatic testing are not successful anymore - any idea what caused that?

Regarding ferr2 and berr2, you see at line 686 MM=2, so when using ferr2(1:MM) and berr2(1:MM) at lines 695 it doesn't access memory outside what has been allocated.

smiet commented 4 years ago

The automatic testing has not failed, but has stopped running for some reason. They are upgrading the cluster, and therefore possibly the runner has stopped working. I will try to get this fixed ASAP

I am trying to replicate the tests, but I am using the SPEC matlab for the first time. and am having issues. Seems many routines rely on having read_hdf5.m in the path, but that file is nowhere in the repository.

(Is it in a repository maintained by S. Lazerson?) read_spec.m seems to work similarly. I get a delta of 8*10E-11, not exactly machine precision, but also not huge.

abaillod commented 4 years ago

Sorry, my bad about the matlab routines. _read_spechessian was outdated - I committed a new version and it should work now.

smiet commented 4 years ago

The first two tests are working well for me, and I get the same force gradient!

Unfortunately I am running into one more issue, when I the last convergence study that you mention, I am so sorry!

Slab works well, but when I run the screwpinch and torus cases, I get fatal errors in running dspec! I guess we haven't been testing dspec as well. The error does not pop up if I run the .sp file with xspec

The toroidal case fails on a write statement in dforce: (what on EARTH is going on?) Torus_dspec.log Torus_matlab.log line 767 in dforce is where it goes wrong

Whereas the screwpinch tries to access an element outside of the array dbdx2 in dfp200: Screwpinch_matlab.log

arunav2123 commented 4 years ago

@smiet Specifically comment out these lines

write(ounit,1345) im(ii), in(ii), hessian(ii,:) write(ounit,1346) im(ii), in(ii), finitediff_hessian(ii,:) 1345 format("dforce: ; (",i4,",",i4,") ; Hessian = ",641F16.10 " ;") 1346 format("dforce: ; (",i4,",",i4,") ; Finite differences = ",641F16.10 " ;")

These are just for screen outputs. Outputs are saved in a file for comparison.

arunav2123 commented 4 years ago

@smiet Also, to solve screwpinch_matlab.log problem ,please replace line 3603 in dfp200_m.f90 or 987 in dfp200.F90 by

if (vvol.lt.Mvol-1) then ; dBdx2(vvol+1) = Bt00(vvol+1, 1, -1) endif

Let me know, if this work .

zhisong commented 4 years ago

@smiet @zhucaoxiang Can you resolve the conflict in read_spec.py?

I don't know if you have kept a clean version of it.

zhucaoxiang commented 4 years ago

@zhisong We can easily revert it. But I think we need probably a re-organize and clean. Right now, there are too many scripts that are repetitive. I will find a time to discuss with @smiet .

zhisong commented 4 years ago

@zhucaoxiang Sure. Let me know as well.

smiet commented 4 years ago

@arunav2123 We should probably comment out those lines in the files that we will merge as well. Though I cannot understand why those lines would not throw exceptions when compiled with on your systems.

This now works for the slab and the screw pinch: Convergence_Slab.pdf Convergence_ScrewPinch.pdf

Though the torus is still giving me trouble. It exits on a FATAL, but I believe that is intended? Nevertheless, though it does seem to run to it's intended completion, it does not converge. I am attaching the convergence plot as well as the generated files for Run_4, as a sanity check. Convergence_Torus.pdf Run_4.Lcheck6_output.FiniteDiff.txt Run_4.Lcheck6_output.txt

Is the right ConvergenceStudy.m file included in the repo? Because the one I am running seems to test all finitediff's against the analytical output of run 1? l19-22:

    fname_out = ['Run_', num2str(ii) '.Lcheck6_output.FiniteDiff.txt'];
    FG_FiniteDiff = importdata(fname_out);

    FG_analytical = importdata('Run_1.Lcheck6_output.txt');

    diff_abs = abs(FG_analytical - FG_FiniteDiff);

    max_DeltaRel(ii) = max(max(diff_abs)) / max(max(abs(FG_analytical)));

smiet commented 4 years ago

@zhisong I do not see what conflict you are referring to:

@smiet @zhucaoxiang Can you resolve the conflict in read_spec.py?

I don't know if you have kept a clean version of it.

This pull request does not have a conflict with read_spec.py as far as I can see? Am I looking in the wrong place?

arunav2123 commented 4 years ago

Torus_FG_convergence.pdf See the actual Torus_FG_convergence_plot in attached pdf file which i got. @smiet not sure what fishy in your case. May be @abaillod could suggest something .

zhisong commented 4 years ago

@smiet Sorry. I mean compare_spec.py. There are multiple of them.

abaillod commented 4 years ago

@smiet Thank you for testing! I am a bit puzzled by your convergence in the case of the rotating ellipse. I checked the difference between the files you attached and the ones I obtained and the maximum absolute difference is about 1E-10.

So that means that you actually get the right force gradients, but MATLAB is screwing things up and plotting something wrong. Two things we could try:

What is your MATLAB version ? Mine is

>> version

ans =

9.1.0.441655 (R2016b)

Try to import both SPEC force gradient and the finite difference estimate with importdata, and output manually the maximum relative difference.

d1 = importdata('Run_1.Lcheck6_output.txt');
d2 = importdata('Run_4.Lcheck6_output.FiniteDiff.txt');
out = max(max(abs(d1-d2))) / max(max(abs(d1)));

I personally get 1.3622E-08. If you get ~7E-04, my guess is that MATLAB has trouble reading the data.

smiet commented 4 years ago

This is what I get:

>> version

ans =

    '9.8.0.1396136 (R2020a) Update 3'

>> d1 = importdata('Run_1.Lcheck6_output.txt');
d2 = importdata('Run_4.Lcheck6_output.FiniteDiff.txt');
out = max(max(abs(d1-d2))) / max(max(abs(d1)))

out =

   6.7340e-04

I am not sure this is matlab though. When I open the files in a text editor, I get the same numbers as when I print out d1 and d2 in matlab.

Maybe there is something fishy going on in the write? My system was throwing an exception for the screen output, saying something was of the wrong type.

smiet commented 4 years ago

I have found part of the as to why my code crashed previously: FORTRAN's formatted write.

the statement

            write(ounit,1345) myid, im(ii), in(ii), hessian(ii,:)
...
1345       format("dforce: myid=",i3," ; (",i4,",",i4," ; Hessian            = ",64f16.10 "   ;")

was throwing an exception because hessian is longer than 64 (which this write expects). Then (apparently under my compiler) it wraps around and crams the 65'th element of Hessian into the 'i3' write for myid.

My Hessian is length 105. This seems to be correct though from what NGdof should be (OMG that calculation took longer than I expected...)

Is NGdof 105 in your case as well, or is this related to the issue I am having? Could someone share a Lcheck txt file? Also, @abaillod, we could plan a zoom/call to try to work through this last issues and we can get this merger over with!

smiet commented 4 years ago

I have pushed the version with the erroring writes commented out. I do not understand though how this code can be running fine for you however!

Even when I mdoify the print statement, the do loop goes

        do ii=1, NGdof
           write(ounit,1345) myid, im(ii), in(ii), hessian(ii,:)
            write(10   ,1347) hessian(ii,:)
        enddo

NGDOF is clearly larger than mn, so your systems should be accessing im(ii) and in(ii) out-of-bounds too? Am I crazy?

zhucaoxiang commented 4 years ago

I haven't followed the entire conversation. Personal suggestion is to run dspec with more careful checks.

zhisong commented 4 years ago

@abaillod I notice that you have changed the coordinates in coords.f90 for the cylindrical geometry related to #94, but you have not updated that in styxyz.f90. Could you update that as well?

abaillod commented 4 years ago

@smiet, are you sure the file you attached to your message are really the files that are read by Matlab? And that they correspond to the toroidal case? If yes, please try

The relative difference between Run_4.Lcheck6_output.txt and Run_1.Lcheck6_output.FiniteDiff.txt. I get ~1E-8
The relative difference between Run_1.Lcheck6_output.txt and Run_4.Lcheck6_output.txt. If not close to machine precision, please send me the file Run_1.Lcheck6_output.txt so I can have a look...

Otherwise, your matrix size is fine, I also have a 105x105 matrix.

@zhisong ok, I updated stzxyz.f90. Please double check what I did in the last commit 0131898!

Thanks!

zhisong commented 4 years ago

@abaillod I just realize your cylindrical coordinate in coords.f90 is not consistent with your definition of the basis functions. There is a missing power of half. I suggest you revert both coords.f90 and stzxyz.f90 to the master version. I have them properly implemented and compared to the analytical results in the Zernike branch. They will be fixed as soon as I merge the branch.

abaillod commented 4 years ago

@zhisong ok, done.

smiet commented 4 years ago

It is working now! Convergence_Torus

Cause: The error in formatted output caused the write to crash after the first iteration of the loop. This caused a bad file 'Run1.Lcheck6_output.txt` to be created, but not closed. This incorrectly formatted file was left in the folder, and subsequently messed up the comparison.

Why this file was not overwritten, I do not understand, but the problem was caused by this incorrect comparison.

I am still seeing some issues with the Intel compiler (2017) in dspec that I need to have a look at before approving:

xspech :       0.03 : myid=  0 ; calling hesian ; see .ext.hessian.myid ;
forrtl: severe (194): Run-Time Check Failure. The variable 'get_inverse_beltrami_matrices_$CPUO' is being used in 'dfp200_m.F90(2668,4)' without being defined
Image              PC                Routine            Line        Source
dspec              00000000009EE970  get_inverse_beltr        2668  dfp200_m.F90
dspec              00000000009C32C6  dfp200_                  1436  dfp200_m.F90
dspec              0000000000AB9B9C  dforce_                  1034  dforce_m.F90
dspec              0000000000BD9A04  hesian_                   531  hesian_m.F90
dspec              0000000000CE172F  MAIN__                    831  xspech_m.F90
dspec              0000000000419D1E  Unknown               Unknown  Unknown
libc-2.12.so       00007F0E415FBD20  __libc_start_main     Unknown  Unknown
dspec              0000000000419C29  Unknown               Unknown  Unknown
../../../../../dspec ScrewPinch_Nvol3.sp: Signal 66

zhisong commented 4 years ago

@abaillod Another minor problem I've found today:

rpol and rtor in the physics namelist are not mirrored into the hdf5 output.

jonathanschilling commented 4 years ago

The first test works on the draco cluster at IPP. However, I get a difference of 8.7056e-11. This is not exactly machine precision, but I think it is good enough...?

The second test works as well and I get delta = 3.6948e-11.

Both tests were done with 8 MPI tasks on one node. I used the following modules: 1) intel/18.0.5 2) impi/2018.4 3) mkl/2018.4 4) git/2.26 5) fftw-mpi/3.3.8 6) hdf5-serial/1.10.6 7) matlab/R2020a.

The convergence study in slab geometry works as well, I think: Convergence_Slab

The screw pinch convergence study also looks ok; however, I am pretty certain that the x axis is Lrad rather than log10(Lrad) ;-) Convergence_ScrewPinch

The torus case also passes: Convergence_Torus

All the named tests have passed well, so I approve this PR :-) Great work, @abaillod !

zhisong commented 4 years ago

I am going to hit this button since I am desperately waiting for this merge so I can start the pull request for the Zernike branch.

@abaillod forgive me for not letting you do it yourself.

abaillod commented 4 years ago

I am very happy to see that this branch has been merged! @zhisong I am totally fine with you merging the branch, I understand that you needed this merge to continue working on the Zernike branch.

Thank you all for the help and the testing!

PrincetonUniversity / SPEC

Force gradient current constraint #105