Closed abaillod closed 4 years ago
I have just pushed a change to the branch because the code was not compiling with GCC 10.
The error was a mismatch between the calls to dgesvx (LAPACK routine) in lines 679 and 693.
berr and ferr were passed as integers in the first call, whereas the type specification requests a variable-sized array.
Previously GCC allowed this loosey-goosey approach, but GCC 10 started enforcing the interface strictly and no longer allows an integer to substitute for a length-1 array.
The change is in line 117 which now reads:
REAL :: Rdgesvx(1:NN), Cdgesvx(1:NN), work4(1:4*NN), rcond, ferr(1), berr(1), ferr2(1:2), berr2(1:2)
i.e. ferr and berr are length-1 arrays. This should not break anything else, and adheres better to the FORTRAN specification.
Just out of curiosity, in that line ferr2 and berr2 are declared as length-2 arrays, but passed as ferr2(1:MM). Is MM equal to 2, or are we passing memory beyond that allocated for the variable, and possibly overwriting other memory with this forward error array?
Thank you @smiet for catching that. However, the automatic tests are not successful anymore. Any idea what caused that?
Regarding ferr2 and berr2, you can see that line 686 sets MM=2, so using ferr2(1:MM) and berr2(1:MM) at line 695 does not access memory outside what has been allocated.
The automatic testing has not failed, but has stopped running for some reason. They are upgrading the cluster, so the runner has possibly stopped working. I will try to get this fixed ASAP.
I am trying to replicate the tests, but I am using the SPEC MATLAB routines for the first time and am having issues. It seems many routines rely on having read_hdf5.m in the path, but that file is nowhere in the repository.
(Is it in a repository maintained by S. Lazerson?) read_spec.m seems to work similarly. I get a delta of 8*10E-11, not exactly machine precision, but also not huge.
Sorry, my bad about the matlab routines. _read_spechessian was outdated - I committed a new version and it should work now.
The first two tests are working well for me, and I get the same force gradient!
Unfortunately, I am running into one more issue when I run the last convergence study that you mention. I am so sorry!
Slab works well, but when I run the screwpinch and torus cases, I get fatal errors when running dspec! I guess we haven't been testing dspec as well. The error does not pop up if I run the .sp file with xspec.
The toroidal case fails on a write statement in dforce (what on EARTH is going on?): Torus_dspec.log, Torus_matlab.log. Line 767 in dforce is where it goes wrong.
The screwpinch case, on the other hand, tries to access an element outside of the array dbdx2 in dfp200: Screwpinch_matlab.log
@smiet Specifically, comment out these lines:
write(ounit,1345) im(ii), in(ii), hessian(ii,:)
write(ounit,1346) im(ii), in(ii), finitediff_hessian(ii,:)
1345 format("dforce: ; (",i4,",",i4,") ; Hessian = ",641F16.10 " ;")
1346 format("dforce: ; (",i4,",",i4,") ; Finite differences = ",641F16.10 " ;")
These are just for screen outputs. Outputs are saved in a file for comparison.
@smiet Also, to solve the Screwpinch_matlab.log problem, please replace line 3603 in dfp200_m.f90 or line 987 in dfp200.F90 by
if (vvol.lt.Mvol-1) then ; dBdx2(vvol+1) = Bt00(vvol+1, 1, -1)
endif
Let me know if this works.
@smiet @zhucaoxiang Can you resolve the conflict in read_spec.py?
I don't know if you have kept a clean version of it.
@zhisong We can easily revert it. But I think we probably need to reorganize and clean up. Right now, there are too many scripts that are repetitive. I will find a time to discuss with @smiet.
@zhucaoxiang Sure. Let me know as well.
@arunav2123 We should probably comment out those lines in the files that we will merge as well. Though I cannot understand why those lines would not throw exceptions when compiled on your systems.
This now works for the slab and the screw pinch: Convergence_Slab.pdf Convergence_ScrewPinch.pdf
The torus is still giving me trouble, though. It exits on a FATAL, but I believe that is intended? Nevertheless, although it does seem to run to its intended completion, it does not converge. I am attaching the convergence plot as well as the generated files for Run_4, as a sanity check. Convergence_Torus.pdf Run_4.Lcheck6_output.FiniteDiff.txt Run_4.Lcheck6_output.txt
Is the right ConvergenceStudy.m file included in the repo? The one I am running seems to test all the finite-difference results against the analytical output of run 1. Lines 19-22:
fname_out = ['Run_', num2str(ii) '.Lcheck6_output.FiniteDiff.txt'];
FG_FiniteDiff = importdata(fname_out);
FG_analytical = importdata('Run_1.Lcheck6_output.txt');
diff_abs = abs(FG_analytical - FG_FiniteDiff);
max_DeltaRel(ii) = max(max(diff_abs)) / max(max(abs(FG_analytical)));
@zhisong I do not see what conflict you are referring to:
@smiet @zhucaoxiang Can you resolve the conflict in read_spec.py?
I don't know if you have kept a clean version of it.
This pull request does not have a conflict with read_spec.py as far as I can see? Am I looking in the wrong place?
Torus_FG_convergence.pdf See the attached PDF file for the Torus_FG_convergence plot that I actually got. @smiet, I am not sure what is fishy in your case. Maybe @abaillod could suggest something.
@smiet Sorry, I meant compare_spec.py. There are multiple of them.
@smiet Thank you for testing! I am a bit puzzled by your convergence in the case of the rotating ellipse. I checked the difference between the files you attached and the ones I obtained and the maximum absolute difference is about 1E-10.
So that means that you actually get the right force gradients, but MATLAB is screwing things up and plotting something wrong. Two things we could try:
Check which MATLAB version you are running. Mine is:
>> version
ans =
9.1.0.441655 (R2016b)
Read the two files with importdata, and output manually the maximum relative difference:
d1 = importdata('Run_1.Lcheck6_output.txt');
d2 = importdata('Run_4.Lcheck6_output.FiniteDiff.txt');
out = max(max(abs(d1-d2))) / max(max(abs(d1)));
I personally get 1.3622E-08. If you get ~7E-04, my guess is that MATLAB has trouble reading the data.
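As a cross-check that does not depend on MATLAB at all, the same quantity can be computed with a few lines of standard-library Python. This is only a sketch: it assumes the Lcheck6 files are plain whitespace-separated numeric matrices, and the helper names load_matrix and max_rel_diff are made up for illustration, not part of the SPEC utilities.

```python
def load_matrix(path):
    """Read a whitespace-separated numeric text file into a list of rows."""
    rows = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([float(f) for f in fields])
    return rows


def max_rel_diff(a, b):
    """Return max|a - b| / max|a|, the quantity computed by the MATLAB lines above."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices have different shapes")
    max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    max_ref = max(abs(x) for ra in a for x in ra)
    return max_diff / max_ref


# Intended use (file names taken from the thread):
#   d1 = load_matrix('Run_1.Lcheck6_output.txt')
#   d2 = load_matrix('Run_4.Lcheck6_output.FiniteDiff.txt')
#   print(max_rel_diff(d1, d2))
```

If this and MATLAB agree on the value, the discrepancy lies in the files themselves rather than in how MATLAB reads them.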
This is what I get:
>> version
ans =
'9.8.0.1396136 (R2020a) Update 3'
>> d1 = importdata('Run_1.Lcheck6_output.txt');
d2 = importdata('Run_4.Lcheck6_output.FiniteDiff.txt');
out = max(max(abs(d1-d2))) / max(max(abs(d1)))
out =
6.7340e-04
I am not sure this is matlab though. When I open the files in a text editor, I get the same numbers as when I print out d1 and d2 in matlab.
Maybe there is something fishy going on in the write? My system was throwing an exception for the screen output, saying something was of the wrong type.
I have found part of the reason why my code crashed previously: FORTRAN's formatted write.
The statement
write(ounit,1345) myid, im(ii), in(ii), hessian(ii,:)
...
1345 format("dforce: myid=",i3," ; (",i4,",",i4," ; Hessian = ",64f16.10 " ;")
was throwing an exception because hessian is longer than 64 (which this write expects). Then (apparently under my compiler) it wraps around and crams the 65th element of hessian into the 'i3' write for myid.
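The wrap-around described here matches Fortran's format reversion: when the I/O list is longer than the format, format control restarts on a new record, here from the beginning because the format contains no nested parenthesized group. The following is a deliberately simplified Python model of that bookkeeping, not a Fortran implementation; first_mismatch and the 'I'/'F' labels are illustrative names only. It shows which list item first meets an edit descriptor of the wrong type.

```python
def first_mismatch(descriptors, items):
    """Return (item_index, descriptor) for the first type mismatch, or None.

    descriptors: data edit descriptors in format order, 'I' (integer) or 'F' (real).
    items: the values of the I/O list, in order.
    The modulo models reversion to the start of the format.
    """
    for idx, value in enumerate(items):
        desc = descriptors[idx % len(descriptors)]
        expected = int if desc == "I" else float
        if not isinstance(value, expected):
            return idx, desc
    return None


# Format 1345 from the thread: i3, i4, i4, then 64 repeats of f16.10.
descriptors = ["I", "I", "I"] + ["F"] * 64
# I/O list: myid, im(ii), in(ii), then a 105-element row of the Hessian.
items = [0, 1, 2] + [0.0] * 105

result = first_mismatch(descriptors, items)
print(result)  # (67, 'I'): the 65th Hessian element lands on the i3 descriptor
```

In this model a row of at most 64 reals never triggers the mismatch, which is consistent with the write only failing once the Hessian row is longer than the repeat count.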
My Hessian is length 105. This seems to be correct though from what NGdof should be (OMG that calculation took longer than I expected...)
Is NGdof 105 in your case as well, or is this related to the issue I am having? Could someone share a Lcheck txt file? Also, @abaillod, we could plan a zoom/call to try to work through these last issues so we can get this merge over with!
I have pushed the version with the erroring writes commented out. I do not understand, though, how this code can be running fine for you!
Even when I modify the print statement, the do loop goes
do ii=1, NGdof
write(ounit,1345) myid, im(ii), in(ii), hessian(ii,:)
write(10 ,1347) hessian(ii,:)
enddo
NGDOF is clearly larger than mn, so your systems should be accessing im(ii) and in(ii) out-of-bounds too? Am I crazy?
I haven't followed the entire conversation. My personal suggestion is to run dspec with more careful checks.
@abaillod I notice that you have changed the coordinates in coords.f90 for the cylindrical geometry related to #94, but you have not updated that in stzxyz.f90. Could you update that as well?
@smiet, are you sure the files you attached to your message are really the files that are read by MATLAB? And that they correspond to the toroidal case? If yes, please try comparing:
Run_4.Lcheck6_output.txt and Run_1.Lcheck6_output.FiniteDiff.txt. I get ~1E-8.
Run_1.Lcheck6_output.txt and Run_4.Lcheck6_output.txt. If not close to machine precision, please send me the file Run_1.Lcheck6_output.txt so I can have a look...
Otherwise, your matrix size is fine; I also have a 105x105 matrix.
@zhisong ok, I updated stzxyz.f90. Please double check what I did in the last commit 0131898!
Thanks!
@abaillod I just realized that your cylindrical coordinate in coords.f90 is not consistent with your definition of the basis functions. There is a missing power of half. I suggest you revert both coords.f90 and stzxyz.f90 to the master version. I have them properly implemented and compared to the analytical results in the Zernike branch. They will be fixed as soon as I merge the branch.
@zhisong ok, done.
It is working now!
Cause: The error in formatted output caused the write to crash after the first iteration of the loop. This caused a bad file 'Run_1.Lcheck6_output.txt' to be created, but not closed. This incorrectly formatted file was left in the folder, and subsequently messed up the comparison.
Why this file was not overwritten, I do not understand, but the problem was caused by this incorrect comparison.
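One way to guard against a stale or truncated output file of this kind is to check its shape before running any comparison. A minimal sketch, assuming the Lcheck6 output is a plain-text square matrix with one matrix row per line (105x105 in the toroidal case discussed in this thread); check_square is a hypothetical helper and not part of SPEC.

```python
def check_square(path, expected_n):
    """Raise ValueError unless the file holds an expected_n x expected_n numeric matrix."""
    with open(path) as fh:
        rows = [line.split() for line in fh if line.strip()]
    if len(rows) != expected_n:
        raise ValueError(f"{path}: expected {expected_n} rows, found {len(rows)}")
    for i, row in enumerate(rows, start=1):
        if len(row) != expected_n:
            raise ValueError(
                f"{path}: row {i} has {len(row)} columns, expected {expected_n}"
            )
        for field in row:
            float(field)  # raises ValueError on a non-numeric field
    return True


# Intended use before a comparison (file name taken from the thread):
#   check_square('Run_1.Lcheck6_output.txt', 105)
```

A file left behind by a write that crashed after the first loop iteration would fail the row-count check immediately.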
I am still seeing some issues with the Intel compiler (2017) in dspec that I need to have a look at before approving:
xspech : 0.03 : myid= 0 ; calling hesian ; see .ext.hessian.myid ;
forrtl: severe (194): Run-Time Check Failure. The variable 'get_inverse_beltrami_matrices_$CPUO' is being used in 'dfp200_m.F90(2668,4)' without being defined
Image PC Routine Line Source
dspec 00000000009EE970 get_inverse_beltr 2668 dfp200_m.F90
dspec 00000000009C32C6 dfp200_ 1436 dfp200_m.F90
dspec 0000000000AB9B9C dforce_ 1034 dforce_m.F90
dspec 0000000000BD9A04 hesian_ 531 hesian_m.F90
dspec 0000000000CE172F MAIN__ 831 xspech_m.F90
dspec 0000000000419D1E Unknown Unknown Unknown
libc-2.12.so 00007F0E415FBD20 __libc_start_main Unknown Unknown
dspec 0000000000419C29 Unknown Unknown Unknown
../../../../../dspec ScrewPinch_Nvol3.sp: Signal 66
@abaillod Another minor problem I've found today: rpol and rtor in the physics namelist are not mirrored into the hdf5 output.
The first test works on the draco cluster at IPP. However, I get a difference of 8.7056e-11. This is not exactly machine precision, but I think it is good enough...?
The second test works as well and I get delta = 3.6948e-11.
Both tests were done with 8 MPI tasks on one node. I used the following modules:
1) intel/18.0.5 2) impi/2018.4 3) mkl/2018.4 4) git/2.26 5) fftw-mpi/3.3.8 6) hdf5-serial/1.10.6 7) matlab/R2020a
The convergence study in slab geometry works as well, I think:
The screw pinch convergence study also looks ok; however, I am pretty certain that the x axis is Lrad rather than log10(Lrad) ;-)
The torus case also passes:
All the named tests have passed well, so I approve this PR :-) Great work, @abaillod !
I am going to hit this button since I am desperately waiting for this merge so I can start the pull request for the Zernike branch.
@abaillod forgive me for not letting you do it yourself.
I am very happy to see that this branch has been merged! @zhisong I am totally fine with you merging the branch, I understand that you needed this merge to continue working on the Zernike branch.
Thank you all for the help and the testing!
Issue #103 has been solved and I am now confident that the implementation of the force gradient in the case of Lconstraint=3 is correct. We can thus now start testing the branch ForceGradient_CurrentConstraint on other machines, and hopefully merge soon!
I designed a few tests:
You should obtain a delta close to machine precision, whether you run on a single or multiple CPUs.
Second, check that the force gradient in the case of Lconstraint=1 remains unchanged. I placed in the directory _/ci/G2V32L1FiFG/ a file compare.sp.h5 and its force gradient .compare.GF.ma, obtained with the master branch. Run the case G2V32L1Fi.sp and check that you get the same force gradient, following the same steps as above. On my machine, I get delta~1E-11.
Last, check the convergence (if you have the time; this is not necessary). First, compile SPEC in debug mode. Then, in the directory InputFiles/Verification/currentconstraint/ForceGradient/, you can find three directories:
In any of those directories, you can run the command
and a convergence plot should be generated. In slab geometry, you should observe convergence down to machine precision, in the screw pinch case convergence down to 1E-14, and in the toroidal case down to 1E-12.
I think that if the first two points work without problems at ANU, IPP, SPC and PPPL, we are safe for merging. Please report any bug here; I am happy to solve it!