MethodicalAcceleratorDesign / MAD-NG

MAD Next-Generation official repository
GNU General Public License v3.0

dev-tpsa-new issues #431

Closed mattsignorelli closed 7 months ago

mattsignorelli commented 7 months ago

Running list of issues I find with dev-tpsa-new:

With DESC_USE_TMP=0, all my tests with allocated tpsas pass. With the fix in #434, all of my tests with DESC_USE_TMP=1, using both temporaries and allocated tpsas, pass.

mattsignorelli commented 7 months ago

I have updated my comment on mul with the more specific finding. Still tracking down the seg fault...

ldeniau commented 7 months ago

About the seg fault, have you checked the mad_[c]tpsa_nam function? setnam has been replaced by the new nam, which does both.

mattsignorelli commented 7 months ago

Yes, I have checked that. Something seems to be happening silently, because the last two bugs on the checklist only occur sometimes, not every time.

Never mind, it seems those happen every time; I mixed myself up.

ldeniau commented 7 months ago

One point that has changed in the internal semantics is that the lo bound no longer accounts for a non-zero scalar part. Previously, when manipulating high-order-specific maps around an orbit, all the intermediate orders were processed even though they were filled with zeros. The drawback of this speedup is that coef[0] must always be treated separately (internally).
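
To make the lo/coef[0] convention described above concrete, here is a minimal sketch in C. It assumes a toy univariate series where coef[o] is the coefficient of order o; the toy_tpsa struct, TOY_MO, toy_update and toy_scale are hypothetical names for illustration only and do not reflect the actual GTPSA layout or API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MO 4  /* hypothetical maximum order of the toy series */

/* Toy univariate series: coef[o] is the coefficient of x^o.
   Illustration only, not the GTPSA struct layout. */
typedef struct {
  int    lo, hi;            /* bounds of non-zero orders >= 1; lo > hi means none */
  double coef[TOY_MO+1];
} toy_tpsa;

/* Recompute lo/hi from orders 1..TOY_MO only; order 0 is deliberately ignored. */
void toy_update(toy_tpsa *t)
{
  assert(t != NULL);
  t->lo = TOY_MO+1;
  t->hi = 0;
  for (int o = 1; o <= TOY_MO; o++)
    if (t->coef[o] != 0) {
      if (o < t->lo) t->lo = o;
      if (o > t->hi) t->hi = o;
    }
}

/* Scale in place and return how many coefficients the order loop visited. */
int toy_scale(toy_tpsa *t, double v)
{
  assert(t != NULL);
  int visited = 0;
  t->coef[0] *= v;                      /* scalar part handled separately */
  for (int o = t->lo; o <= t->hi; o++) {
    t->coef[o] *= v;
    visited++;
  }
  return visited;
}
```

With t = 2 + 3 x^4, toy_update gives lo = hi = 4, so toy_scale touches coef[0] explicitly and then visits a single coefficient instead of walking the zero-filled orders 1 to 3. The price is that every operation has to remember the explicit coef[0] step.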

mattsignorelli commented 7 months ago

I see, that makes sense.

The seg fault appears in cot; however, it seems to occur only when I run all the preceding tests first. When I run the cot test alone, everything seems okay.

mattsignorelli commented 7 months ago

Here is the specific output:

[43550] signal (11.1): Segmentation fault
in expression starting at REPL[1]:1
mad_tpsa_inv at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)
mad_tpsa_div at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)
mad_tpsa_cot at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)

libgtpsa.so is the compiled C library

ldeniau commented 7 months ago

mad_tpsa_inv is calling mad_tpsa_scl if v != 1, so before the previous fix, the scalar part was removed. However, I don't see how this would trigger a seg fault...

mattsignorelli commented 7 months ago

Getting closer... something about calls to polar beforehand is related, other functions may be related too, still investigating

ldeniau commented 7 months ago

I don't think the problem comes from a high-level function (mad_tpsa_fun): I didn't change anything there, because these functions rely on lower-level functions. I rather suspect that lo, hi or nz gets corrupted beforehand. This is what I observed with DEBUG=2: one of the mul calls is corrupted during tracking, e.g. through the LHC.

[...]
-> mad_tpsa_mul:442:
-> mad_tpsa_update:281:
mad_tpsa_update:287: 't' { lo=64 hi=255 mo=4 uid=0, did=1 nz=00000 ** bug @ o=0 i=-1 }

adjust0 was broken for nz==0 and coef[0]!=0, fixed in 231a8521

mattsignorelli commented 7 months ago

Ok, perhaps it is one of the low level functions that both polar and cot call within inv

If polar is called three times with tpsa t = 1+x, I get this:

[75324] signal (11.1): Segmentation fault
in expression starting at REPL[6]:1
mad_tpsa_inv at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)
mad_tpsa_div at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)
mad_tpsa_atan2 at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)
mad_ctpsa_polar at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)

This is the minimal working example I found. It has to be done with the same tpsa on each call.

ldeniau commented 7 months ago

Moving to the next debug step: the cdamap used in normal forms after tracking through the HL-LHC somehow gets incorrect input values for minv (with DEBUG=2, no corrupted TPSA is detected):

-> mad_ctpsa_minv:112:
error: mad_tpsa_minv.c:118: : invalid rank-deficient map (1st order has zero row)
../mad: 
stack traceback:
    [C]: in function 'mad_ctpsa_minv'
    madl_damap.mad:594: in function '__pow'
    madl_gphys.mad:1322: in function 'normal'
    madl_twiss.mad:413: in function 'twiss_nform'
    madl_twiss.mad:563: in function 'make_mflow'
    madl_twiss.mad:585: in function 'twiss'

ldeniau commented 7 months ago

I cannot reproduce the problem, even with 20 calls. Here is the sequence of calls for a single polar, which works only with ctpsa, not tpsa:

> t:print()
-> mad_ctpsa_print:391:

 -UNNAMED-:  C, NV =   6, MO =  1
 ******************************************************************************
     I   COEFFICIENT                                      ORDER   EXPONENTS
     1   1.0000000000000000E+00 +0.0000000000000000E+00i    0     0 0  0 0  0 0
<- mad_ctpsa_print:436:
> MAD.gmath.polar(t,t)
-> mad_ctpsa_polar:139:
-> mad_tpsa_new:191:
-> mad_tpsa_init:169:
<- mad_tpsa_init:172:
<- mad_tpsa_new:197:
-> mad_tpsa_new:191:
-> mad_tpsa_init:169:
<- mad_tpsa_init:172:
<- mad_tpsa_new:197:
-> mad_tpsa_new:191:
-> mad_tpsa_init:169:
<- mad_tpsa_init:172:
<- mad_tpsa_new:197:
-> mad_ctpsa_real:27:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_ctpsa_real:35:
-> mad_ctpsa_imag:50:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_ctpsa_imag:58:
-> mad_tpsa_hypot:786:
-> mad_tpsa_axypbvwpc:878:
-> mad_tpsa_new:191:
-> mad_tpsa_init:169:
<- mad_tpsa_init:172:
<- mad_tpsa_new:197:
-> mad_tpsa_mul:442:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_tpsa_mul:505:
-> mad_tpsa_mul:442:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_tpsa_mul:505:
-> mad_tpsa_axpbypc:822:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_tpsa_axpbypc:830:
-> mad_tpsa_del:203:
<- mad_tpsa_del:205:
<- mad_tpsa_axypbvwpc:886:
-> mad_tpsa_sqrt:212:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_tpsa_sqrt:219:
<- mad_tpsa_hypot:792:
-> mad_tpsa_atan2:592:
-> mad_tpsa_div:511:
-> mad_tpsa_scl:267:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_tpsa_scl:277:
<- mad_tpsa_div:523:
-> mad_tpsa_atan:776:
-> mad_tpsa_setval:273:
<- mad_tpsa_setval:275:
<- mad_tpsa_atan:781:
<- mad_tpsa_atan2:609:
-> mad_ctpsa_cplx:73:
-> mad_ctpsa_setval:273:
<- mad_ctpsa_setval:275:
<- mad_ctpsa_cplx:82:
-> mad_tpsa_del:203:
<- mad_tpsa_del:205:
-> mad_tpsa_del:203:
<- mad_tpsa_del:205:
<- mad_ctpsa_polar:147:
> t:print()
-> mad_ctpsa_print:391:

 -UNNAMED-:  C, NV =   6, MO =  1
 ******************************************************************************
     I   COEFFICIENT                                      ORDER   EXPONENTS
     1   1.0000000000000000E+00 +0.0000000000000000E+00i    0     0 0  0 0  0 0
<- mad_ctpsa_print:436:

mattsignorelli commented 7 months ago

OK, I found one problem: I am compiling with DESC_USE_TMP = 1, and I see that after a call to polar, ti in the descriptor is 1. After another call to polar it is 2. Then it seg faults on the third call. So one of the temporaries used internally by polar (or by one of its internal calls) is not being released, at least with DESC_USE_TMP=1.

If the same ctpsa is used consecutively, it seg faults on the third call. If other ctpsas are used in between, it seg faults on the 6th call. This happens in tests where I am not using temporaries myself, so the handling is internal to the polar call.
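
The symptom described above (ti growing by one per call until a crash) is what a pool of preallocated temporaries does when an operation acquires more than it releases. Below is a minimal sketch in C of that mechanism. Everything named toy_* and the capacity TOY_MAX_TMP are assumptions made for illustration; they are not the GTPSA descriptor, its real capacity, or its API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TMP 8   /* arbitrary capacity, not the real DESC_MAX_TMP */

typedef struct { double coef[4]; } toy_tpsa;

/* Toy descriptor owning a fixed pool of temporaries. */
typedef struct {
  toy_tpsa slot[TOY_MAX_TMP]; /* preallocated temporaries */
  int      ti;                /* index of the next free slot */
} toy_desc;

/* Hand out the next temporary, or NULL when the pool is exhausted.
   Code that skips this bounds check would index past the pool instead. */
toy_tpsa *toy_gettmp(toy_desc *d)
{
  if (d->ti >= TOY_MAX_TMP) return NULL;
  return &d->slot[d->ti++];
}

/* Give back the most recently acquired temporary. */
void toy_reltmp(toy_desc *d)
{
  assert(d->ti > 0);
  d->ti--;
}

/* Balanced operation: two acquisitions, two releases; ti is unchanged. */
int toy_op_balanced(toy_desc *d)
{
  toy_tpsa *a = toy_gettmp(d);
  toy_tpsa *b = toy_gettmp(d);
  if (!a || !b) return -1;
  toy_reltmp(d);
  toy_reltmp(d);
  return 0;
}

/* Leaky operation: two acquisitions, one release; ti grows by one per call. */
int toy_op_leaky(toy_desc *d)
{
  toy_tpsa *a = toy_gettmp(d);
  toy_tpsa *b = toy_gettmp(d);
  if (!a || !b) return -1;    /* pool exhausted */
  toy_reltmp(d);              /* releases b; the release of a is missing */
  return 0;
}
```

In this sketch each call to toy_op_leaky leaves ti one higher than before, so with a capacity of 8 the seventh call still succeeds and the eighth reports exhaustion. How many calls the real library survives depends on its actual pool size and on how many temporaries each call needs.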

mattsignorelli commented 7 months ago

I will try again now compiling with DESC_USE_TMP=0

mattsignorelli commented 7 months ago

All my tests with allocated tpsas pass with the latest dev-tpsa-new for DESC_USE_TMP = 0

With DEBUG=2 and DESC_USE_TMP = 1, calling polar just once aborts with the following output:

-> mad_ctpsa_polar:139:
-> mad_ctpsa_real:27:
<- mad_ctpsa_real:44:
-> mad_ctpsa_imag:50:
<- mad_ctpsa_imag:67:
-> mad_tpsa_hypot:786:
-> mad_tpsa_axypbvwpc:878:
-> mad_tpsa_mul:442:
-> mad_tpsa_setval:277:
<- mad_tpsa_setval:279:
<- mad_tpsa_mul:505:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_axpbypc:822:
<- mad_tpsa_axpbypc:844:
<- mad_tpsa_axypbvwpc:886:
-> mad_tpsa_sqrt:212:
-> mad_tpsa_copy:299:
<- mad_tpsa_copy:316:
-> mad_tpsa_scl:267:
<- mad_tpsa_scl:282:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
<- mad_tpsa_sqrt:233:
<- mad_tpsa_hypot:792:
-> mad_tpsa_atan2:592:
-> mad_tpsa_div:511:
-> mad_tpsa_inv:158:
-> mad_tpsa_copy:299:
<- mad_tpsa_copy:316:
-> mad_tpsa_scl:267:
<- mad_tpsa_scl:282:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
-> mad_tpsa_mul:442:
-> mad_tpsa_update:285:
<- mad_tpsa_update:291:
<- mad_tpsa_mul:505:
-> mad_tpsa_acc:288:
<- mad_tpsa_acc:311:
<- mad_tpsa_inv:178:
-> mad_tpsa_mul:442:
-> mad_tpsa_setval:277:
<- mad_tpsa_setval:279:
-> mad_tpsa_copy:299:
<- mad_tpsa_copy:316:
<- mad_tpsa_mul:505:
<- mad_tpsa_div:531:
-> mad_tpsa_atan:776:
-> mad_tpsa_setval:277:
<- mad_tpsa_setval:279:
<- mad_tpsa_atan:781:
<- mad_tpsa_atan2:609:
-> mad_ctpsa_cplx:73:
<- mad_ctpsa_cplx:95:
julia: /home/matt/tpsa/gtpsa/code/mad_tpsa_impl.h:202: mad_tpsa_reltmp: Assertion `d->t[ tid*DESC_MAX_TMP + d->ti[tid]-1 ] == tmp' failed.

[159510] signal (6.-6): Aborted
in expression starting at REPL[5]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f3414c1871a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
mad_tpsa_reltmp at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)
mad_ctpsa_polar at /home/matt/.julia/dev/GTPSA_jll/override/lib/libgtpsa.so (unknown line)
mad_ctpsa_polar! at /home/matt/.julia/dev/GTPSA/src/low_level/ctpsa.jl:556 [inlined]

mattsignorelli commented 7 months ago

OK, I found the problem and submitted a PR. One temporary in polar wasn't being released.

All my tests pass now, though I do not yet have tests for any of the map methods.
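
For readers wondering why a single missing release trips the mad_tpsa_reltmp assertion quoted earlier in this thread, here is a minimal sketch in C of a last-in-first-out release check built around the same condition, d->t[tid*MAX + d->ti[tid]-1] == tmp. The toy_* names, the two capacity constants and the error-code return (instead of an abort) are assumptions for illustration, not the library's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TMP 4   /* hypothetical temporaries per thread */
#define TOY_MAX_THR 2   /* hypothetical number of threads */

typedef struct { double coef[4]; } toy_tpsa;

typedef struct {
  toy_tpsa  store[TOY_MAX_THR*TOY_MAX_TMP]; /* backing storage */
  toy_tpsa *t    [TOY_MAX_THR*TOY_MAX_TMP]; /* flat table of temporaries */
  int       ti   [TOY_MAX_THR];             /* temporaries in use, per thread */
} toy_desc;

/* Point every table entry at its storage and mark all pools empty. */
void toy_init(toy_desc *d)
{
  for (int i = 0; i < TOY_MAX_THR*TOY_MAX_TMP; i++) d->t[i] = &d->store[i];
  for (int i = 0; i < TOY_MAX_THR; i++) d->ti[i] = 0;
}

/* Acquire the next temporary of thread tid. */
toy_tpsa *toy_gettmp(toy_desc *d, int tid)
{
  assert(d->ti[tid] < TOY_MAX_TMP);
  return d->t[tid*TOY_MAX_TMP + d->ti[tid]++];
}

/* Release tmp only if it is the most recently acquired temporary of tid.
   Returns 0 on success and -1 where the quoted assertion would fail. */
int toy_reltmp(toy_desc *d, int tid, toy_tpsa *tmp)
{
  if (d->ti[tid] < 1 || d->t[tid*TOY_MAX_TMP + d->ti[tid]-1] != tmp)
    return -1;
  d->ti[tid]--;
  return 0;
}
```

If a function acquires a, b and c, releases c, and then tries to release a while b is still outstanding, toy_reltmp returns -1 because b, not a, is on top of the stack. Releasing b first and then a restores ti to zero, which is the invariant every operation should leave behind.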

ldeniau commented 7 months ago

I am closing this issue for now.