Memory access errors - Githubissues

jsenellart commented 2 years ago

Hello @demisjohn !

When running the test cases on ARM - I have found several memory access errors using clang AddressSanitizer that do not seem to be related at all to the python 3 port - maybe the ARM compilation being more strict and revealing them. Technically I am using clang AddressSanitizer to find these problems.

this PR fixes all these issues mainly due to missing boundary checks, but also deprecated numpy interface.

There is a little bit of logic change in patterson_z_n but it seems to be working well.

There is one location where I am a bit puzzled: https://github.com/demisjohn/CAMFR/pull/22/files#diff-310bd0d7c8ef36392b1780e6ea45c0b716feece1deb267c7e0521f3770a45a69R86-R89

I fixed the actual boundary check - but wonder about the idx calculation (it is as it was before).

now - all the tests are running without any crash - and the boundary checks added have necessarily improved the determinism of the results :).

One single unit test is failing now:

======================================================================
FAIL: testbackward (backward.backward)
backward
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/senellart/DEV/3rdParty/CAMFR/testsuite/backward.py", line 48, in testbackward
    self.assertTrue(R_pass and T_pass)
AssertionError: False is not true

but I don't have a clue what is wrong here - need help from an expert.

To close on the py35_compat branch and completely merge the code, I think this test unit should be fixed, and I can clean-up a bit more on my side the setup.py which is using deprecated python.

demisjohn commented 2 years ago

Hi @jsenellart , Thanks for taking the time to work on this - Cpp memory errors being something I would never even start to work on, despite them being major problems in the CAMFR code (I did used to see the mode solvers crash in the past with SegFaults, fairly regularly, and never attempted to fix). So it's possible this is the cause for crashes regardless of ARM or other processors.

Regarding unittest fail, here's what I see: in testsuite/backward.py

A 1-D "Circular" optical waveguide (ie. like an optical fiber but no "length") is generated, and 20 waveguide modes calculated. Then the Reflection & Transmission coefficients (complex fraction of power that is R and T'd) is calc'd, and this is compared to a hard-coded value for those R/T coefficients (R_OK and T_OK).

L36:

R = s.R12(0,0)
R_OK = -0.0392923220796+0.0408718742985j
print R, "expected", R_OK
R_pass = abs((R - R_OK) / R_OK) < eps.testing_eps

T = s.T12(0,0)
T_OK = 0.202336029811+0.776634435067j
print T, "expected", T_OK
T_pass = abs((T - T_OK) / T_OK) < eps.testing_eps

And you can see there's some slack in the comparison using eps.testing_eps (I assume thats something like 1e-5?)

To investigate further, 1) print out the values of R & T 2) print out the vlaues of R_pass and T_pass 3) find out value of eps.testing_eps so we know what error is considered acceptable. and we can see how far off the mode-solving calculation is from passing the accuracy check. If it's close, maybe we just increase the error tolerance? 4) What is the order of unittests - would be useful to know what tests it passed, and what tests were never run because this failed (for example, would ALL modesolver tests actually fail, but this one just came first?).

This is a critical check, as the accuracy of these 1-D modesolves affects every other complex calculation one may do with CAMFR.

–––––– Regarding the new boundary checks: Unfortunately I'm not very familiar with the detailed calculus implementations in this module. These low-level linear algebra funcs are surely used in the EME mode calcs, but I can't really picture exactly what values would be passed to them when they go out of bounds, nor how much they would change the final mode calc.

Some things to investigate here:

What are the values of 2*N-1 vs. G.size()? Since N is constructed from G.size() in the first place, why does the original loop logic go beyond the array? G & N came from allroots.cpp, L60 unsigned int N = (unsigned int)(ceil(G.size() / 2.0)); vector<Complex> G = integrals / 2. / pi / I;
Where does I (above) come from? Could the original out-of-index error originate there?
is c getting fully filled with values during (L95): c = solve_sym(A,B); or is the new boundary condition on B causing this to now return an unset value somewhere in the array?

I hope that provides some useful leads.

demisjohn commented 2 years ago

@jsenellart I've added you as a collaborator to this repo - please feel free to merge pull requests as you deem fit.

I'm happy to comment and help where I can, won't be contributing much code unless it's in Python - I'm an engineer (photonics and semiconductor mfg.) with only end-goal type of programming interests, so can't delve nearly as deep as you can into Cpp and such! I do have a vested interest in improving CAMFR though, and would be thrilled if we can get this Py3x compatible.

jsenellart commented 2 years ago

And you can see there's some slack in the comparison using eps.testing_eps (I assume thats something like 1e-5?)

To investigate further,

print out the values of R & T

print out the vlaues of R_pass and T_pass

find out value of eps.testing_eps so we know what error is considered acceptable. and we can see how far off the mode-solving calculation is from passing the accuracy check. If it's close, maybe we just increase the error tolerance?

What is the order of unittests - would be useful to know what tests it passed, and what tests were never run because this failed (for example, would ALL modesolver tests actually fail, but this one just came first?).

This is a critical check, as the accuracy of these 1-D modesolves affects every other complex calculation one may do with CAMFR.

Thanks for the explanation for the fail test, I will compile on another OS to check if it is a portability issue, and will try also to understand where it goes wrong.

@jsenellart I've added you as a collaborator to this repo - please feel free to merge pull requests as you deem fit.

I'm happy to comment and help where I can, won't be contributing much code unless it's in Python - I'm an engineer (photonics and semiconductor mfg.) with only end-goal type of programming interests, so can't delve nearly as deep as you can into Cpp and such! I do have a vested interest in improving CAMFR though, and would be thrilled if we can get this Py3x compatible.

Thanks for your trust, I have no problem helping out on the programming issues and as soon as this remaining issue is handled, will look at the other open issues and help for the packaging.

Cpp memory errors being something I would never even start to work on, despite them being major problems in the CAMFR code (I did used to see the mode solvers crash in the past with SegFaults, fairly regularly, and never attempted to fix).

Regarding crashes, I handled all of the problems triggered in the tests - if you have any other one that you can randomly reproduce, I will be able to have a look too !

demisjohn commented 2 years ago

Issue #23 shows a memory error and possible trigger. Any idea if that is related to these fixes?

kitchenknif commented 2 years ago

print out the values of R & T

print out the vlaues of R_pass and T_pass

find out value of eps.testing_eps so we know what error is considered acceptable. and we can see how far off the mode-solving calculation is from passing the accuracy check. If it's close, maybe we just increase the error tolerance?

What is the order of unittests - would be useful to know what tests it passed, and what tests were never run because this failed (for example, would ALL modesolver tests actually fail, but this one just came first?).

This is a critical check, as the accuracy of these 1-D modesolves affects every other complex calculation one may do with CAMFR.

R: (-0.03929232207960088+0.04087187429849853j) expected (-0.0392923220796+0.0408718742985j) T: (-0.20233602981136606-0.7766344350673391j) expected (0.202336029811+0.776634435067j)
As per the values above R_pass is True and T_pass is false
testing_eps = 1e-4, but the error is "*-1"

I'm also getting failures on backward, metalsplitter, rods, substacks but only if I change the filenames to the standard filenames expected by python unittest - "test*.py" and run them using "python3 -m unittest discover". On their own, all the test continue passing, which I think is weird.

There are also two tests that are disabled - PhC_splitter, which passes and stack2, which fails.

So, on their own, all tests except stack2 and backward pass, but when run together with the standard unittest toolkit, more of them seem to fail.

demisjohn commented 1 year ago

@jsenellart Looking at this again - I notice that the erroneous T value is exactly 180° out of phase (sign flipped on real/imaginary parts of T). So this really does change the physical electromagnetic answer generated. Phase of the transmission matrix may be a critical parameter. So I'm not sure how your boundary additions caused that.

Maybe other ways to remedy the memory leaks, by finding out where they occurred in the first place. Here were my suggestion on that tack:

What are the values of 2*N-1 vs. G.size()?
Since N is constructed from G.size() in the first place, why does the original loop logic go beyond the array? -- G & N came from allroots.cpp, L60 -- unsigned int N = (unsigned int)(ceil(G.size() / 2.0)); -- vector<Complex> G = integrals / 2. / pi / I;
Where does I (above) come from? Could the original out-of-index error originate there?
is c getting fully filled with values during (L95): -- c = solve_sym(A,B); -- or is the new boundary condition on B causing this to now return an unset value somewhere in the array?

jsenellart commented 1 year ago

Hello @demisjohn, thanks for the pointers, I actually came back to the code base recently to fix few issues, and add the support of Python 3.11 that was not working, before commit I will have another review pass based on your comments and the observation from @kitchenknif. I will be back to this soon (normally before end of March).

demisjohn / CAMFR

Memory access errors #22