Thanks for reporting this and for tracking down the underlying problem; I'm sure that was some tedious debugging.
If I understand the bug correctly, this case in _trapz_rebin_1d should only occur if the last edge is outside the range of the input x, i.e. edges[-1] > x[-1]. In that case the trapz integration of the last bin is ill-defined, since the bin covers wavelengths outside of the data, and the non-jitted trapz_rebin wrapper treats this as an error case with

if (edge_min < xmin or edge_max > xmax):
    raise ValueError('edges must be within input x range')
By default xmin = x[0]*(1+myz.max()) and xmax = x[-1]*(1+myz.min()), and I don't think this error condition should occur, but if the user provides different xmin, xmax, edge_min, edge_max (as an optimization) then those values are trusted, and the chain trapz_rebin -> _trapz_rebin_batch(x, y, edges, myz, ...) -> _trapz_rebin_1d(redshifted_x, ...) could violate the edge condition.
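To make those default bounds concrete, here is a small numeric sketch (the grid and redshift values are made up purely for illustration):

import numpy as np

x = np.linspace(3000.0, 11000.0, 9)   #- hypothetical rest-frame template grid
myz = np.array([-0.002, 0.0, 0.002])  #- hypothetical trial redshifts

#- Each trial z covers observed wavelengths [x[0]*(1+z), x[-1]*(1+z)];
#- the range valid at every z is the intersection, hence the max/min.
xmin = x[0] * (1 + myz.max())   #- tightest lower bound over all z
xmax = x[-1] * (1 + myz.min())  #- tightest upper bound over all z
print(xmin, xmax)               #- edges outside [xmin, xmax] are an error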
@amolaeinezhad could you explain more about the rare case that is triggering this bug? It appears that your proposed fix would avoid the segfault, but it would also silently skip calculating the last ill-defined bin, when I think this should be an explicit error case if we are asking trapz_rebin to integrate outside the bounds of the data. Or if there is a case I've missed that triggers this problem even when the edges are within the data, please explain that. Thanks.
Hi @sbailey. Thanks for getting back to me so quickly. I wanted to dig into this bug a bit more. When we run Redrock with the WD template on a low signal-to-noise WEAVE spectrum (R=5000) in both the Blue (3800.0-5800.0 Å) and Red (5901.0-9282.0 Å) arms, everything goes smoothly until it hits the "fitz" function. Then, things get a bit wild. Check out what we get from this line of code:
zmin, sigma, chi2min, zwarn = minfit(zz[i-1:i+2], zzchi2[i-1:i+2])
The values returned are as follows:
As you can see, these results are way out of bounds; the code should ideally catch this in the try/except part, but it just cruises on. Next, it goes through "rebin_template" and then "trapz_rebin". Here's where it should be checking the edges[-1] > x[-1] condition, just as you mentioned, with code like this:
if (edge_min < xmin or edge_max > xmax):
    raise ValueError('edges must be within the input x range')
However, in the "trapz_rebin" function, that check (the edges[-1] > x[-1] condition via "if edge_min < xmin or edge_max > xmax") only applies to the unredshifted x, so it happily skips along. In fact, if xmin or xmax is set to None, the code automatically applies the redshift to them, but not if xmin or xmax is already given by the parent function. In other words, since xmin and xmax have been provided by the parent function, the code effectively runs the edges[-1] > x[-1] condition on the unredshifted wavelengths.
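For clarity, here is a minimal paraphrase of that behavior (a hedged sketch, not redrock's exact code; the argument names stand in for the function's actual variables):

def check_edges(x, myz, edge_min, edge_max, xmin=None, xmax=None):
    #- Hedged paraphrase of the behavior described above: the redshift is
    #- applied to the bounds only when they default to None, while
    #- caller-supplied xmin/xmax are trusted as-is, so the check can end up
    #- comparing redshifted edges against an unredshifted x range.
    if xmin is None:
        xmin = x[0] * (1 + myz.max())
    if xmax is None:
        xmax = x[-1] * (1 + myz.min())
    if edge_min < xmin or edge_max > xmax:
        raise ValueError('edges must be within the input x range')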
Now, it all starts to fall apart in the next stage, where we dive into "_trapz_rebin_batch" and "_trapz_rebin_1d". These functions operate on the redshifted x, for which the edges are no longer within range (i.e. edges[-1] > x[-1] does occur), and this eventually results in an index error in the while loops (or a segmentation fault when jit is enabled).
So, here's the fix suggestion: check the edges[-1] > x[-1] condition on the redshifted x. This should keep things in line. Alternatively, better handle the out-of-bounds initial redshift guess from minfit at the very beginning, by adding IndexError to the exception handling in the fitz function:
except (ValueError, IndexError) as err:
    if zmin < redshifts[0] or redshifts[-1] < zmin:
        #- beyond redshift range can be invalid for template
        coeff = np.zeros(template.nbasis)
        zwarn |= ZW.Z_FITLIMIT
        zwarn |= ZW.BAD_MINFIT
    else:
        #- Unknown problem; re-raise error
        raise err
I hope this helps clarify the situation, and if you need more info or anything else, feel free to holler.
Cheers,
Thanks for the further details. I absolutely agree that there is a problem, but I want to make sure that we solve the core underlying issue. A few comments:
In your return values from fitz.py L213
zmin, sigma, chi2min, zwarn = minfit(zz[i-1:i+2], zzchi2[i-1:i+2])
there is zwarn=1024, which is the flag for BAD_MINFIT, i.e. Redrock already knows at that point that its initial estimate for that minimum is way off, and it arguably shouldn't even bother entering the try/except block to refine it. As a side note, it might be interesting for you to inspect zzchi2 vs. zz to see why bin i was selected as a minimum even though the parabola fit is super bad. But that's admittedly not the core problem -- even with a bad zmin, Redrock shouldn't hit a range error segfault anyway.
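As an aside, since zwarn=1024 == 2**10 is the BAD_MINFIT bit, an early-out along these lines could skip the refinement entirely; this is a rough sketch of the idea, not a proposed patch:

zwarn = 1024        #- the minfit return value in your example
BAD_MINFIT = 2**10  #- bit value inferred from zwarn=1024 above
if zwarn & BAD_MINFIT:
    print('parabola fit already flagged as bad; arguably skip refinement')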
So, here's the fix suggestion: check the "edges[-1] > x[-1]" condition on the redshifted x.
Agreed, but there are already several places where it is attempting to do that, so I want to understand why those attempts aren't working, e.g. within rebin.py rebin_template L490-L493:
if (xmin is None):
xmin = template.minwave*(1+myz.max())
if (xmax is None):
xmax = template.maxwave*(1+myz.min())
i.e. it's trying to consider the x min/max range that can be covered by the template wavelengths given the input redshifts myz.
I suspect the bug (or at least a bug) is back in fitz.py fitz L221-L222:
xmin = template.minwave*(1+zmin)
ximax = template.maxwave*(1+zmin)
Note ximax instead of xmax, i.e. xmax isn't getting updated for the new zmin, and then it is getting passed into rebin_template, which prevents rebin_template from updating xmax internally, so it proceeds with the wrong xmax, which could cause the indexing error.
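With that typo fixed, those lines would read:

xmin = template.minwave*(1+zmin)
xmax = template.maxwave*(1+zmin)  #- was `ximax`, leaving the real xmax stale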
Could you try fixing that L222 ximax typo in your local copy and rerunning to see if that fixes the problem? If not, we'll need to dig deeper to see why the existing redshifted-x checks aren't working (that might already be clear to you, but it isn't yet clear to me). Thanks.
Thank you for providing a solution! I've implemented the fix by correcting the typo (ximax --> xmax) in fitz.py at lines 221-222, and indeed, the segmentation fault issue seems to be resolved.
Fixed in PR #258. Thanks for the report and the patience debugging the root cause.
Issue Summary: Segmentation Fault in Redrock when running on WEAVE data
Description: In rare instances, while running Redrock on WEAVE data (especially on very low SNR spectra), a segmentation fault occurs, originating from the rebin._trapz_rebin_1d function. A detailed investigation (after disabling numba.jit, isolating the problematic cases, and running the code in non-parallel mode) revealed that the issue lies in the following lines of code in rebin._trapz_rebin_1d():
and
In some very uncommon cases, the variable j exceeds the length of the array x, resulting in an IndexError.
Proposed Solution: To mitigate this issue, I recommend modifying these two while-loops as follows:
and
By implementing these changes, j will never reach the size of array x, thereby preventing the IndexError from occurring.
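Since the exact loop bodies aren't reproduced above, here is a hedged illustration of the kind of guard being proposed, assuming a seek-past-the-edge loop over x (an illustrative sketch, not redrock's actual source):

import numpy as np

def seek_past_edge(x, edge, j):
    #- Guarded version of a "seek the next sample beyond the bin edge"
    #- loop: the added `j < len(x)` test keeps j from indexing past the
    #- end of x, which would raise IndexError in plain Python (or
    #- segfault under numba.jit, which skips bounds checking).
    while j < len(x) and x[j] <= edge:
        j += 1
    return j

x = np.linspace(3800.0, 9282.0, 100)
print(seek_past_edge(x, 9282.0, 0))  #- len(x), instead of an out-of-bounds read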
Additional Information: