desihub / redrock

Redshift fitting for spectroperfectionism
BSD 3-Clause "New" or "Revised" License

Segmentation Fault in Redrock when running on WEAVE data #256

Closed · amolaeinezhad closed 11 months ago

amolaeinezhad commented 1 year ago

Issue Summary: Segmentation Fault in Redrock when running on WEAVE data

Description: In rare instances, while running Redrock on WEAVE data, especially on very low-SNR spectra, a segmentation fault occurs, originating in the rebin._trapz_rebin_1d function. A detailed investigation (after disabling the numba JIT, isolating the problematic cases, and running the code in non-parallel mode) revealed that the issue lies in the following lines of code in rebin._trapz_rebin_1d():

while x[j] < edges[i]:
    j += 1

and

while x[j+1] < edges[i+1]:
    j += 1

In some very uncommon cases, the index j runs past the end of the array x, resulting in an IndexError.
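For illustration, here is a minimal standalone sketch of the failure mode (not the actual _trapz_rebin_1d code). Note that numba compiles without bounds checking by default, which is why the pure-Python IndexError becomes a segmentation fault under the JIT:

import numpy as np
import numba

@numba.jit(nopython=True)
def scan_past_end(x, edge):
    # Mimics the unguarded scan above: if edge > x[-1], the condition
    # never fails within range and j walks past the end of x. Pure
    # Python raises IndexError; numba's nopython mode skips bounds
    # checks by default, so the out-of-range read can crash the process.
    j = 0
    while x[j] < edge:
        j += 1
    return j

x = np.linspace(3800.0, 5800.0, 100)
# scan_past_end(x, 6000.0)  # edge beyond x[-1]: may segfault under the JIT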

Proposed Solution: To mitigate this issue, I recommend modifying these two while-loops as follows:

while x[j] < edges[i] and j < len(x) - 2:
    j += 1

and

while x[j+1] < edges[i+1] and j < len(x) - 2:
    j += 1

By implementing these changes, j will never reach the size of array x, thereby preventing the IndexError from occurring.
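As a quick sanity check of the guarded loop (a sketch with made-up numbers, not the redrock code):

import numpy as np

def guarded_scan(x, edge):
    # Guarded version of the scan: the extra condition clamps j before
    # it can run off the array (sketch of the proposed fix above).
    j = 0
    while x[j] < edge and j < len(x) - 2:
        j += 1
    return j

x = np.linspace(3800.0, 5800.0, 100)
print(guarded_scan(x, 6000.0))  # 98 == len(x) - 2: clamped, no IndexError

Note that the clamping is silent, which is the trade-off discussed in the reply below.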

Additional Information:

Environment: non-GPU mode, non-MPI

sbailey commented 1 year ago

Thanks for reporting this and for tracking down the underlying problem; I'm sure that was some tedious debugging.

If I understand the bug correctly, this case in _trapz_rebin_1d should only occur if the last edge is outside the range of the input x, i.e. edges[-1] > x[-1]. In that case the trapz integration of the last bin is ill-defined, since it covers wavelengths outside of the data, and the non-jitted trapz_rebin wrapper treats this as an error case with

    if (edge_min < xmin or edge_max > xmax):
        raise ValueError('edges must be within input x range')

By default xmin = x[0]*(1+myz.max()) and xmax = x[-1]*(1+myz.min()), and I don't think this error condition should occur; but if the user provides different xmin, xmax, edge_min, edge_max (as an optimization), those are trusted, and the chain trapz_rebin -> _trapz_rebin_batch(x, y, edges, myz, ...) -> _trapz_rebin_1d(redshifted_x, ...) could violate the edge condition.
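Schematically, the wrapper's check and its defaults look like this (an illustrative sketch, not the exact redrock code):

import numpy as np

def check_edges(x, edges, myz, xmin=None, xmax=None):
    # The defaults are the observed-frame range covered by the template
    # at *every* trial redshift in myz, i.e. the intersection over z.
    if xmin is None:
        xmin = x[0] * (1 + myz.max())
    if xmax is None:
        xmax = x[-1] * (1 + myz.min())
    # Caller-supplied xmin/xmax are trusted as-is, which is the loophole
    # discussed in the rest of this thread.
    if edges[0] < xmin or edges[-1] > xmax:
        raise ValueError('edges must be within input x range')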

@amolaeinezhad could you explain more about what your rare case is that is triggering this bug? It appears that your proposed fix would avoid the segfault, but also silently not calculate the last ill-defined bin when I think it should be more of an explicit error case if we are asking trapz_rebin to integrate outside of the bounds of the data. Or if there is a case that I've missed that triggers this problem even when edges are within the data, please explain that. Thanks.

amolaeinezhad commented 1 year ago

Hi @sbailey. Thanks for getting back to me so quickly. I wanted to dig into this bug a bit more. When we run Redrock with the WD template on a low signal-to-noise WEAVE spectrum (R=5000) in both the Blue (3800.0-5800.0 Å) and Red (5901.0-9282.0 Å) arms, everything goes smoothly until it hits the "fitz" function. Then, things get a bit wild. Check out what we get from this line of code:

zmin, sigma, chi2min, zwarn = minfit(zz[i-1:i+2], zzchi2[i-1:i+2])

The returned values, among them zwarn=1024, are way out of bounds, and the code should ideally catch this during the try/except part, but it just cruises on. It then goes through "rebin_template" and "trapz_rebin". Here's where it should be checking for the "edges[-1] > x[-1]" condition, just as you mentioned, with code like this:

if (edge_min < xmin or edge_max > xmax):
    raise ValueError('edges must be within input x range')

However, in the "trapz_rebin" function, the check (the "edges[-1] > x[-1]" condition via "if edge_min < xmin or edge_max > xmax") only applies to the unredshifted x, so it happily skips along. If xmin or xmax is set to "None", the code automatically applies the redshift to them, but not if xmin or xmax are already given by the parent function. In other words, since xmin and xmax have been provided by the parent function, the code runs the edges[-1] > x[-1] condition on the unredshifted wave.

Now, it all starts to fall apart in the next stage, where we dive into "_trapz_rebin_batch" and "_trapz_rebin_1d". These functions deal with the redshifted x, which no longer satisfies the edge condition. This eventually results in an IndexError in the while conditions (or a segmentation fault when the JIT is enabled).
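To make the mismatch concrete, here is a toy version of the chain (made-up grids and redshift, not the redrock internals):

import numpy as np

x = np.linspace(3000.0, 11000.0, 200)      # unredshifted template grid
edges = np.linspace(5901.0, 9282.0, 50)    # observed bin edges (red arm)
z = -0.3                                   # a badly out-of-range trial z

# A check against the *unredshifted* x passes:
assert x[0] <= edges[0] and edges[-1] <= x[-1]

# But the scan operates on the *redshifted* x, which tops out at 7700 A:
xz = x * (1 + z)
try:
    j = 0
    while xz[j] < edges[-1]:   # edges[-1] = 9282 > xz[-1] = 7700
        j += 1
except IndexError:
    print("scan ran past the end of the redshifted grid")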

So, here's the fix suggestion: check the "edges[-1] > x[-1]" condition on the redshifted x. That should keep things in line. Alternatively, handle the out-of-bounds initial redshift guess from minfit better at the very first step, by adding IndexError to the exception handling in the fitz function:

        except (ValueError, IndexError) as err:
            if zmin<redshifts[0] or redshifts[-1]<zmin:
                #- beyond redshift range can be invalid for template
                coeff = np.zeros(template.nbasis)
                zwarn |= ZW.Z_FITLIMIT
                zwarn |= ZW.BAD_MINFIT
            else:
                #- Unknown problem; re-raise error
                raise err

I hope this helps clarify the situation, and if you need more info or anything else, feel free to holler.

Cheers,

sbailey commented 1 year ago

Thanks for the further details. I absolutely agree that there is a problem, but I want to make sure that we solve the core underlying problem. A few comments:

In your return values from fitz.py L213

zmin, sigma, chi2min, zwarn = minfit(zz[i-1:i+2], zzchi2[i-1:i+2])

there is zwarn=1024, which is the flag for BAD_MINFIT, i.e. Redrock already knows at that point that its initial estimate for that minimum is way off, and arguably it shouldn't even enter the try/except block to refine it. As a side note, it might be interesting for you to inspect zzchi2 vs. zz to see why bin i was selected as a minimum even though the parabola fit is super bad. But that's admittedly not the core problem: even with a bad zmin, Redrock shouldn't hit a range-error segfault anyway.
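For reference, zwarn is a bitmask; a minimal sketch of testing it, using the flag value quoted above (with redrock's ZWarningMask convention assumed):

# 1024 == 2**10 is BAD_MINFIT in redrock's ZWarningMask convention.
BAD_MINFIT = 2**10

zwarn = 1024
if zwarn & BAD_MINFIT:
    print("parabola fit to the chi2 minimum was bad; refinement is suspect")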

So, here's the fix suggestion: check the "edges[-1] > x[-1]" condition on the redshifted x.

Agreed, but there are already several places where it attempts to do that, so I want to understand why those attempts aren't working, e.g. within trapzrebin.py rebin_template L490-L493:

if (xmin is None):
    xmin = template.minwave*(1+myz.max())
if (xmax is None):
    xmax = template.maxwave*(1+myz.min())

i.e. it's trying to consider the x min/max range that can be covered by the template wavelengths given the input redshifts myz.

I suspect the bug (or at least a bug) is back in fitz.py fitz L221-L222:

            xmin = template.minwave*(1+zmin)
            ximax = template.maxwave*(1+zmin)

Note ximax instead of xmax: xmax isn't getting updated for the new zmin, and it then gets passed into rebin_template, which prevents that function from updating xmax internally, so it proceeds with the wrong xmax, which could cause the indexing error.
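A worked example with made-up numbers may help (the template range and redshifts below are hypothetical):

# Hypothetical numbers illustrating the effect of the ximax typo above.
maxwave = 11000.0                 # assumed template rest-frame maximum
edge_max = 9282.0                 # last observed bin edge (WEAVE red arm)

zmin = -0.3                       # badly out-of-range minimum from minfit
true_xmax = maxwave * (1 + zmin)  # 7700.0: real coverage ends here

xmax = 11000.0                    # stale value from the coarse scan; the
                                  # update went into `ximax` and was lost

# The range check compares edge_max against the stale xmax and passes,
# even though the redshifted template only reaches true_xmax:
assert not (edge_max > xmax)      # no ValueError raised (wrongly)
assert edge_max > true_xmax       # the actual out-of-coverage condition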

Could you try fixing that L222 ximax typo in your local copy and rerunning to see if that fixes the problem? If not, we'll need to dig deeper to see why the existing redshifted-x checks aren't working (that might already be clear to you, but it isn't yet clear to me). Thanks.

amolaeinezhad commented 12 months ago

Thank you for providing a solution! I've implemented the fix by correcting the typo (ximax → xmax) in fitz.py at lines 221-222, and indeed, the segmentation fault issue seems to be resolved.

sbailey commented 11 months ago

Fixed in PR #258. Thanks for the report and the patience debugging the root cause.