Estimator error message unclear

hangqianjun commented 5 months ago

I encountered the following erorr message when using KNN estimator:

File /opt/desc/py/lib/python3.10/site-packages/rail/estimation/algos/k_nearneigh.py:189, in KNearNeighEstimator._process_chunk(self, start, end, data, first)
    187         knn_df.loc[np.isnan(knn_df[col]), col] = self.config.mag_limits[col]
    188     else:
--> 189         knn_df.loc[np.isclose(knn_df[col], self.config.nondetect_val), col] = self.config.mag_limits[col]
    191 testcolordata = _computecolordata(knn_df, self.config.ref_band, self.config.bands)
    192 dists, idxs = self.kdtree.query(testcolordata, k=self.numneigh)

File /opt/desc/py/lib/python3.10/site-packages/numpy/core/numeric.py:2348, in isclose(a, b, rtol, atol, equal_nan)
   2345     dt = multiarray.result_type(y, 1.)
   2346     y = asanyarray(y, dtype=dt)
-> 2348 xfin = isfinite(x)
   2349 yfin = isfinite(y)
   2350 if all(xfin) and all(yfin):

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error is actually due to the fact that the column names in the informer is different from the test data. However this error message is not helpful in identifying the issue. It would be good to add some checks and give more informative error messages.

@aimalz's student has encountered similar issues.

sschmidt23 commented 5 months ago

Ok, I think we just need to add checks that the bands specified are all present in the data, and check that the ref_band and nondetect_val bands dict keys are also present (I bet it's the latter actually causing the error). That is simple to add, I can do it. KNN is actually in rail_sklearn, though, so I'll probably create an issue there and close this one.

hangqianjun commented 5 months ago

Hi @sschmidt23! Thanks for addressing this. I was wondering if this is a general issue with estimators with a similar structure (i.e. they need to check if the input columns etc. match up with the model). Do you think there could be some more general function to check those things? Or is it easier to implement in a per-estimator base?

aimalz commented 5 months ago

I agree with @hangqianjun that it would be ideal to add this into the estimator base class.

sschmidt23 commented 5 months ago

I've also thought about this, and I think that there are sometimes some slight differences in how the inputs are handled, so implementation may be slightly awkward. But, maybe not. I'm going to add a fix for KNearNeigh and maybe we can discuss further on the Slack channel about any difficulties in implementation at the estimation level (e.g. maybe a function call that we add to all informers and estimators that has a default that can be overridden for subclasses that do something non-standard with the data).

eacharles commented 5 months ago

The tricking thing about doing this with a function call is that you don't actually know what is in the file until you open it... When you are running in a parallel you need to be careful about how to handle this. I could imagine only calling the function if you are in the first chunk.

sschmidt23 commented 5 months ago

I was just about to bring that up here: I added checks in rail_sklearn to the informer stage for KNearNeighInformer, but did not add the equivalent checks to the estimator stage because I was wondering what the best way to do that was given the process_chunk call there, I wasn't sure of the best way to do it without checking for every chunk.

hangqianjun commented 5 months ago

@eacharles Agree with calling it in the first chunk only.

sschmidt23 commented 5 months ago

So looking at how we do things now, in the informer parent class we open the input data files in the run method, and because each estimator class has its own custom run that overrides the base class run, we would need to move the data read in and the checks that we want to add so that they are called before run. This seems pretty straightforward for inform, I'm less sure how to do it for estimate generically (other than opening the input file twice, once just for the checks and then leaving the current estimate as-is). If anyone has more clever and efficient ideas for how to modify estimate posting them here for discussion would be great.

And it does seem like this would touch pretty much all of the estimators and informers, so someone should probably put together one of the design docs and present it to everyone and such so that we're keeping everyone in the loop and making sure that we gather input from everyone.

hangqianjun commented 5 months ago

Drafting a design doc here: link

LSSTDESC / rail_base

Estimator error message unclear #120