LSSTDESC / rail_sklearn

RAIL algorithms that depend on scikit-learn.
MIT License
1 stars 0 forks source link

Add option to run KNearNeigh with just colors rather than 1 mag + colors #20

Open sschmidt23 opened 2 months ago

sschmidt23 commented 2 months ago

During Raphael Shirley's presentation on rail_lephare, Jeff mentioned that with very disparate mag distributions for training and test, the inclusion of a magnitude in the K nearest neighbor distance calculation could actually make things worse than just using colors without the magnitude. Adding an option to allow running with just colors would let people test this, should be very easy to implement.

Before submitting Please check the following:

raphaelshirley commented 1 month ago

Hi,

I am trying to run with this option set. I am getting a dimension error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[119], line 1
----> 1 knn_estimated = estimate_knn.estimate(test_data_handle)

File ~/Documents/github/lincc/rail_base/src/rail/estimation/estimator.py:97, in CatEstimator.estimate(self, input_data)
     73 """The main interface method for the photo-z estimation
     74 
     75 This will attach the input data (defined in ``inputs`` as "input") to this
   (...)
     94     Handle providing access to QP ensemble with output data
     95 """
     96 self.set_data("input", input_data)
---> 97 self.run()
     98 self.finalize()
     99 return self.get_handle("output")

File ~/Documents/github/lincc/rail_base/src/rail/estimation/estimator.py:110, in CatEstimator.run(self)
    108 for s, e, test_data in iterator:
    109     print(f"Process {self.rank} running estimator on chunk {s} - {e}")
--> 110     self._process_chunk(s, e, test_data, first)
    111     first = False
    112     # Running garbage collection manually seems to be needed
    113     # to avoid memory growth for some estimators

File ~/Documents/github/lincc/rail_sklearn/src/rail/estimation/algos/k_nearneigh.py:195, in KNearNeighEstimator._process_chunk(self, start, end, data, first)
    192         knn_df.loc[np.isclose(knn_df[col], self.config.nondetect_val), col] = np.float32(self.config.mag_limits[col])
    194 testcolordata = _computecolordata(knn_df, self.config.ref_band, self.config.bands, self.config.only_colors)
--> 195 dists, idxs = self.kdtree.query(testcolordata, k=self.numneigh)
    196 dists += TEENY
    197 test_ens = _makepdf(dists, idxs, self.trainszs, self.sigma)

File sklearn/neighbors/_binary_tree.pxi:1183, in sklearn.neighbors._kd_tree.BinaryTree64.query()

ValueError: query data dimension must match training data dimension

Do I need to change the input data handle after setting only_colors=True in the informer?

sschmidt23 commented 1 month ago

I think I set things up so that you had to set the config parameter only_colors=True in both the inform and estimate stage, is it possible that you only set it in one of the two, and thus the model expects a different number of parameters? I could store the boolean for only_colors in the model to ensure that a user can't accidentally use differing values for inform and estimate, if that is less confusing.

If it's not that, then this error could also maybe be caused if you are using a non-default set of names for the magnitudes via bands and do not also supply the dictionary of mag_limits for those bands.

raphaelshirley commented 1 month ago

Ah yes thanks, that fixed it setting it True in the estimate stage too.

sschmidt23 commented 1 month ago

It probably makes more sense to store the only_colors value in the model, since a user shouldn't really be able to set it in the estimate stage if the model was trained for a specific case. I'll make a PR to do just that and merge in so that you only specify in the inform.