Open hangqianjun opened 5 months ago
Ok, I think we just need to add checks that the bands specified are all present in the data, and check that the ref_band and nondetect_val bands dict keys are also present (I bet it's the latter actually causing the error). That is simple to add, I can do it. KNN is actually in rail_sklearn, though, so I'll probably create an issue there and close this one.
Hi @sschmidt23! Thanks for addressing this. I was wondering if this is a general issue with estimators with a similar structure (i.e. they need to check if the input columns etc. match up with the model). Do you think there could be some more general function to check those things? Or is it easier to implement in a per-estimator base?
I agree with @hangqianjun that it would be ideal to add this into the estimator base class.
I've also thought about this, and I think that there are sometimes some slight differences in how the inputs are handled, so implementation may be slightly awkward. But, maybe not. I'm going to add a fix for KNearNeigh and maybe we can discuss further on the Slack channel about any difficulties in implementation at the estimation level (e.g. maybe a function call that we add to all informers and estimators that has a default that can be overridden for subclasses that do something non-standard with the data).
The tricking thing about doing this with a function call is that you don't actually know what is in the file until you open it... When you are running in a parallel you need to be careful about how to handle this. I could imagine only calling the function if you are in the first chunk.
I was just about to bring that up here: I added checks in rail_sklearn to the informer stage for KNearNeighInformer, but did not add the equivalent checks to the estimator stage because I was wondering what the best way to do that was given the process_chunk call there, I wasn't sure of the best way to do it without checking for every chunk.
@eacharles Agree with calling it in the first chunk only.
So looking at how we do things now, in the informer parent class we open the input data files in the run
method, and because each estimator class has its own custom run
that overrides the base class run
, we would need to move the data read in and the checks that we want to add so that they are called before run
. This seems pretty straightforward for inform, I'm less sure how to do it for estimate generically (other than opening the input file twice, once just for the checks and then leaving the current estimate
as-is). If anyone has more clever and efficient ideas for how to modify estimate posting them here for discussion would be great.
And it does seem like this would touch pretty much all of the estimators and informers, so someone should probably put together one of the design docs and present it to everyone and such so that we're keeping everyone in the loop and making sure that we gather input from everyone.
Drafting a design doc here: link
I encountered the following erorr message when using KNN estimator:
The error is actually due to the fact that the column names in the informer is different from the test data. However this error message is not helpful in identifying the issue. It would be good to add some checks and give more informative error messages.
@aimalz's student has encountered similar issues.