LSSTDESC / rail_base

Base classes for RAIL
MIT License

Check input types in DataHandle classes #47

Closed: OliviaLynn closed this issue 10 months ago

OliviaLynn commented 1 year ago

We should add explicit type checks to inputs and throw exceptions if they are the wrong type, with an explanation of what’s wrong.

An example of this in GalSim: link.

This has come up recently with a relatively new user building a pipeline and receiving somewhat cryptic errors.

I remember goldenspike being broken last October because some package it used had switched to jax arrays over the previous numpy arrays (we had to add the type check here), so it looks like this may continue being a problem if we don't take the time to address it now.

rmandelb commented 1 year ago

Thanks for opening this issue, Olivia! Just to add to this (as I'm part of the project that inspired this issue):

In my experience, users outside of the development team may not always be aware of the right usage for given functions. Providing the wrong type for the inputs, or misunderstanding the type of outputs, is a really common failure mode. If you explicitly check and raise a useful error about what's wrong, they can quickly learn the right usage. If you do not do a type check, they often will get a cryptic error due to the code trying to do something that isn't allowed for their particular input somewhere within the function. While Olivia shared a specific example in GalSim, it was something we tried to do systematically, and it has been quite helpful for users.

I think the issue is relevant not just to tables_io (where it happens to have shown up) but also to the RAIL ecosystem, as DESC members will be using these codes for research projects meant to prepare for cosmological analyses using photo-z, and these checks will support those users' adoption and effective usage of the code.

eacharles commented 1 year ago

It would be good to clarify the scope of what you have in mind. Are you more concerned with the container type (i.e., jax vs. numpy) or the data type (i.e., float vs. double)? The first is relatively straightforward, modulo the fact that Python is duck-typed, so sometimes “type checking” is not actually a well-defined thing; the latter is much more involved.


rmandelb commented 1 year ago

Obviously I cannot speak for Olivia, but personally I was referring primarily to the case where the inputs are a particular class with methods that the function is going to try to use (like a pandas dataframe versus a qp.Ensemble versus a numpy array versus a list/tuple), rather than e.g. float vs. int. The manifestation of the issue is the cryptic error that you get when the function tries to use some method of the class it's expecting, and it's preventable through an explicit check of the type.
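
To make that failure mode concrete, here is a minimal sketch (the function and its check are hypothetical, not actual RAIL code):

    import pandas as pd

    def remap_columns(df, mapping):
        # Without the check below, passing e.g. a dict fails deep inside
        # pandas with "AttributeError: 'dict' object has no attribute
        # 'rename'", which is cryptic for a user who doesn't know the internals.
        if not isinstance(df, pd.DataFrame):
            raise TypeError(
                f"remap_columns expects a pandas.DataFrame, got {type(df).__name__}"
            )
        return df.rename(columns=mapping)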

eacharles commented 1 year ago

It’s a bit tricky, b/c in some cases we would very much like to be able to handle a pandas DataFrame OR a dictionary of numpy arrays. That is what I was referring to about duck-typing.

-e
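
One way to reconcile duck-typing with explicit checks is to accept each known-good container and reject everything else up front; a sketch (the helper name is hypothetical, not tables_io API):

    import numpy as np
    import pandas as pd

    def to_dataframe(data):
        # Accept either of the two containers mentioned above and normalize
        # to one of them; anything else gets an explicit, readable error.
        if isinstance(data, pd.DataFrame):
            return data
        if isinstance(data, dict) and all(isinstance(v, np.ndarray) for v in data.values()):
            return pd.DataFrame(data)
        raise TypeError(
            "Expected a pandas.DataFrame or a dict of numpy arrays, "
            f"got {type(data).__name__}"
        )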


eacharles commented 1 year ago

In short, I think this is probably something we should think about and discuss carefully next week.


rmandelb commented 1 year ago

Yes, completely understood that in some cases multiple types would work (we've dealt with that situation in GalSim before as well). I won't be at the meeting next week but I'm happy to answer further questions here asynchronously.

eacharles commented 1 year ago

Do we have an example of these cryptic errors that people are receiving, and the data files that generated them? I think that could help the discussion.

Tables_io is already trying to strike a balance between checking that things really are tables, and being able to support things that effectively act as tables, so it would be really helpful to have additional examples where this isn't working.

eacharles commented 1 year ago

The point about some downstream code trying to use some function that doesn't exist is particularly tricky here.

The idea of tables_io is that it makes it easy to read things that are "table-like" in the sense that they provide certain functionality, like having len() and __getitem__ do predictable things. We haven't been explicit about what "table-like" means, and we could clarify that.

On the other hand, some "table-like" classes provide additional functionality like join() or whatever, and in that case the code is going to crash when that function is called on a different type of table that doesn't support that function.

I'm not sure the latter case is actually something we should be trying to catch in tables_io, as nice as it would be to be able to do that. We should discuss this more.
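
If we did want to pin down what "table-like" means, one option (a sketch, assuming the minimal contract really is just len() and __getitem__) is a runtime-checkable protocol:

    from typing import Protocol, runtime_checkable

    @runtime_checkable
    class TableLike(Protocol):
        # Minimal "table-like" contract: len() and item access must exist.
        # Note isinstance() on a runtime_checkable Protocol only verifies
        # that the methods are present, so richer methods like join()
        # still can't be caught this way.
        def __len__(self) -> int: ...
        def __getitem__(self, key): ...

    def assert_table_like(obj):
        if not isinstance(obj, TableLike):
            raise TypeError(
                f"Expected a table-like object (supporting len() and []), "
                f"got {type(obj).__name__}"
            )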

eacharles commented 1 year ago

[Screenshot: Screen Shot 2023-07-25 at 10 58 39 AM]

eacharles commented 1 year ago

That's the screenshot Olivia sent me relating to this issue.
If I understand correctly, the issue is that datafile, which looks to be some sort of DataHandle wrapping a qp.Ensemble object, got passed to a ColumnMapper stage, which expects a catalog-like table. Without the rest of the notebook I'm not sure why this happened.

It doesn't look like tables_io is involved in this.

The docstring for ColumnMapper says:

    Utility stage that remaps the names of columns.

    Notes
    -----
    1. This operates on pandas dataframes in parquet files.

    2. In short, this does:
    `output_data = input_data.rename(columns=self.config.columns, inplace=self.config.inplace)`

That looks pretty explicit to me. Also, the inputs and outputs for this class are PqHandle objects, so I think that in pipeline usage, trying to read the wrong type of file would trigger a pretty clear error about file types from tables_io.

However, the interactive routine, __call__(self, data), doesn't do explicit type-checking (though it could, since it clearly expects a pandas.DataFrame object), and has an overly vague docstring that says it takes table-like data, when actually it only takes DataFrames. Also, the RailStage.set_data() method could potentially do some more type checking.
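
As a sketch of what the interactive check could look like (the class here is schematic, with SimpleNamespace standing in for the real stage config; the rename line is the one quoted from the docstring above):

    from types import SimpleNamespace

    import pandas as pd

    class ColumnMapper:  # schematic stand-in, not the real rail stage
        def __init__(self, columns, inplace=False):
            self.config = SimpleNamespace(columns=columns, inplace=inplace)

        def __call__(self, data):
            # Fail fast with a readable message instead of a cryptic
            # AttributeError from somewhere inside the stage.
            if not isinstance(data, pd.DataFrame):
                raise TypeError(
                    f"ColumnMapper expects a pandas.DataFrame, got {type(data).__name__}"
                )
            # The remap itself, per the docstring quoted above:
            return data.rename(columns=self.config.columns, inplace=self.config.inplace)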

Anyway, if this was the source of the somewhat cryptic error messages that triggered this issue, I'm pretty sure the place we want to fix is ColumnMapper in rail_base, and we might think about having the DataHandle classes do some checking when they are passed data via the set_data method.

eacharles commented 1 year ago

Can we confirm that this issue relates to the screenshot above? If so, I think we should close it, and instead consider putting checks in the DataHandle classes in rail.

aimalz commented 1 year ago

I took the liberty of migrating this issue because we came to the above conclusion at the retreat and just hadn't followed up on it yet. So, this issue can be closed when the requested checks are implemented in the various DataHandle classes.

drewoldag commented 10 months ago

So, in the interest of being pretty clear: what checks are being discussed here?

Given the following class hierarchy in src/rail/core/data.py:

I'm assuming that we want to put in new __init__ methods for the Table, QP, and Model classes that verify that the input is either "table-like", a qp Ensemble, or a model respectively, and return a reasonable error message if not.

For QPHandle it would be something like:

    def __init__(self, tag, data=None, path=None, creator=None):
        # Raise rather than just log, matching the issue's request for
        # explicit exceptions; data=None is allowed since it is the default.
        if data is not None and not isinstance(data, qp.Ensemble):
            raise TypeError(f"Expected a qp.Ensemble object, but found {type(data)}")
        super().__init__(tag, data=data, path=path, creator=creator)

I'm also assuming that we don't need type checking in the Hdf5, Fits, and Pq classes, because those are just particular containers for "table-like" data.

Finally, given @eacharles's comment on Aug 1, we would want to introduce a check for an input pandas dataframe in __call__, as well as update the docstring.

I fully recognize that I'm making a lot of assumptions here, so please let me know what all I am incorrect about.

eacharles commented 10 months ago

I think just putting in a method cls._validate_data() and calling it in two places in the base class is enough: __init__ and set_data(). Then we can have the sub-classes implement the _validate_data method.

The problem is that the only class where this is easy is QPHandle. In all the others it requires some understanding of what the Stage that uses the data expects, i.e., what kind of model, what kind of table, etc.
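
In code, that proposal might look roughly like this (schematic, not the actual rail.core.data classes):

    class DataHandle:  # schematic base class
        def __init__(self, tag, data=None, path=None, creator=None):
            if data is not None:
                self._validate_data(data)  # hook called at construction...
            self.tag = tag
            self.data = data
            self.path = path
            self.creator = creator

        def set_data(self, data):
            self._validate_data(data)  # ...and again when data is set later
            self.data = data

        @classmethod
        def _validate_data(cls, data):
            # The base class accepts anything; subclasses override this
            # with real checks.
            pass

    class QPHandle(DataHandle):
        @classmethod
        def _validate_data(cls, data):
            import qp  # the package providing qp.Ensemble
            if not isinstance(data, qp.Ensemble):
                raise TypeError(f"QPHandle expects a qp.Ensemble, got {type(data).__name__}")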

eacharles commented 10 months ago

To elaborate: you can pretty much hand any pickleable object to the ModelHandle, and that's ok, b/c all it does is pickle it and write it to disk or read it back from disk.

Similarly, you can hand anything that tables_io can write to the particular file format you requested to one of the TableHandles or its subclasses, and that's ok. You could ask tables_io whether it can figure out how to write a particular object to disk in a particular format, but that isn't going to be much help, b/c it can write, say, a numpy.recarray to a parquet file even if the stage using the data handle doesn't know how to handle it.
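
So about the strongest check ModelHandle could do is verify pickleability, which, as noted, doesn't tell the stage much; a sketch:

    import pickle

    class ModelHandle:  # schematic; only the validation hook is sketched
        @classmethod
        def _validate_data(cls, data):
            # The only real contract is that the object can be pickled,
            # so try exactly that and surface a readable error if it can't be.
            try:
                pickle.dumps(data)
            except (pickle.PicklingError, TypeError, AttributeError) as err:
                raise TypeError(f"ModelHandle expects a pickleable object: {err}") from err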

eacharles commented 10 months ago

Done with #54