simplify instantiation of models

SNEWS2 / snewpy

A Python package for working with supernova neutrinos

https://snewpy.readthedocs.io

BSD 3-Clause "New" or "Revised" License

27 stars 22 forks source link

simplify instantiation of models #177

Closed sybenzvi closed 3 months ago

sybenzvi commented 2 years ago

A nice feature for the next snewpy release would be a simplified interface to instantiate models. The use case is programmatically instantiating models using just the model name string plus class-specific configuration parameters passed as kwargs. While the existing SupernovaModel interface simplifies use of code, the user has to import one of N child classes and instantiate a model object with knowledge of the name and location of its spectrum / luminosity files, as well as a heterogeneous set of configuration parameters. A simpler interface could look more like this:

snewpy.models.model('ModelName', arg1, arg2, ...)

Major changes may not be necessary, just tweaks of what already exists. Here is what is needed:

[x] A model registry.
[ ] A factory method that can instantiate any model in the registry.
[x] A backend to manage data files specific to each model, making downloads and disk access transparent to the user.
[x] Improved documentation of configuration parameters in each child of SupernovaModel.

JostMigenda commented 2 years ago

See the previous discussion on this topic.

Note that most existing SupernovaModel subclasses can already be initialised with just the path to a model file as an input argument; so it would be straightforward to write a convenience function that just takes a file path, looks for a directory named Name_YYYY in the path (e.g. Bollig_2016) and then imports the appropriate subclass and returns it. So instead of

from snewpy.models.ccsn import Bollig_2016
b = Bollig_2016('/path/to/SNEWPY_models/Bollig_2016/s11.2')

we could do

from snewpy.models import init_model
b = init_model('/path/to/SNEWPY_models/Bollig_2016/s11.2')

I think that would already cover much of the simplification you’re aiming for?

As noted in the previous discussion, initialising a model only from the file path is an important feature and we should make sure that we don’t make this harder if we make any changes to model initialisation.

sybenzvi commented 2 years ago

@JostMigenda, I don't want the perfect to be the enemy of the good, but I do want to be clear that I'm not proposing to eliminate existing functionality. The proposal is to leave the current classes as they are, and add a registry that does not require the user to know about the data files at all. My reasoning is our present use of data files in constructors is akin to exposing the internal state of the model classes. It's useful for testing new models and should be kept, but IMHO this is not a good design for casual users.

Thus, the registry would know about the valid file inputs and locations for all "official" models in a given snewpy release. Its job is to instantiate model objects and download model data the first time they're needed via calls to get_models. It's all transparent to the end-user. Since the existing model subclasses stay the same, the development path we've established for making and using new models remains unchanged. No one is ever forced to use the registry -- it's syntactic sugar.

I don't think these are mutually exclusive options. As I said, I don't want the perfect to be the enemy of the good. Your solution should work and I'd accept it for the sake of doing what I need to do, but ultimately I think we can package this better.

ps - We can keep discussing here or move to the discussion thread, as you prefer.

JostMigenda commented 2 years ago

As long as it’s possible to initialise all classes with a simple file path (so all the use cases where that’s required remain possible) I’m happy. I disagree with comparing it to internal state of a class (the reason users shouldn’t rely on internals of a class is because we might change those internals at any time—while keeping the public API of that class stable—whereas we shouldn’t change arbitrary file system paths on the user’s computer, so it’s fine if the user relies on those), but if there are users who genuinely prefer such a registry and the code to support that is manageable, then okay.

If we want to unify the user interface of SNEWPY even further, we could even write the function in such a way that it takes either a file path (as above) or keywords, such as

b = init_model(model="Bollig_2016", mass=11.2, eos="LS220")

and then checks the registry under the hood. (And I guess the registry would then construct a file path from the arguments and initialise the appropriate subclass with that file path?)

ps - We can keep discussing here or move to the discussion thread, as you prefer.

I think here is better for visibility; just wanted to make sure the previous discussion isn’t forgotten.

sgriswol commented 2 years ago

Would you be opposed to adding a method to the model classes that takes the model parameters as arguments, constructs the path to the appropriate model file, then return an instantiated model object for the file with the chosen parameters? It wouldn't require changing the existing public interface, though path construction and error checking would need to be tailored to each model.

An example of the definition (assumes some fixed path to where models are stored ModelPath, which could be implemented similar to _model_urls)

class Nakazato_2013:
    ...
    @classmethod
    def from_param(cls, progenitor_mass, revival_time, metallicity, EOS):
        filename = "nakazato-{eos:s}-z{metal:3.2f}-t_rev{t_rev:3d}ms-s{mass:3.1f}.fits".format(
            eos=EOS, metal=metallicity, t_rev=int(revival_time.value), mass=progenitor_mass.value
        )
        # Produces /abs/path/to/registry/Nakazato_2013/...
        path = os.path.join(ModelPath, cls.__name__, filename)
        if not os.path.exists(path):
            raise FileNotFoundError
        # Other error checking...
        return cls(path)

Usage would then be something to the tune of

from snewpy.models import Nakazato_2013
import astropy.units as u  # This was included in lieu of specifying expected units via docstring

param = {
    'progenitor_mass' : 13 * u.msun, 
    'revival_time' : 300 * u.ms, 
    'metallicity' : 0.004, 
    'EOS' : 'shen'
}
model = Nakazato_2013.from_param(**param)

If this scheme is acceptable, I can take a crack at implementing these methods for the current models.

JostMigenda commented 2 years ago

A note on the general approach: In the previous discussion, @Sheshuk and I talked for a while about how to autocomplete these arguments. Personally, I think that will be extremely important to make this whole registry feature work. (Because right now, the easiest way to figure out which parameter values are allowed is to look at the file names of the available model files; that’s already not a great experience and would get much worse if we hide downloads and file paths from the user.) Therefore, I think it might be good to start by implementing this for just 1–2 models[*] and implement the init_model() front end (if we want to use something like that?) and make sure that argument autocompletion works well for that? Otherwise, if we notice that another design would work better, we may have wasted significant time implementing this for all models first.

[*] Nakazato would be a great first model to try, because it has the peculiarity that not all combinations of arguments are valid. Maybe Warren, too, to verify that everything scales well for hundreds of model files.

Other than those questions around autocomplete, your suggestion looks like a good approach to me. I wonder whether it would be nicer to do something like

class Nakazato_2013:
    def __init__(filepath=None, *, progenitor_mass=None, revival_time=None, metallicity=None, EOS=None):
        if not filepath:
            # generate filepath from keyword arguments, as in "from_param()"

        # continue with existing __init__ code here

# The class could then be called either in the current way
Nakazato_2013("/path/to/file")

# or in the new way:
Nakazato_2013('progenitor_mass' : 13 * u.msun, 
              'revival_time' : 300 * u.ms, 
              'metallicity' : 0.004, 
              'EOS' : 'shen'
)

so people can use the standard initializer either way and don’t have to remember to use that classmethod; but I can also see upsides of keeping it separate.

Finally, a few very minor nitpicks to save you a little debugging time later:

instead of revival_time.value, please use revival_time.to('u.ms').value so that it still works if users enter revival_time=0.3*u.s
{metal:3.2f}—that format specifier won’t work here, since the possible values for metallicity are 0.02 or 0.004
Personally, I find f-strings like f"nakazato-{EOS}" much easier to read than "nakazato-{eos:s}".format(eos=EOS)

sgriswol commented 2 years ago

@JostMigenda I've taken my first attempt at it, you can see it at commit 38f34d. I'd be interested in hearing your thoughts.

Loading the Nakazato model from parameters proceeds roughly as follows:

User provides model parameters to the model object as Nakazato_2013(**params) or via kwargs https://github.com/SNEWS2/snewpy/blob/38f34d705f390b85829590cee21bf0161cdec3ba/python/snewpy/models/ccsn.py#L92-L95
The given parameters are checked against the allowed parameter values stored in the registry (some comments below)
If this check completes without issue, the filename/path is constructed and tested for existence.
- Currently, bad combinations of parameters are evaluated via file existence. I had started working on a set of checks that could be used to issue more specific error messages, but it was starting to get cluttered. I left those in as comments for now.
If the file is found, the metadata and tables are loaded the same way as before

As for the registry, I have added a new file _model_registry.py which contains the registry and some helper functions. I'll comment that I marked things here with a leading underscore somewhat arbitrarily (so apologies if that's annoying -- easily fixed though!). The main features in this file are the _model_registry dictionary and the ModelRegistry class, which is intended to serve as an interface. Within the class, there is a method check_param_values which checks a set of parameters for a given model against a set of allowed values and raises errors with specific messages if there are any. I set this up in order to check models with arbitrary parameters, with the intent to follow up with model-specific checks on combinations of parameters.

I wanted to stop for feedback before I got too far. If this is worth continuing, I have a few questions of a nitpicky nature

Should model specific checks on combinations of parameters be defined in the model __init__or in the registry?
- My thought is it would be better to do in the registry
Should the check_param_values have any return value, and should it throw errors
- I have another idea for a scheme where this function returns a boolean indicating "yes, valid parameters" or "no, invalid parameters and here's why..."
Do you have any preferences about how data should be stored in the registry? About where models should be stored?

JostMigenda commented 2 years ago

Hi @sgriswol, thanks for the code!

A couple of comments, in no particular order:

On the _model_registry dictionary:
- The directory for a given class is always os.path.join(data_folder, self.__class__.__name__), right? And since we can construct it automatically (as you do in https://github.com/SNEWS2/snewpy/blob/38f34d705f390b85829590cee21bf0161cdec3ba/python/snewpy/models/ccsn.py#L135), we shouldn’t include it in the dictionary to avoid duplication.
- I think it would be better to include the param dictionary as a class variable within the Nakazato_2013 class. Spreading information about one class across multiple locations in the code makes things harder to maintain and increases the risk of forgetting to update one of the locations.
ModelRegistry: Does this need to be a class? Right now, it looks like you’re only using this to group a few related functions together; but that’s already accomplished by keeping them in a separate module. This may change later as we develop the registry, but for now, I would drop the class and replace from snewpy._model_registry import ModelRegistry as registry with import snewpy._model_registry as registry.
check_param_values(): This is a useful helper function! You can simplify it a bit since the definition of __init__() enforces that physical paramaters are called as keyword-only arguments; so user_param == **kwargs.

Should model specific checks on combinations of parameters be defined in the model init or in the registry?

If there is code that all these model-specific checks have in common, it probably would be good to have that as a separate function alongside check_param_values(). But for the model-specific information, I think it’s better to keep that as part of the model class, as explained above.

Should the check_param_values have any return value, and should it throw errors

I think the way it works now is good.

Do you have any preferences about how data should be stored in the registry? About where models should be stored?

Regarding storing the metadata (allowed params, combinations thereof): Haven’t thought about this too much yet. A dictionary like _model_registry["Nakazato_2013"]["param"] may be fine, but we should reconsider it as we’re working on the autocompletion. Plus, there is also a suggestion in this discussion to change the downloader, which would require some mapping from model parameters to URLs. That may also influence how and where we store the metadata. (Yes, I’m afraid this issue has a lot of hidden complexity … 😫)

Regarding storage on the filesystem: ~/.astropy/cache/snewpy/models/ (as set in 2ce6f74cd0b42a16121ac2933ac47da2e1bf4bfe) seems reasonable; though to be on the safe side, we should use astropy.config.paths.get_cache_dir() instead of hard-coding ~/.astropy/cache/.

sgriswol commented 2 years ago

@JostMigenda I have another round of changes at the head of model_registry_jan22.

The short version

data_folder has been renamed to model_dir and is now defined using astropy.config.paths.get_cache_dir()
- This is currently a bit awkward if user doesn't want to use astropy cache as their download folder
Model parameters are now stored as a member of the model class
Model specific parameter combination checks are performed when at model class init
- This largely informs construction of the model data filename, perhaps this would be useful for constructing urls too?
The registry dictionary has been removed.
- Everything I had in the registry was moved within the model class, so perhaps some refactor from registry would be appropriate.
Added init_modelto registry.py that makes a quick attempt at loading a model using a string of the model's name. If the model cannot be found, one attempt at downloading it is made. (Feature request from @sybenzvi)

JostMigenda commented 2 years ago

This looks good to me thus far!

This largely informs construction of the model data filename, perhaps this would be useful for constructing urls too?

Right now, download URLs are based on the model name and the file name in the repo (e.g. https://github.com/SNEWS2/snewpy/raw/v1.2/models/Nakazato_2013/nakazato-shen-z0.004-t_rev100ms-s13.0.fits), so that could work. If we move the model files out of this repo (see discussion linked above), the URLs would change, though. (A general note: If it’s possible to continue with the existing model downloader, I’d prefer to keep the model instantiation work here separate from any downloader changes; simply because smaller, focussed PRs are a lot easier to handle.)

sgriswol commented 2 years ago

On the topic of a downloader, I agree about keeping that work separate from this. Fortunately, this does use the existing downloader (snewpy.get_models), and the only change made to the download procedure is the user's default model installation directory.

I'll get start on adding init-by-parameter functionality to other model classes.

JostMigenda commented 2 years ago

Before you spend too much time on the other model classes—we still don’t know how to move forward on the topic of model discovery (i.e. autocompletion of arguments, or similar), right? (@Sheshuk, you’ve looked into different ways to do auto-completion a few months ago. Do you have an idea how best to proceed?) As noted earlier, I think it may be better to look into that first and try it with 1–2 models; just in case that requires changes to the current design to work.

Sheshuk commented 2 years ago

Hi, yes, I did several tests with the ipython parser (as described https://github.com/SNEWS2/snewpy/discussions/139#discussioncomment-1671811) which worked for me in Jupyter: it makes custom autocompletions, and can fall back to default ones if we're not in the middle of the model instantiation string. This parser can be loaded in the snewpy/models/init.py if we detect that ipython is used.

Unfortunately I'm away from the computer now, and won't be able to work on this for the next week or so. But I can get to it afterwards

sybenzvi commented 2 years ago

Hi @JostMigenda, I'd like to decouple the (eventual) PR for @sgriswol's changes from the autocompletion update, largely because I suspect the two sets of changes are orthogonal. How about we make the PR based on the existing model tested, then add more models later and enable the autocompletion feature? I'm going to set up the PR now, and we can iterate as needed before the merger.

JostMigenda commented 2 years ago

I think there are two aspects that might mean these changes are not completely orthogonal: one regarding code, one regarding user experience.

Regarding code: I suspect you’re right that this is mostly orthogonal. The one open question I see is around the Nakazato_2013 model, where some combinations of parameters are invalid (e.g. the togashi EOS can only be used for the 30 Msol progenitor). Should we duplicate the code that does those additional checks (in the Nakazato_2013 class itself and in the autocompletion code), or can we modify it to handle that case better? (Though that alone would probably still be a fairly minor overlap, and not reason enough to delay this.)

Regarding UX: This is my main issue. Right now, the easiest way to figure out which parameter values are allowed is to look at the file names of the available model files. If we place model files into the astropy cache directory and thus hide their file name from the user, what ways do they have left to identify what the allowed parameter values are? Without some way to solve this problem, I don’t think this new model instantiation is usable for many SNEWPY users. Of course, that solution doesn’t necessarily have to be autocomplete; maybe in the interim it would be sufficient to add allowed parameter values to the documentation? (But can we do that without duplicating information? Otherwise, over time the documentation and code/model files will become out of sync and we create a lot more confusion …)

JostMigenda commented 2 years ago

Unrelated to the autocomplete question: Right now, when downloading the files for a specific model family, snewpy.get_models() automatically downloads the README file and notebook with sample code for that model to the same directory, so the user immediately has an obvious place to start exploring a model. If the download directory is hidden to the user by default, how do we make sure the user still has access to these files?

sybenzvi commented 2 years ago

I had a long comment about how we put it into the astropy cache but you know that. Editing this reply, I suggest we explain in the README where these models go by default + document the option to download files to a user-specified location. I don't want users to not know where the files go... the point of the astropy cache is just to eliminate the need to remember absolute paths for those who don't want to.

Regarding your comment about allowed parameters, there is a param member of Nakazato_2013 that gives the list of parameters. But you want to have something that prevents users from trying a bunch of different parameters that may be incompatible? I don't have a good programmatic solution for that off the top of my head, nor am I sure this is our problem to solve. If someone specifies incompatible parameters, isn't it enough to emit a RuntimeError or a ValueError when they try to instantiate the model?

Sheshuk commented 2 years ago

Would it make sense to put the models examples in the documentation, instead of bringing along the notebooks for each model? To me as a user it seems more natural to search for the example there. We can even have automatic generation of the images, making sure that our description is up to date.

ср, 9 февр. 2022 г., 06:05 Segev BenZvi @.***>:

What we changed for the registry is to have models downloaded a snewpy/models subfolder of astropy.config.paths.get_cache_dir(); see https://github.com/SNEWS2/snewpy/blob/model_registry_jan22/python/snewpy/__init__.py#L17. On most systems the base path resolves to $HOME/.astropy.cache. The point is that the default models are available anywhere on the system without requiring the user to know a specific absolute path or specify any environment variables to point to the SNEWPY source.

We can update the README to indicate the new location + add instructions on how to download models to a user-specified directory.

— Reply to this email directly, view it on GitHub https://github.com/SNEWS2/snewpy/issues/177#issuecomment-1033108044, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPKUXMKLAOZBG5YGUHBPCLU2GHTNANCNFSM5MNBTGEA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

JostMigenda commented 2 years ago

Regarding your comment about allowed parameters, there is a param member of Nakazato_2013 that gives the list of parameters.

Hm … does sphinx have a way to include the value of a class variable like that in the documentation? If so, that would be a way to solve this. (As noted above, I’d like to avoid copying that info into a docstring to avoid duplication.)

But you want to have something that prevents users from trying a bunch of different parameters that may be incompatible?

Ideally, yes. If we tell users about all possible values for each parameter but not which combinations are actually valid, we’re effectively inviting them to play a game of Battleship; which could be frustrating. On the other hand, maybe I’m putting too much emphasis on this? If there’s no clean and easy way to solve this, it’s probably fine to just note in the docstring that not all combinations are valid. (In an ideal world, of course, anyone using the Nakazato model for physics analyses would read the corresponding papers and learn the valid parameter combinations from those.)

Would it make sense to put the models examples in the documentation, instead of bringing along the notebooks for each model? To me as a user it seems more natural to search for the example there.

That’s a good point. And it looks like there are plugins that let Sphinx include notebooks directly—I didn’t know about this, but this looks very cool! (I’ll open a separate issue on this, since it is only tangentially related to the model instantiation.)

sgriswol commented 2 years ago

Hi @JostMigenda I was pulled to work on other projects, so I've been content to let this sit on the back burner while waiting for the autocomplete issue to progress. However, those projects are beginning to pull me back towards SNEWPY and this question of model instantiation.

Regarding your comment about UX/battleship, I agree. For now, perhaps the easiest solution would be to provide a list of allowed combinations as a class member. This would be rather inelegant, but it could help to reduce the amount of guess work.

An alternative would be to dynamically generate the list of combinations. First using itertools.product on the model.param member then testing it against some logic for combining parameters as in https://github.com/SNEWS2/snewpy/blob/d1dfec61831006f5dc9143043a84ba552cb1f27f/python/snewpy/models/ccsn.py#L116-L130 Perhaps this could be stored as a class member during __init__ and generated dynamically otherwise. On the UX side, the model combinations would then be accessible from the model object.

I'm not able to dive into this today, but I could have something ready for you to look at by Monday. Do you think this would be an acceptable solution?

JostMigenda commented 2 years ago

For those models, where all combinations of valid parameters are valid, do we need a list of combinations at all? Or is model.param sufficient? For Nakazato, yes, let’s try if that works for dealing with valid/invalid combinations. And if the checks of other parameters get too complicated, maybe we just update the error message to explicitly point people towards the respective papers to see what combinations are valid; that may not be as nice for users, but at least it’s actionable (which makes it better than Battleships).

sgriswol commented 2 years ago

I've added some functionality to the Nakazato models on the combination side of things: a user may now request a list of all valid model parameter combinations via the get_param_combinations() class method. The user may also filter this list using valid parameter values IE Nakazato_2013.get_param_combinations(progenitor_mass=30*u.Msun), and may check if a particular combination c is valid using Nakazato_2013.isvalid_param_combo(c). As an aside, this may be useful when addressing the question of autocompletion, as adding arguments to get_param_combinations can help filter down a user's selection of models.

>>> Nakazato_2013.get_param_combinations(progenitor_mass=30*u.Msun)
>>> Nakazato_2013.get_param_combinations(progenitor_mass=30*u.Msun, eos='shen')  # Params must be kwargs, so order doesn't matter
>>> Nakazato_2013.get_param_combinations(progenitor_mass=30*u.Msun, eos='shen', revival_time=0*u.ms)

The models.registry module has been expanded with functions to retrieve the list of valid combinations in a model-independent way (assuming the presence of the above class members/methods), and some of the error checks have been streamlined.

JostMigenda commented 3 months ago

Issue #225 has a comment on that initialiser; the rest of this issue is done, so we can close this.