NNPDF / reportengine

A framework for declarative data analysis
https://data.nnpdf.science/validphys-docs/guide.html
GNU General Public License v2.0

Lockfile redux #41

Open wilsonmr opened 3 years ago

wilsonmr commented 3 years ago

Attempting to simplify lockfiles now that we understand the use case better. This will clearly break any pre-existing lockfile usage and will require a corresponding PR in NNPDF/nnpdf to update any functions which use this machinery.

wilsonmr commented 3 years ago

huh, forgot there were tests for this. Well, we can use those to see if the template works. @Zaharid do you think this is approximately what you had in mind? I also updated the docs to try and be a bit clearer.

wilsonmr commented 3 years ago

hmm I was wondering if we could completely hide the <key>_recorded_spec_ key from the user:

import functools

# trim_token, ConfigError and log are the reportengine internals assumed
# to be in scope in configparser
def record_from_defaults(f):
    lockkey = trim_token(f.__name__) + "_recorded_spec_"

    @functools.wraps(f)
    def f_(self, *args, **kwargs):
        # either returns pre-existing <key>_recorded_spec_ or loaded defaults
        try:
            _, res = self.parse_from_(None, lockkey, write=False)
        except ConfigError as e:
            log.debug(e)
            res = f(self, *args, **kwargs)
        # save to lockfile if not present
        if lockkey not in self.lockfile:
            self.lockfile[lockkey] = res
        return res

    return f_

Then you just decorate a function which loads the defaults.
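
For instance, a hypothetical usage sketch (the Config subclass and the produce_default_grouping rule are placeholders, with record_from_defaults as above):

from reportengine.configparser import Config

class ExampleConfig(Config):
    @record_from_defaults
    def produce_default_grouping(self):
        # load the defaults however is convenient; only the returned
        # mapping matters to the decorator
        return {"standard_report": "experiment"}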

wilsonmr commented 3 years ago

The thing I dislike about that is that if the parse_from_ fails for an obscure reason then that gets hidden. Also I don't think ConfigError is necessarily the correct error.

Also, should we add an identity parse func to the class, called parse_<key>_recorded_spec_? When we didn't have this, sometimes an inner key of the defaults could mess with the user-defined parse functions.
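
E.g. a sketch of such an identity parser, written out for a hypothetical grouping key (how the decorator would register it automatically is left aside here):

def parse_grouping_recorded_spec_(self, value):
    # identity parse function: return the recorded value verbatim, so inner
    # keys of the defaults never reach user-defined parse functions
    return value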

wilsonmr commented 3 years ago

So upon initial testing this roughly does what we want, and I think the user-facing interface is fairly intuitive. I still don't like catching the error. I wonder if we could just explicitly take the input and check if the recorded spec key is in it? After all we don't want fancy parsing here, we just want to load whatever was dumped, so something like:

input_params = self._curr_input
if lockkey in input_params:
    # maybe log info that we're loading defaults here?
    return input_params[lockkey]
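
Putting it together, a minimal sketch of the wrapper with that explicit check (assuming self._curr_input exposes the raw runcard mapping as above; trim_token is stubbed so the snippet is self-contained):

import functools

def trim_token(name):
    # stand-in for reportengine's helper: strip the "produce_" prefix
    return name.split("_", 1)[1]

def record_from_defaults(f):
    lockkey = trim_token(f.__name__) + "_recorded_spec_"

    @functools.wraps(f)
    def f_(self, *args, **kwargs):
        input_params = self._curr_input
        if lockkey in input_params:
            # replay whatever was dumped on a previous run, verbatim
            res = input_params[lockkey]
        else:
            res = f(self, *args, **kwargs)
        # save to lockfile if not present
        if lockkey not in self.lockfile:
            self.lockfile[lockkey] = res
        return res

    return f_
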
Zaharid commented 3 years ago

Just to say, I'll need some time to look into this. But @siranipour's opinion would be useful here.

siranipour commented 3 years ago

Yep, looking into it literally rn. It's looking more intuitive at first glance; just playing with the VP rules implementation.

Zaharid commented 3 years ago

FWIW I am thinking that one aspect which is not obvious to me, and possibly the thing that motivated the previous design in the first place, is the question of how to handle defaults that would apply to a specific dataset (e.g. default cfactors). With this update we would only have the option of writing all the defaults for all the datasets to the lockfile. But maybe that's ok.

siranipour commented 3 years ago

Yeah, also one thing that's not clear to me is how we can create new defaults. E.g. say in NNPDF5.0 we have a new grouping like

standard_report: experiment
theory_covariance_report: nnpdf31_process

NNPDF5_report: NNPDF5_value

would we just have to update the default_groupings.yaml file? Or create a new one in conjunction with the original defaults file?

sorry to add entropy to the discussion btw

siranipour commented 3 years ago

Another thing is that it doesn't like dumping custom classes. As a minimal example:

from reportengine.compat import yaml

class Foo(yaml.YAMLObject):
    yaml_tag = u"Foo"
    def __init__(self, a):
        self.a = a

    def __repr__(self):
        return str({"a" : self.a})

if __name__ == '__main__':
    bar = Foo(5)

    with open("/tmp/foo.yaml", 'w') as stream:
        yaml.dump(bar, stream=stream, Dumper=yaml.RoundTripDumper)

gives

RepresenterError: cannot represent an object: {'a': 5}

wilsonmr commented 3 years ago

FWIW I am thinking that one aspect which is not obvious to me, and possibly the thing that motivated the previous design in the first place, is the question of how to handle defaults that would apply to a specific dataset (e.g. default cfactors). With this update we would only have the option of writing all the defaults for all the datasets to the lockfile. But maybe that's ok.

Yes, I think the original idea was that we'd only have to dump what we used. I wonder if there is a halfway house where the decorated production rule would look like

@record_from_defaults
def produce_key(self, spec_level_1, spec_level_2, ...):
    # load defaults
    ...
    return loaded_defaults[spec_level_1][spec_level_2]

We'd be locking in the exact structure of the defaults, but I think a nested mapping is not so crazy. Then the wrapper would check whether key_recorded_spec_ was None; if not, it would return key_recorded_spec_[spec_level_1][spec_level_2], and likewise dump self.lockfile[key_recorded_spec_][spec_level_1][spec_level_2] if it didn't exist. Obviously the code that dumps the defaults becomes slightly more complex, because it would need to be agnostic to the depth of the nested mapping, but otherwise I think this could do what you wanted (and one could also choose not to do it by simply not passing the production rule any arguments).
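
For concreteness, a rough sketch of such a depth-agnostic wrapper (hedged: the nested setdefault walk is just one way to do it, and trim_token is stubbed again):

import functools

def trim_token(name):
    # stand-in for reportengine's helper: strip the "produce_" prefix
    return name.split("_", 1)[1]

def record_from_defaults(f):
    lockkey = trim_token(f.__name__) + "_recorded_spec_"

    @functools.wraps(f)
    def f_(self, *spec_levels):
        if not spec_levels:
            # no spec arguments: record/replay the whole defaults mapping
            if lockkey not in self.lockfile:
                self.lockfile[lockkey] = f(self)
            return self.lockfile[lockkey]
        # walk the recorded nested mapping, creating levels as needed, so the
        # wrapper is agnostic to the depth of the defaults structure
        node = self.lockfile.setdefault(lockkey, {})
        for level in spec_levels[:-1]:
            node = node.setdefault(level, {})
        leaf = spec_levels[-1]
        if leaf not in node:
            # not recorded yet: the decorated rule returns
            # loaded_defaults[spec_level_1][spec_level_2]
            node[leaf] = f(self, *spec_levels)
        return node[leaf]

    return f_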

wilsonmr commented 3 years ago

Yep, looking into it literally rn. It's looking more intuitive at first glance; just playing with the VP rules implementation.

I wouldn't put too much effort into integrating that until this is more stable. But thanks for looking

Yeah, also one thing that's not clear to me is how we can create new defaults. E.g. say in NNPDF5.0 we have a new grouping like

standard_report: experiment
theory_covariance_report: nnpdf31_process

NNPDF5_report: NNPDF5_value

would we just have to update the default_groupings.yaml file? Or create a new one in conjunction with the original defaults file?

sorry to add entropy to the discussion btw

Well, the point is that the mapping which your production rule returns must have your new key in it, so you can either append to the existing file as you say, or you could put the default in a new file but make sure the result is added to the returned mapping. So for example you could have (this is a silly example, but if the default values were more complicated it might make more sense):

standard.yaml:

experiment

theorycov.yaml:

nnpdf31_process

then something like

@record_from_defaults
def produce_default_grouping(self):
    loaded_defaults = dict(
        standard_report=yaml.safe_load(
            read_text(validphys.defaults, "standard.yaml")),
        theory_covariance_report=yaml.safe_load(
            read_text(validphys.defaults, "theorycov.yaml")),
    )
    return loaded_defaults

which might be relevant for the filter rules, because they're quite long, so you could load separate files. In this case I'm not sure a single string is a valid yaml file, but imagine they were lists like the filter rules.

TL;DR: the only fixed thing here is that the production rule returns a mapping with all the defaults in it; it doesn't matter how they were loaded.
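
With the grouping example above, the recorded lockfile entry would then presumably look something like this (a sketch, with the key name following the <key>_recorded_spec_ convention from earlier):

default_grouping_recorded_spec_:
  standard_report: experiment
  theory_covariance_report: nnpdf31_process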

wilsonmr commented 3 years ago

Another thing is that it doesn't like dumping custom classes. As a minimal example:

from reportengine.compat import yaml

class Foo(yaml.YAMLObject):
    yaml_tag = u"Foo"
    def __init__(self, a):
        self.a = a

    def __repr__(self):
        return str({"a" : self.a})

if __name__ == '__main__':
    bar = Foo(5)

    with open("/tmp/foo.yaml", 'w') as stream:
        yaml.dump(bar, stream=stream, Dumper=yaml.RoundTripDumper)

gives

RepresenterError: cannot represent an object: {'a': 5}

I think for now it's safe to assume that defaults are things we would write in a runcard and therefore are not custom objects.

wilsonmr commented 3 years ago

Furthermore, the issue with your example is with your implementation:

https://yaml.readthedocs.io/en/latest/dumpcls.html#dumping-python-classes

EDIT:

i.e.

from reportengine.compat import yaml

class Foo(yaml.YAMLObject):
    yaml_tag = u"Foo"
    def __init__(self, a):
        self.a = a

    def __repr__(self):
        return str({"a": self.a})

if __name__ == '__main__':
    bar = Foo(5)
    _yaml = yaml.YAML()
    _yaml.register_class(Foo)
    with open("/tmp/foo.yaml", 'w') as stream:
        _yaml.dump(bar, stream)

seems to work just fine

Zaharid commented 3 years ago

Do we ever want to do anything with custom python classes? It seems to me defaults should be simple things that typically live in yaml files in the first place.

wilsonmr commented 3 years ago

That was my first comment

siranipour commented 3 years ago

It's because of the way rules are currently implemented in VP: it returns a list of custom objects. I guess I can add an intermediary production rule that parses the yaml file, and a separate one that returns the rule objects.

wilsonmr commented 3 years ago

That already exists: load_default_default_filter_settings is just parsing a yaml file. The only difference is that instead of choosing which file to load in there and just returning that list of rules, you would instead dump a mapping with all the possible specs (in that func), each of which points to its respective list of rules.

siranipour commented 3 years ago

Yeh sure, but I think it'd be best to scrap some of the old lock file machinery

siranipour commented 3 years ago

Ok I really like this redux. Works a lot better than the old one, and I've got a very reasonable lockfile out that saves the filter rules for now.

I still get that weird clashing of dataset namespaces:

[ERROR]: The key 'dataset' confilcts with a production rule and no parser is defined.

when I run VP on the lock file, but we had a hacky fix to this last time anyway

wilsonmr commented 3 years ago

Yeh sure, but I think it'd be best to scrap some of the old lock file machinery

Well, that's not lockfile machinery; the procedure of turning a list of rules into classes is something else. The lockfile mechanism concerns the list of rules which are parsed, not the end result.

siranipour commented 3 years ago

@record_from_defaults
def produce_default_grouping(self):
    loaded_defaults = dict(
        standard_report=yaml.safe_load(
            read_text(validphys.defaults, "standard.yaml")),
        theory_covariance_report=yaml.safe_load(
            read_text(validphys.defaults, "theorycov.yaml")),
    )
    return loaded_defaults

which might be relevant for the filter rules, because they're quite long, so you could load separate files. In this case I'm not sure a single string is a valid yaml file, but imagine they were lists like the filter rules.

TL;DR: the only fixed thing here is that the production rule returns a mapping with all the defaults in it; it doesn't matter how they were loaded.

So one thing that we've been using the lockfile system for is to use our own personal cuts for separate projects. We create a new defaults file, say personal_project_cuts.yaml inside validphys.cuts, and we simply specify in the runcard to use the new cuts file. This is a feature I quite like, instead of having to alter the config.py file.

wilsonmr commented 3 years ago

Minimally I suppose one can introduce that hacky approach; alternatively we can set the parse function inside this wrapper so the hacky fix is hidden from view.

There is something suboptimal about the current processing of rules, and I think that holding rules in a mapping would make certain operations easier. Then if I have user-defined rules for some subset of datasets, and the defaults for all datasets, then I could have

def produce_processed_rules_dataset(self, dataset, default_rules, explicit_rules, theoryid, defaults):
    ...
    if str(dataset) in explicit_rules:
        return Rule(
            initial_data=explicit_rules[str(dataset)],
            defaults=defaults,  # these should be renamed considering we will have many defaults
            theory_parameters=theory_parameters,
            loader=self.loader,
        )
    else:
        return Rule(
            initial_data=default_rules[str(dataset)],
            defaults=defaults,
            theory_parameters=theory_parameters,
            loader=self.loader,
        )

this would mean there wasn't a conflict, I think; it would simplify extracting the rule for a single dataset and also allow you to explicitly override the defaults for a subset of datasets without having to copy and paste all the rules.

Whilst we're altering this, it's clear that calling something "defaults" is not very future-proof, so we should probably also change that.

wilsonmr commented 3 years ago

So one thing that we've been using the lockfile system for is to use our own personal cuts for separate projects. We create a new defaults file, say personal_project_cuts.yaml inside validphys.cuts, and we simply specify in the runcard to use the new cuts file. This is a feature I quite like, instead of having to alter the config.py file.

Well, if personal_project_cuts.yaml is dynamic enough, then I'd argue it should have been defined in the runcard using the explicit rules system. If it's static enough, then you can add it as a proper default and update the available defaults, which requires updating the config once. Even if they are relatively dynamic, you only need to add to the config once; you can then modify personal_project_cuts.yaml, and each lockfile along the way will allow you to reproduce old results.
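
As a sketch of what "adding it as a proper default" might look like (hedged: the rule name, the file names and the nnpdf31 key are illustrative, reusing the read_text pattern from before):

@record_from_defaults
def produce_default_filter_rules(self):
    return dict(
        nnpdf31=yaml.safe_load(
            read_text(validphys.cuts, "filters.yaml")),
        personal_project=yaml.safe_load(
            read_text(validphys.cuts, "personal_project_cuts.yaml")),
    )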

wilsonmr commented 3 years ago

Sure. I guess I'm saying that whilst the defaults can be a mapping, a list or even a single string, whatever they are shouldn't conflict with production rules, which the rules currently do. We can introduce a way of working around this, but perhaps shouldn't rule out changing the format of the rules. I think that issue is project specific and nothing to do with reportengine though, so I'll open an issue in NNPDF/nnpdf.

Zaharid commented 3 years ago

Yes, I don't believe the hacky approach is going to cut it. So it seems to me we can either:

Yep, looking into it literally rn. It's looking more intuitive at first glance; just playing with the VP rules implementation.

I wouldn't put too much effort into integrating that until this is more stable. But thanks for looking

On the contrary, I think that was the mistake, to an extent, the first time around: it seems to me we should try to simultaneously work on support for filters, and ideally dataset defaults, to inform the implementation (and certainly have it before we all forget how the thing was supposed to be used...).

wilsonmr commented 3 years ago

This current system is supposed to not allow any keys, as discussed last week.

My comment "I wouldn't put too much effort into integrating that until this is more stable" was more because the way in which I take the defaults from the config, if possible, is possibly part of what is causing the key conflict, since I'm using parse_from_. Once we are sure the defaults are being found properly, then of course we should check that this works with the thing we intend it to work with.

wilsonmr commented 3 years ago

I have opened NNPDF/nnpdf#1105 to slightly expand my thoughts on how the cuts could be processed.

Zaharid commented 3 years ago

I am starting to think that the current system (as in the one in this PR) has the better ratio of complexity to usefulness. It is slightly inconvenient to get a large blob of settings every time you use a dataset, but on the other hand the situations where it matters are mostly theoretical as far as I can tell. Besides, we can always (re-)add some of the complexity later.

wilsonmr commented 3 years ago

Yeah, I think I agree. I was being stupid yesterday and thought there was basically one cut for each dataset and one cut for each process type, but I realise that's not true, and dumping the full list of rules probably works as well as anything in this case.

As you say, if we start with this we can always expand: I think it's compatible with the concept of later using args as keys in some nested mapping, which I think becomes more desirable with the dataset defaults. Although again, maybe we simply don't mind just dumping all the defaults.

In terms of the conflict between the production rule and the key in the filter rules, I wondered in NNPDF/nnpdf#1105 if the easiest option might be to just change the name of the key? It seems more robust than the hacky workaround, and AFAICT the key name in the filter rules is not particularly special, provided the Rule class is updated to reflect it. Obviously this would be inconvenient for those with custom rules, although find and replace should remedy it fairly easily.
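
For illustration, the rename might look like this in a filter-rules file (the replacement key name and the rule body are hypothetical):

# before: the 'dataset' key clashes with the dataset production rule
- dataset: NMC
  rule: Q2 > 1

# after: hypothetical renamed key, with the Rule class updated to match
- dataset_name: NMC
  rule: Q2 > 1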

wilsonmr commented 3 years ago

Ok, I'm not sure what you think of that, but I wasn't a huge fan of using try to load the defaults. Perhaps there is a better way of doing this than accessing _curr_input.

Zaharid commented 3 years ago

Hey, the tests of this repo seem to catch actual issues! And I am fine with this approach.

wilsonmr commented 3 years ago

Yeah! I will fix the tests.

Then I don't mind doing the corresponding PR that fixes the data grouping lockfile stuff. I guess we can keep the cuts PR separate, but that will also need to be changed, and perhaps we should at least start a dataset defaults draft to see how this works with that before merging this.

@siranipour is the issue with dataset resolved now that I'm not using parse_from_ with the recorded defaults?

siranipour commented 3 years ago

Yep works a treat now!! thanks

wilsonmr commented 3 years ago

Great, I will let you try and update NNPDF/nnpdf#818 - remember that you will have to decorate the rule which produces all the defaults (both the nnpdf31 ones in the lockfile location and the "current" ones, which can probably be called 40 or whatever) and the function which produces the key which is the default set of defaults to use if the user doesn't provide one. I'll let you come up with the names for these 😆

but then schematically you'll have a signature with

default_filter_rules, # all the default rules loaded
default_filter_rules_epoch, # the default key which pulls the right set of defaults from default_filter_rules (probably rename)
explicit_filter_rules=None, # if the user passes their own rules these take precedence
explicit_filter_rules_tag=None # if the user wants a different set of defaults this takes precedence over default_filter_rules_epoch
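
i.e. schematically something like this (a hedged sketch; produce_filter_rules is a placeholder name and the precedence logic just follows the comments above):

def produce_filter_rules(
    self,
    default_filter_rules,
    default_filter_rules_epoch,
    explicit_filter_rules=None,
    explicit_filter_rules_tag=None,
):
    # user-supplied rules take precedence over any defaults
    if explicit_filter_rules is not None:
        return explicit_filter_rules
    # a user-chosen tag takes precedence over the default epoch key
    key = explicit_filter_rules_tag or default_filter_rules_epoch
    return default_filter_rules[key]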

Then similarly with the things that are currently referred to as defaults, which probably want to be called something like cut_kinematics or a bit more descriptive?

wilsonmr commented 3 years ago

Stupid question, but do we actually need a mapping called filter_defaults which contains q2min and w2min, or could we just have defaults for those two keys independently?

@record_from_defaults
def produce_default_q2min(self):
    return <single value or mapping> # seems to me that we probably could just return the current default value here and don't need an epoch? How often do these change?

EDIT: we can discuss this on the respective PR, just a thought though

siranipour commented 3 years ago

Ok, lovely, many thanks for this PR Mikey. I think I'll make a fresh PR for a nice clean commit history.

siranipour commented 3 years ago

Stupid question, but do we actually need a mapping called filter_defaults which contains q2min and w2min, or could we just have defaults for those two keys independently?

@record_from_defaults
def produce_default_q2min(self):
    return <single value or mapping> # seems to me that we probably could just return the current default value here and don't need an epoch? How often do these change?

EDIT: we can discuss this on the respective PR, just a thought though

I think the idea was to have the defaults.yaml file in case we wanted to add more global parameters for the cuts in the future.