auto-optimization / iracepy

Using irace from Python
GNU General Public License v3.0
8 stars 5 forks source link

Swaping the irace instances with python objects #29

Open DE0CH opened 1 year ago

DE0CH commented 1 year ago

Since the number of things that can be converted into r objects and back is limited, I think it would be better if we just get the original python object from the other end using id.instance. As for the instances given to irace, I see three options:

  1. Replaced them with integers. The downside is that it will be more opaque what the instances are when someone is examining the log.
  2. Replace them with the json string for the object with json.dumps(obj, skipkeys=True, default=lambda x: '<not serializable>'), which would skip all the non-serializable objects.
  3. Convert it to equivalent r objects with best-effort. A hacky way to do it is to just convert it to json string and back. Or we can write our own recursive converter.

I would prefer the second option because it's the easiest to implement and human-readable in a log. For any missing data, we'll just ask the user to find the data somewhere else. For the third option, it seems like a lot of effort for not many benefits.

MLopez-Ibanez commented 1 year ago

I don't understand why this needs to be complicated in this way. Instances are strings (or optionally numbers) and those strings/numbers could represent the names of files or parameters for some generator or keys/index to some dictionary of objects... If the user wants to make them json strings, they can do it themselves.

If the user passes as instance something that is not a string nor a number, I would suggest to give a clear error.

DE0CH commented 1 year ago

It's easier and more convenient if they can pass plain old / not serializable python objects as instances. For example, in the irace tuning experiment, I am passing the pyrfr model straight through as an instance. This is a lot easier than passing a dictionary key or file path.

MLopez-Ibanez commented 1 year ago

It's easier and more convenient if they can pass plain old / not serializable python objects as instances. For example, in the irace tuning experiment, I am passing the pyrfr model straight through as an instance. This is a lot easier than passing a dictionary key or file path.

But the model is not an instance: it doesn't need to be randomized, nor selected by irace, it always stays the same, so there is no point in passing it through irace as an instance. It should be part of the context available to the target runner, either as a global variable, via a closure, as part of an object or a functor (https://www.stevenengelhardt.com/2013/01/16/python-multiprocessing-module-and-closures/).

Perhaps even easier (at least in R) is to use scenario$targetRunnerData to pass pyrfr around, since it is not touched by irace and it should never be copied or duplicated, so you can put anything there! (targetRunnerData is one of the elements in scenario that should not be converted between R/python)

Allowing arbitrary objects as instances means that irace would need to be extra careful with copying or duplicating them (issues with shallow copies vs deep copies), before printing the instances, saving them to disk, etc. Lots of things can go wrong there!

Thus, even if there were multiple models (or multiple dataframes or multiple 4D matrices or multiple networkx graphs) and each model was an "instance", it would be better to have the list of models stored as context or targetRunnerData and pass as instances just a range of integers so that irace can select a model from the list.

DE0CH commented 1 year ago

But the model is not an instance: it doesn't need to be randomized, nor selected by irace, it always stays the same, so there is no point in passing it through irace as an instance. It should be part of the context available to the target runner, either as a global variable, via a closure, as part of an object or a functor (https://www.stevenengelhardt.com/2013/01/16/python-multiprocessing-module-and-closures/).

But different instances of the meta irace use different models (algorithms). Also the training / test instance set split is also part of the instance. I could just pass in a string key which I then lookup with a dictionary in a global varaible but that's more complicated to write.

it would be better to have the list of models stored as context or targetRunnerData and pass as instances just a range of integers so that irace can select a model from the list.

I think this is similar to that I am doing now. But instead of passing range of integers it just passes the json serialized string https://github.com/DE0CH/iracepy/blob/ffff0b1fea077e997940d8808d1f8bfce0c075f9/src/irace/__init__.py#L158. This makes the log more human readable.

MLopez-Ibanez commented 1 year ago

But different instances of the meta irace use different models (algorithms). Also the training / test instance set split is also part of the instance. I could just pass in a string key which I then lookup with a dictionary in a global varaible but that's more complicated to write.

I guess we have different ideas of what complicated is, because using json to serialize/unserialize python objects into strings seems more complicated to me than using a simple dictionary which may be 2-3 lines of code without extra dependencies. Nevertheless, it seems to me you can still do it the way you want it without having to do it within irace. See below.

I think this is similar to that I am doing now. But instead of passing range of integers it just passes the json serialized string https://github.com/DE0CH/iracepy/blob/ffff0b1fea077e997940d8808d1f8bfce0c075f9/src/irace/__init__.py#L158. This makes the log more human readable.

If you really want to do it in that way, why not do it before creating the irace object so that the instances you give to irace are already strings? This way iracepy doesn't need to know all these details. and for users whose instances are simple strings or numbers, they don't need json nor serialization and they don't pay the cost of this complexity.

You may even add an example or a test of json serialization outside irace so it can be useful to others if they want to do that.

DE0CH commented 1 year ago

I think there might be a misunderstanidng. My idea is like the following (python mixed with English pseudocode)

def make_target_runner(context):
    def target_runner(experiment, configuraiton):
        ...
        if the user provided instances is not a list of strings or numbers:
            experiment['instance'] = context['instances'][int(experiment['instance.id']) - 1]
        ...
    return target_runner
...
class Irace:
    def __init__(self, scenario, parameters_table, target_runner):
        ...
        if 'instances' in scenario:
            if instances is not a list of strings or numbers:
                self.scenario['instances'],  context['instances'] = np.arrange(len(scenario['instance'])), scenario['instance']
        ...

Essentially, if the user gives us custom objects, we just don't pass it to irace but we just keep track of it in iracepy. If we don't do this, the user has to do this themselves whcih is not as user friendly.

On top of this, we can make the log more human readable by using StrVector(list(map(lambda x: json.dumps(x, skipkeys=True, default=lambda x: '<not serializable>'), self.scenario['instances']))) instead of np.arrange(len(scenario['instance'])). We are not actually depending on json to pass data through. If something is not serializable it will jsut be replaced by a big fat <not serializalbe>.