Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

[CLI] Add include other config files feature when parsing config file to LightningCLI #11221

Closed shenmishajing closed 2 years ago

shenmishajing commented 2 years ago

🚀 Feature

Add the ability to include other config files when using LightningCLI with the --config argument.

Motivation

If we want to train and test many models with the same or many datasets, we may want to split the config file into several parts, such as a trainer part, a model part, and a dataset part. The command to run our model will then look like this:

python tools/cli.py --config configs/default_runtime.yaml --config configs/datasets/xxx.yaml --config configs/models/xxx.yaml [--config some/thing/else.yaml] [--some.other thing]

LightningCLI allows us to include many config files at the same time, but the command becomes too long to write. Instead, we could write another config file that looks like this:

configs/runs/xxx.yaml:

__base__:
    - ../default_runtime.yaml
    - ../datasets/xxx.yaml
    - ../models/xxx.yaml

some:
    other: thing

Then, the command to run our model will look like:

python tools/cli.py --config configs/runs/xxx.yaml

This is easier to write and remember.

Alternatives

In pytorch_lightning/utilities/cli.py (around line 125), the LightningArgumentParser class adds the --config argument with the ActionConfigFile action from the jsonargparse package, so we need to override this action:

# Imports needed by the snippets below (module paths assume jsonargparse 4.x;
# adjust for the version in use).
import os
from typing import Dict, Optional

from jsonargparse import ActionConfigFile, Path
from jsonargparse.actions import _ActionSubCommands
from jsonargparse.loaders_dumpers import get_loader_exceptions, load_value
from jsonargparse.optionals import get_config_read_mode


def deep_update(source, override):
    """Update a nested dictionary or similar mapping.

    Modifies ``source`` in place.
    """
    if isinstance(source, Dict) and isinstance(override, Dict):
        if '__delete__' in override:
            delete_keys = override.pop('__delete__')
            if isinstance(delete_keys, str):
                delete_keys = [delete_keys]

            if isinstance(delete_keys, list):
                for k in delete_keys:
                    if k in source:
                        source.pop(k)
            elif delete_keys:
                return override
        for key, value in override.items():
            if isinstance(value, Dict) and key in source:
                source[key] = deep_update(source[key], value)
            else:
                source[key] = override[key]
        return source
    else:
        return override
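
# For illustration (hypothetical values): deep_update merges nested mappings
# recursively and honors the special '__delete__' key:
#   deep_update(
#       {'model': {'lr': 0.1, 'depth': 50}, 'data': {'batch': 8}},
#       {'model': {'lr': 0.01, '__delete__': 'depth'}},
#   )
#   -> {'model': {'lr': 0.01}, 'data': {'batch': 8}}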

def parse_dict_config(parser, cfg_file, cfg_path=None, **kwargs):
    if '__base__' in cfg_file:
        sub_cfg_paths = cfg_file.pop('__base__')
        if sub_cfg_paths is not None:
            if not isinstance(sub_cfg_paths, list):
                sub_cfg_paths = [sub_cfg_paths]
            # Resolve relative base paths against the directory of the current config file.
            sub_cfg_paths = [
                sub_cfg_path
                if sub_cfg_path.startswith('/') or cfg_path is None
                else os.path.join(os.path.dirname(cfg_path), sub_cfg_path)
                for sub_cfg_path in sub_cfg_paths
            ]
            # Merge all base configs in order, then apply the current file on top.
            sub_cfg_file = {}
            for sub_cfg_path in sub_cfg_paths:
                sub_cfg_file = deep_update(sub_cfg_file, parse_path(parser, sub_cfg_path, **kwargs).as_dict())
            cfg_file = deep_update(sub_cfg_file, cfg_file)
    if '__import__' in cfg_file:
        # '__import__' only hosts YAML anchors for reuse; drop it after loading.
        cfg_file.pop('__import__')

    for k, v in cfg_file.items():
        if isinstance(v, dict):
            cfg_file[k] = parse_dict_config(parser, v, cfg_path, **kwargs)
    return cfg_file

def parse_config(parser, cfg_file, cfg_path=None, **kwargs):
    return parser._apply_actions(parse_dict_config(parser, cfg_file.as_dict(), cfg_path, **kwargs))

def parse_path(parser, cfg_path, seen_cfg=None, **kwargs):
    abs_cfg_path = os.path.abspath(cfg_path)
    if seen_cfg is None:
        seen_cfg = {}
    elif abs_cfg_path in seen_cfg:
        if seen_cfg[abs_cfg_path] is None:
            # The file is still being parsed further up the stack.
            raise RuntimeError('Circular reference detected in config file')
        else:
            return seen_cfg[abs_cfg_path]

    cfg_file = parser.parse_path(cfg_path, **kwargs)
    seen_cfg[abs_cfg_path] = None
    cfg_file = parse_config(parser, cfg_file, cfg_path=cfg_path, seen_cfg=seen_cfg, **kwargs)
    seen_cfg[abs_cfg_path] = cfg_file
    return cfg_file

def parse_string(parser, cfg_string, **kwargs):
    return parse_config(parser, parser.parse_string(cfg_string, **kwargs), **kwargs)

class LightningActionConfigFile(ActionConfigFile):
    @staticmethod
    def apply_config(parser, cfg, dest, value) -> None:
        with _ActionSubCommands.not_single_subcommand():
            if dest not in cfg:
                cfg[dest] = []
            kwargs = {'env': False, 'defaults': False, '_skip_check': True, '_fail_no_subcommand': False}
            try:
                cfg_path: Optional[Path] = Path(value, mode=get_config_read_mode())
            except TypeError as ex_path:
                try:
                    if isinstance(load_value(value), str):
                        raise ex_path
                    cfg_path = None
                    cfg_file = parse_string(parser, value, **kwargs)
                except (TypeError,) + get_loader_exceptions() as ex_str:
                    raise TypeError(f'Parser key "{dest}": {ex_str}') from ex_str
            else:
                cfg_file = parse_path(parser, value, **kwargs)
            cfg[dest].append(cfg_path)
            cfg.update(cfg_file)
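
For context, a minimal sketch of how the overridden action could be exercised on its own (a hedged illustration rather than the actual Lightning wiring; the config path is a placeholder):

import jsonargparse

parser = jsonargparse.ArgumentParser()
parser.add_argument('--config', action=LightningActionConfigFile)
# Any '__base__' entries in the given file are resolved and merged
# before normal parsing continues.
cfg = parser.parse_args(['--config', 'configs/runs/xxx.yaml'])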

Additional context

The code that pops '__import__' from the config file makes it easier to use YAML references, so we can write the YAML more clearly. For example, when configuring a dataset we may want to define the transforms in the config file too, and with this little trick we can express shared transforms more clearly:

__import__:
    import_transform_2: &import_transform_2
        some: thing
    import_train_transforms: &import_train_transforms
        - transform_1
        - *import_transform_2
        - transform_3
    import_test_transforms: &import_test_transforms
        - transform_1
        - *import_transform_2
        - transform_3

data:
    train_transforms: *import_train_transforms
    test_transforms: *import_test_transforms

We could do this in such a small case without the feature too, but the config file would be ugly and misleading.



cc @borda @carmocca @mauvilsa

carmocca commented 2 years ago

Hi @shenmishajing, thanks for the thoughtful feature request.

This proposal fits better in jsonargparse, the library that the LightningCLI uses for parsing.

You can open the same request in their GitHub repository: https://github.com/omni-us/jsonargparse/issues

mauvilsa commented 2 years ago

@shenmishajing indeed if a proposal is not specific to what LightningCLI does, i.e. it applies in general to other parsing use cases, then it should be a request in the jsonargparse repo. Having said this, I do have a few comments:

Having different ideas from people to improve things is great, so thank you for posting. I hope you do propose improvements in jsonargparse or here. My comments are intended to be constructive and give you a broader perspective.

shenmishajing commented 2 years ago

@mauvilsa thanks for your reply. I will create a PR in the jsonargparse repo. As for your comments:

When we do research, we may design a model and run it with many combinations of hyper-params. The combinations are similar to each other, but some parts differ. With the raw CLI, we have to write a config file for every combination of hyper-params (otherwise we have to change some params in the shell command, which becomes tiresome as the params pile up), and then, if we change some common param, we have to change every config file. But with the __base__ feature, we can create a common config file as a base, use __base__ to express the other combinations of params, and then only change the base config file when we want to change some common param. This leads to fewer mistakes in the configs.

As for steepening the learning curve: anyone who has no motivation to use this feature can simply not use it. So I do not think it steepens the learning curve or causes more mistakes.

Thanks for your reply and suggestion again.

mauvilsa commented 2 years ago

@shenmishajing I described two already existing features that, to my understanding, already support your need. I am not sure if you didn't understand how to use them or if there is some detail about your use case which I didn't understand and which makes them impractical. Please ask if more clarification is needed, or explain your use case better. Introducing new special keywords in the config will likely not happen unless there is a compelling reason which has been very clearly explained and motivated and which is not already achievable with existing features.

shenmishajing commented 2 years ago

@mauvilsa I have understood the two ways you described, but they do not support my need. I will give some use cases.

Use Case 1: contrast experiments

Contrast experiments are a common use case in research. In general, we have to run many experiments with the same hyper-params except for one or a few params. It looks like this:

Run 1

model:
    param1: value1
    common: part
    some_thing: else

Run 2

model:
    param1: value2
    common: part
    some_thing: else

Run 3

model:
    param1: value3
    common: part
    some_thing: else

But after such a contrast experiment, we may also want to change some param and rerun the experiment a few times, because some params we set were not the best, or because we want to add another experiment. Then we want the config files to look like this:

Run 1

model:
    param1: value1
    common: part_with_different_value
    some_thing: else

Run 2

model:
    param1: value2
    common: part_with_different_value
    some_thing: else

Run 3

model:
    param1: value3
    common: part_with_different_value
    some_thing: else

But if there are many runs, changing every config file is error-prone, especially when we have many contrast experiments, or when we want to rerun an experiment from several weeks ago, because the config files carry no hierarchical structure or inheritance information.

But with the __base__ keyword, we can write config files like this:

Base

model:
    param1: value1
    common: part
    some_thing: else

Run 1

__base__: ./Base.yaml
model:
    param1: value1

Run 2

__base__: ./Base.yaml
model:
    param1: value2

Run 3

__base__: ./Base.yaml
model:
    param1: value3

Therefore, if we want to change the common part, we only have to change the Base.yaml config file, which leaves less room for mistakes. And we can recover the purpose of the contrast experiment even after several weeks, because the config files carry hierarchical structure and inheritance information: the differing params are written clearly in every run config file.

P.S. The param1 and common keys are just representative; in practice there may be several such keys, so writing them on the command line is not a good idea.

Use Case 2: a dataset or model with multiple modes

There are also cases in which our datasets or models have multiple modes, such as grayscale vs. RGB datasets, or aligned vs. unaligned multi-modality data. So we may define the datasets and models like this:

DataSet_grayscale

dataset:
    color_type: grayscale
    common: part
    some_thing: else

DataSet_RGB

dataset:
    color_type: RGB
    common: part
    some_thing: else

Again, if we want to change the common part, we have to change both config files, and if we add some new params, we also have to change both config files, like this:

DataSet_grayscale

dataset:
    color_type: grayscale
    new_param: new_value
    common: part_with_different_value
    some_thing: else

DataSet_RGB

dataset:
    color_type: RGB
    new_param: new_value
    common: part_with_different_value
    some_thing: else

With the __base__ keyword, we can write them like this:

Base

dataset:
    color_type: grayscale
    common: part
    some_thing: else

DataSet_grayscale

__base__: ./Base.yaml
dataset:
    color_type: grayscale

DataSet_RGB

__base__: ./Base.yaml
dataset:
    color_type: RGB

And then we only have to change Base.yaml to change the common part and add the new param:

Base

dataset:
    color_type: grayscale
    new_param: new_value
    common: part_with_different_value
    some_thing: else

This leaves less room for mistakes.

Use Case 3: reusing config files

This use case is not as general as the first two, but I think it is very cool.

Sometimes we want to try out different backbones for our model, or use an existing model (or a part of one) as a component of our model. We can do that with the __base__ keyword.

ResNet.yaml

model:
    class_path: models.ResNet50
    init_args:
        in_channels: 3

Our model

model:
    backbone:
        __base__:
            - [ ./ResNet.yaml, model ]

I have not pushed the code for this part to GitHub yet; if you are interested in it, I will update it soon.

Summary

In summary, the __base__ keyword gives config files a hierarchical structure and inheritance information, which helps us grasp the purpose of every config file quickly and make fewer mistakes.

And the reason the current features cannot support my need is that they cannot change some part directly.

If we use the style cli.py --config default_runtime.yaml --data datasets/xxx.yaml --model models/xxx.yaml, we have to write every param we want to change on the command line, which is error-prone and wearying.

If we use a single config file like:

data: datasets/xxx.yaml
model: models/xxx.yaml
...

We cannot change any param inside this file, which means we cannot write files like:

data: datasets/xxx.yaml
model: models/xxx.yaml
    param1: a_new_value
...

So these two approaches lead to writing more and more config files without hierarchical structure or inheritance information, which makes them hard to manage and makes it hard to recall the purpose of each file after several days.

mauvilsa commented 2 years ago

@shenmishajing what you want is already supported.

Use Case 1

You would have a base config like:

# Base.yaml
model:
    param1: value0
    common: part
    some_thing: else

Then you have a second config which overrides a few of the values from the base, for example:

# Run.yaml
model:
    param1: value1

The command would be run as:

cli.py --config Base.yaml --config Run.yaml
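
For clarity, the effective configuration after both files are applied is the following, since later --config files override earlier ones:

model:
    param1: value1
    common: part
    some_thing: else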

Regarding your comments:

Therefore, if we want to change the common part, we only have to change the Base.yaml config file, which leaves less room for mistakes.

You can do the same with what is already supported.

And we can recover the purpose of the contrast experiment even after several weeks, because the config files carry hierarchical structure and inheritance information: the differing params are written clearly in every run config file.

I am not entirely sure what you mean. If you change the Base.yaml file across different runs, and for each run you save only its run yaml containing __base__: ./Base.yaml, then it is not possible to know which version of Base.yaml was used. Also note that LightningCLI always saves the full config automatically in the log directory. This is what can really be trusted to know what was done for each run. Comparing runs is just simple diffs of the configs.

Use Case 2

Your second use case for the dataset seems to be the same as the first.

Use Case 3

This is also already supported, though it might depend on what exactly you are doing.

Assuming that your model has a backbone init parameter, similar to the first use case, you would reference another config:

model:
    backbone: ResNet.yaml

And the file ResNet.yaml would look like:

class_path: models.ResNet50
init_args:
    in_channels: 3

LightningCLI has the parameter save_config_multifile; if set to True, the config saved to the log directory preserves this structure instead of being a single big config.
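
A minimal sketch of enabling it (MyModel and MyDataModule are placeholder names, not classes from this thread):

from pytorch_lightning.utilities.cli import LightningCLI

# save_config_multifile preserves the multi-file structure when the
# config is written to the log directory, as described above.
cli = LightningCLI(MyModel, MyDataModule, save_config_multifile=True)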

shenmishajing commented 2 years ago

@mauvilsa I used this approach before, but it is also error-prone. The difference between declaring inheritance inside the config file and on the command line is that with the former we do not have to remember which config file should inherit from which.

Use Cases 1 and 2

There will be many hyper-params for a single model, so we design several experiments to find the best combination of hyper-params. If we do not define the inheritance relation in the config files, the run commands look like this:

# first experiment
cli.py --config models/base.yaml --config models/experiments_1/run_1.yaml
# second experiment
cli.py --config models/base.yaml --config models/experiments_1/run_best.yaml --config models/experiments_2/run_1.yaml

The third experiment may differ in some params.

# third experiment
cli.py --config models/base.yaml --config models/experiments_1/run_different.yaml --config models/experiments_2/run_best.yaml --config models/experiments_3/run_1.yaml

and so on. If we add the config parts for the dataset, trainer, and other things, the command becomes too long to edit, with too much inheritance structure to remember.

Use Case 3

In fact, I want ResNet.yaml to also be a complete config file that can be run on its own. A clearer use case may look like this:

Two models with the same backbone:

# model_1
model:
    backbone:
        class_path: models.ResNet50
        init_args:
            in_channels: 3
    some_thing: else

# model_2
model:
    backbone:
        __base__:
            - ./model_1.yaml: model.backbone
    another: things

mauvilsa commented 2 years ago

Use Case 3

Do this as follows:

# ResNet.yaml
class_path: models.ResNet50
init_args:
    in_channels: 3

# model_1
model:
    backbone: ResNet.yaml
    some_thing: else

# model_2
model:
    backbone: ResNet.yaml
    another: things

One person wanting to do it differently even though there are similar alternatives is not enough to add it as a new feature, even more so if it introduces new syntax such as ./model_1.yaml: model.backbone. You are free to open this as a feature request in jsonargparse, but it will not be worked on or merged unless there is a large group of people wanting this despite the existing features.

Use Cases 1 and 2

Now I see that your use case is that you have an increasing number of config files that should be applied one after the other, but you want to specify them in a config file so that the number of --config command line arguments does not increase. You can create this as a feature request in jsonargparse. But don't mix it with use case 3.

By the way, what you were doing in the pull request referenced in this issue is not how this would be implemented. It shouldn't be a modification of the ActionConfigFile class. Also, a merge of configs is not a simple deep merge of dicts; that would break some jsonargparse features. In reality this is more complex than you imagine.

shenmishajing commented 2 years ago

@mauvilsa thanks for your reply and suggestions, I will create a feature request in jsonargparse.

persiancoder12 commented 8 months ago

Is there a way to support adding args to objects that exist in a list in a second yaml file?

for example if I have a LightningCLI trainer with a default config

LightningCLI(default_config_file="default.yaml")

and a dataset class like this:


class DataSetClass:

    def __init__(self, path_to_data: Path):
        ...

and the default config looks something like

trainer:
    ...
datamodule:
    datasets:
        - mydatasets.DataSetClass

normally I can modify the init args of this DataSetClass in the default yaml via:

cli.py fit --datamodule.datasets.path_to_data /path/to/data

but I am trying to figure out how I can express the --datamodule.datasets.path_to_data arg in a second yaml file so it gets applied to the init args of the object inside the list

mauvilsa commented 8 months ago

It is possible to append items to a list; see the list-append docs.
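
For instance, a hedged sketch of the second yaml using the list-append syntax (the trailing + on the key appends an item to the list; whether this works in config files depends on the jsonargparse version, and the class and path are taken from the question above):

datamodule:
    datasets+:
        class_path: mydatasets.DataSetClass
        init_args:
            path_to_data: /path/to/data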