ZhiyuanChen / CHANfiG

Easier Configuration
https://chanfig.danling.org
Other
30 stars 3 forks source link

Feature Request: variable interpolation in yaml file #14

Closed dc3ea9f closed 1 year ago

dc3ea9f commented 1 year ago

Hi, I just found there exists an exciting and very useful feature "variable interpolation" provided by OmegaConf .

Detailed documents can be found here. And here is a comparison:

image

In short, it can interpolate values that are annotated with ${} in lazy mode. I think this reference is usable when creating configurations from yaml, just like Variable in chanfig.

How do you like it?

Best!

ZhiyuanChen commented 1 year ago

Thank you for opening this.

This is indeed a very important feature and was on the TODO list since last year.

I believe this was postponed because we need to wait pyyaml for anchor and alias implementation. I'll double check their progress this week.

dc3ea9f commented 1 year ago

Aha! This is a feature of pyyaml rather than omegaconf, through omegaconf implemented it.

Thank you for your information.

You can close this issue since it should be solved by pyyaml.

ZhiyuanChen commented 1 year ago

I double checked the progress.

Now, pyyaml has full support with alias and anchor

So, the following config is working:

test.yaml:

a: &a 1
b: *a
>>> import chanfig
>>> chanfig.load('test.yaml')
NestedDict(
  ('a'): 1
  ('b'): 1
)

However, the ${} is not described in yaml standard and so will not be supported by pyyaml. Please let me know if anchor and alias facility your needs.

dc3ea9f commented 1 year ago

Thank you for your efforts.

While I think the anchor and alias are weaker than the value interpolation. I think the implementation of value interpolation is essential, it can help us build more flexible configurations.

ZhiyuanChen commented 1 year ago

No worries, they are indeed very different. anchor and alias are great for longer duplications, while ${} is great for Variables. In the next release, we will add a format function to implement this behavior.

dc3ea9f commented 1 year ago

Nice job! Looking forward to your updates.

================

Mohan Zhou Ph.D. Student, Harbin Institute of Technology https://www.mhzhou.com/


From: Zhiyuan Chen @.> Sent: Friday, July 21, 2023 4:34:09 PM To: ZhiyuanChen/CHANfiG @.> Cc: Mohan Zhou @.>; Author @.> Subject: Re: [ZhiyuanChen/CHANfiG] Feature Request: variable interpolation in yaml file (Issue #14)

No worries, they are indeed very different. anchor and alias are great for longer duplications, while ${} is great for Variables. In the next release, we will add a format function to implement this behavior.

— Reply to this email directly, view it on GitHubhttps://github.com/ZhiyuanChen/CHANfiG/issues/14#issuecomment-1645215984, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHWHJ7VUUZ7MERVMCYRULBLXRI5IDANCNFSM6AAAAAA2HMX52U. You are receiving this because you authored the thread.Message ID: @.***>

ZhiyuanChen commented 1 year ago

Variable interpolation is available at the master branch. Please let me know if you like it.

If everything works fine, we will be releasing v0.0.88.

dc3ea9f commented 1 year ago

Thank you for your efforts.

However, I found the interpolation only takes effect when the value is ${}, as this line does. It is just like anchor/alias in yaml, and cannot handle things like "http://${server.host}:${server.port}".

You can test with this script:

from chanfig import Config
from rich import print
import io

yaml = """server:
  host: localhost
  port: 80

client:
  url: http://${server.host}:${server.port}/
  server_port: ${server.port}
  description: Client of ${.url}
"""

if __name__ == '__main__':
    yaml = io.StringIO(yaml)
    config = Config.from_yaml(yaml)
    print(config)
    print(config.interpolate())

The result is:

Config(<class 'chanfig.config.Config'>,
  ('server'): Config(<class 'chanfig.config.Config'>,
    ('host'): 'localhost'
    ('port'): 80
  )
  ('client'): Config(<class 'chanfig.config.Config'>,
    ('url'): 'http://${server.host}:${server.port}/'
    ('server_port'): '${server.port}'
    ('description'): 'Client of ${.url}'
  )
)
Config(<class 'chanfig.config.Config'>,
  ('server'): Config(<class 'chanfig.config.Config'>,
    ('host'): 'localhost'
    ('port'): 80
  )
  ('client'): Config(<class 'chanfig.config.Config'>,
      ('url'): 'http://${server.host}:${server.port}/'    <--- ERROR, expected 'http://localhost:80/'
    ('server_port'): 80    <--- Fine
    ('description'): 'Client of ${.url}'    <--- ERROR, expected 'Client of http://localhost:80/'
  )
)

We can find only client.server_port is interpolated, and client.url and client.description are not as expected.

Given the example before, we may need to find all value that contains ${}, rather than equal to ${}. In your implementation, the variable interpolation's functionality is strongly limited (why I don't just use anchor and alias?). Given that values need to be interpolated, a parse tree or recursively implementation may be involved to resolve the true value. When implementing this, please pay attention to allowing escape characters, e.g. \${} won't be interpolated.

Besides, I am also thinking about how to maximize the performance of variance interpolation. A possible solution is we can predefine some interpolates and the user can select which to use. And we need to expose the interpolate function to Config.from_yaml.

E.g.:

from chanfig import Config
from chanfig.interpolators import (
    DefaultInterpolator,
    EnvInterpolator,
    LambdaInterpolator,
    LambdaRegexInterpolator,
)

config = Config.from_yaml(yaml)  # would use DefaultInterpolator
config = Config.from_yaml(
    yaml, interpolate=[DefaultInterpolator, EnvInterpolator]
)  # would handle default and ${Env.xxx} with environment variables
config = Config.from_yaml(
    yaml,
    interpolate=[
        DefaultInterpolator,
        EnvInterpolator,
        LambdaInterpolator(cond, mapping),
    ],
)  # would handle default, env and user customed interpolator

By the way, I think when people use the ${} in yaml, they mostly expect variable interpolation rather than the raw value. So the interpolate can be set to default behavior.

dc3ea9f commented 1 year ago

Another thing, I can only install the repo with pip install git+https://github.com/ZhiyuanChen/CHANfiG. When I tried git clone https://github.com/ZhiyuanChen/CHANfiG && cd CHANfiG, and then python setup.py develop, it fails with ModuleNotFoundError: No module named 'chanfig._version'. I am not familiar with package distribution, can you fix this?

One more thing, when use Config.from_yaml() to load a non-existent file, it would raise the error:

Traceback (most recent call last):
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/site-packages/chanfig/flat_dict.py", line 1174, in open
    file = open(file, *args, encoding=encoding, **kwargs)  # type: ignore # noqa: SIM115
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxx/Documents/tmp/test.py", line 7, in <module>
    config = Config.from_yaml('config.yaml')
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/site-packages/chanfig/flat_dict.py", line 1090, in from_yaml
    with cls.open(file) as fp:  # pylint: disable=C0103
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/site-packages/chanfig/flat_dict.py", line 1177, in open
    file.close()  # type: ignore
    ^^^^^^^^^^
AttributeError: 'str' object has no attribute 'close'

The second exception is not expected, we can modify here with:

try:
    file = open(file, *args, encoding=encoding, **kwargs)  # type: ignore # noqa: SIM115
    yield file  # type: ignore
except:
    raise
else:
    file.close()  # type: ignore
ZhiyuanChen commented 1 year ago

Thank you for your efforts.

However, I found the interpolation only takes effect when the value is ${}, as this line does. It is just like anchor/alias in yaml, and cannot handle things like "http://${server.host}:${server.port}".

You can test with this script:

from chanfig import Config
from rich import print
import io

yaml = """server:
  host: localhost
  port: 80

client:
  url: http://${server.host}:${server.port}/
  server_port: ${server.port}
  description: Client of ${.url}
"""

if __name__ == '__main__':
    yaml = io.StringIO(yaml)
    config = Config.from_yaml(yaml)
    print(config)
    print(config.interpolate())

The result is:

Config(<class 'chanfig.config.Config'>,
  ('server'): Config(<class 'chanfig.config.Config'>,
    ('host'): 'localhost'
    ('port'): 80
  )
  ('client'): Config(<class 'chanfig.config.Config'>,
    ('url'): 'http://${server.host}:${server.port}/'
    ('server_port'): '${server.port}'
    ('description'): 'Client of ${.url}'
  )
)
Config(<class 'chanfig.config.Config'>,
  ('server'): Config(<class 'chanfig.config.Config'>,
    ('host'): 'localhost'
    ('port'): 80
  )
  ('client'): Config(<class 'chanfig.config.Config'>,
      ('url'): 'http://${server.host}:${server.port}/'    <--- ERROR, expected 'http://localhost:80/'
    ('server_port'): 80    <--- Fine
    ('description'): 'Client of ${.url}'    <--- ERROR, expected 'Client of http://localhost:80/'
  )
)

We can find only client.server_port is interpolated, and client.url and client.description are not as expected.

Given the example before, we may need to find all value that contains ${}, rather than equal to ${}. In your implementation, the variable interpolation's functionality is strongly limited (why I don't just use anchor and alias?). Given that values need to be interpolated, a parse tree or recursively implementation may be involved to resolve the true value. When implementing this, please pay attention to allowing escape characters, e.g. \${} won't be interpolated.

Besides, I am also thinking about how to maximize the performance of variance interpolation. A possible solution is we can predefine some interpolates and the user can select which to use. And we need to expose the interpolate function to Config.from_yaml.

E.g.:

from chanfig import Config
from chanfig.interpolators import (
    DefaultInterpolator,
    EnvInterpolator,
    LambdaInterpolator,
    LambdaRegexInterpolator,
)

config = Config.from_yaml(yaml)  # would use DefaultInterpolator
config = Config.from_yaml(
    yaml, interpolate=[DefaultInterpolator, EnvInterpolator]
)  # would handle default and ${Env.xxx} with environment variables
config = Config.from_yaml(
    yaml,
    interpolate=[
        DefaultInterpolator,
        EnvInterpolator,
        LambdaInterpolator(cond, mapping),
    ],
)  # would handle default, env and user customed interpolator

By the way, I think when people use the ${} in yaml, they mostly expect variable interpolation rather than the raw value. So the interpolate can be set to default behavior.

Thank you for your feedback! I'll work on it later next week.

ZhiyuanChen commented 1 year ago

Another thing, I can only install the repo with pip install git+https://github.com/ZhiyuanChen/CHANfiG. When I tried git clone https://github.com/ZhiyuanChen/CHANfiG && cd CHANfiG, and then python setup.py develop, it fails with ModuleNotFoundError: No module named 'chanfig._version'. I am not familiar with package distribution, can you fix this?

Have you tried to install with pip install -e .? Running setup.py is deprecated and may resulted in unexpected behavior.

One more thing, when use Config.from_yaml() to load a non-existent file, it would raise the error:

Traceback (most recent call last):
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/site-packages/chanfig/flat_dict.py", line 1174, in open
    file = open(file, *args, encoding=encoding, **kwargs)  # type: ignore # noqa: SIM115
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxx/Documents/tmp/test.py", line 7, in <module>
    config = Config.from_yaml('config.yaml')
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/site-packages/chanfig/flat_dict.py", line 1090, in from_yaml
    with cls.open(file) as fp:  # pylint: disable=C0103
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/Users/xxx/anaconda3/envs/dev/lib/python3.11/site-packages/chanfig/flat_dict.py", line 1177, in open
    file.close()  # type: ignore
    ^^^^^^^^^^
AttributeError: 'str' object has no attribute 'close'

The second exception is not expected, we can modify here with:

try:
    file = open(file, *args, encoding=encoding, **kwargs)  # type: ignore # noqa: SIM115
    yield file  # type: ignore
except:
    raise
else:
    file.close()  # type: ignore

Thanks for reporting this. The finally is meant to catch any possible error when opening file and ensure it's closed, to avoid memory leaks (In some cases, the FileNotFoundError may be catched). So it must remains in the finally clause. Although the exception in attempting to close the file should be suppressed. I have just pushed a fix for this.

dc3ea9f commented 1 year ago

Running pip install -e . fix it, thanks!

Would the memory leakage occur even if the file is closed?

from contextlib import contextmanager

class X:
    def __init__():
        pass

    @classmethod
    @contextmanager
    def open(self, filename):
        try:
            f = open(filename, 'r')
            yield f
        except:
            raise
        else:
            f.close()

with X.open("test.yaml") as f:
    print(f.read())
print(f.closed)

Output:

server:
  host: localhost
  port: 80

client:
  url: http://${server.host}:${server.port}/
  server_port: ${server.port}
  description: Client of ${.url}
True
ZhiyuanChen commented 1 year ago

Would the memory leakage occur even if the file is closed?

No, but if open successfully opened a file handle, but also raise an exception, it will goes to except: raise clause. And if this Exception is handled elsewhere, the file would remain opens, leading to memory leakage.

ZhiyuanChen commented 1 year ago

However, I found the interpolation only takes effect when the value is ${}, as this line does. It is just like anchor/alias in yaml, and cannot handle things like "http://${server.host}:${server.port}".

Current implementation (at develop branch) has support this feature. Although there are some drawbacks, since Variable computes eagerly, url=http://${server.host}:${server.port} will not be able to save to a Variable object, meaning that if the value of server.host or server.port changes, url will not change accordingly.

Besides, the interpolation is not raising error with circular interpolation.

Also, I noticed that you are creating an IO to load string.

if __name__ == '__main__':
    yaml = io.StringIO(yaml)
    config = Config.from_yaml(yaml)
    print(config)
    print(config.interpolate())

Did you know you can do it by calling from_yamls?

if __name__ == '__main__':
    config = Config.from_yamls(yaml)
    print(config)
    print(config.interpolate())
ZhiyuanChen commented 1 year ago

Besides, the interpolation is not raising error with circular interpolation.

Just fixed this.

dc3ea9f commented 1 year ago

No, but if open successfully opened a file handle, but also raise an exception, it will goes to except: raise clause. And if this Exception is handled elsewhere, the file would remain opens, leading to memory leakage.

OK, though i think suppress all exception is a little dangerous, I can accept it for now.

Current implementation (at develop branch) has support this feature.

Great! Thank you.

Although there are some drawbacks, since Variable computes eagerly, url=http://${server.host}:${server.port} will not be able to save to a Variable object, meaning that if the value of server.host or server.port changes, url will not change accordingly.

It would be better if we could solve this issue. Can we compute the interpolation lazily when we need it?

Did you know you can do it by calling from_yamls?

Thank you for your advice. It's really better.

By the way, would you plan to implement move the interpolate to default behavior or implement pre-defined interpolators, e.g. ${ENV.local_rank} or ${import ./model.yaml}?

dc3ea9f commented 1 year ago

Got errors when using interpolate with parse.

ZhiyuanChen commented 1 year ago

OK, though i think suppress all exception is a little dangerous, I can accept it for now.

Don't worries, it will only supress all exception when closing the file. It will still arise Exception in opening a file.

It would be better if we could solve this issue. Can we compute the interpolation lazily when we need it?

I have thought about it. I don't think it's a good idea to support lazy interpolation. There are two main reasons:

  1. lazy interpolation will significantly damage the performance.

dict member access is extremely fast in chanfig, at least twice faster in FlatDict and ~10 times faster compare to yacs, ml_collections and addict.

Variable interpolation, on the other hand, is very very complex. We use regex to identify all the placeholders (wait, this is a better name that interpolatee, which is not even a word), this takes O(k) time. Then, we identify if it points to it self, this also takes O(k) time. Next, we check if there exists circular reference, which takes O(n log n) time. They will have significatn impact on the performance of chanfig (and chanfig was designed for extreme use case, where constant terms matter).

  1. lazy interpolation creates inequality.

Lazy interpolation will have "master member" and "slave member". Change the value of "mater member" will also change the value of "slave member", but not the other way. Although I don't care about PC, I mean, I'm a Chineses, my ancestors did nothing to black friends. But this is against my philosophy. Every member should be equal.

To address this issue, in the later version of chanfig, we will introduce a new update on Variable (or maybe a brand new object). This Object will be more similar to FieldReference which supports variable level lazy evaluation.

By the way, would you plan to implement move the interpolate to default behavior or implement pre-defined interpolators, e.g. ${ENV.local_rank} or ${import ./model.yaml}?

Hmmmm, I think I'll go the first way, as pre-defined interpolator is comparably less intuitive.

For the first one, I'll add another parameter to allow read environment variable more easily.

Currently, interpolate() accepts an interpolator parameter. So, you could call interpolate(chanfig.load("model.yaml")). Or, could you provide more example usage?

ZhiyuanChen commented 1 year ago

Got errors when using interpolate with parse.

Just fixed this!

dc3ea9f commented 1 year ago

sorry for the late response, I have just tested with the newest develop branch and found an issue: elements in the list of dict won't be interpolated.

To address this issue, in the later version of chanfig, we will introduce a new update on Variable (or maybe a brand new object). This Object will be more similar to FieldReference which supports variable level lazy evaluation.

I think it is OK. Thank you for your efforts.

Currently, interpolate() accepts an interpolator parameter. So, you could call interpolate(chanfig.load("model.yaml")). Or, could you provide more example usage?

But we cannot write "load model.yaml" in yaml config and load it correctly right? When I am writing configurations for my experiments, I faced this issue, since a model's parameters are large (refer to transformers or diffusers) and relatively fixed, and dataset's parameters can be reused across many experiments, I decided to separate these configuration files. But without grammar like ${import xx.yaml}$, I turned out to write a python config, as shown in #16 .

ZhiyuanChen commented 1 year ago

sorry for the late response, I have just tested with the newest develop branch and found an issue: elements in the list of dict won't be interpolated.

Copy, will work on it later, probably next week.

But we cannot write "load model.yaml" in yaml config and load it correctly right? When I am writing configurations for my experiments, I faced this issue, since a model's parameters are large (refer to transformers or diffusers) and relatively fixed, and dataset's parameters can be reused across many experiments, I decided to separate these configuration files. But without grammar like ${import xx.yaml}$, I turned out to write a python config, as shown in #16 .

Ahhh, I understand your needs.

16 is partially fixed, have you tried it?

For combining multiple yaml file, this is a very solid and sound needs. We will be tracking this in a new issue

ZhiyuanChen commented 1 year ago

sorry for the late response, I have just tested with the newest develop branch and found an issue: elements in the list of dict won't be interpolated.

Could you provide a minimum working example?

I have the following yaml, but it is not parsable by pyyaml.

a: 1
b: ${a}
c: [${a}, ${b}]
dc3ea9f commented 1 year ago

Yes, I tried #16 and use it as the current implementation. While compare to yaml config, python config are too long...

When trying to create the minimum working example, I use the latest dev branch 0.0.88.dev22+g6e303f9, and got an error TypeError: 'Config' object is not callable when calling interpolate, does the interface changed?

ZhiyuanChen commented 1 year ago

Yes, I tried #16 and use it as the current implementation. While compare to yaml config, python config are too long...

I'm working on an api like dataclass, which should improve it a bit.

When trying to create the minimum working example, I use the latest dev branch 0.0.88.dev22+g6e303f9, and got an error TypeError: 'Config' object is not callable when calling interpolate, does the interface changed?

Could you provide a minimum example for this? I don't think there was anything change in interpolate.

dc3ea9f commented 1 year ago

Our server's power supply was flooded, here's a possible example:

data_dir: /home/data
# val_data_dir: /home/data_eval

# data
data_args:
  data_type: "any"
  data_dict:
    imagenet:
      data_dirs:
        - ${data_dir}/X-A
        - ${data_dir}/X-B
        - ${data_dir}/X-C
ZhiyuanChen commented 1 year ago

This is much harder than it appears to be. Could you try the latest develop branch and see if it works as you expected?