BSC-ES / autosubmit-config-parser

Library used to read Autosubmit 4 experiment data.
3 stars 1 forks source link

Improve dict_replace_value performance #14

Open LuiggiTenorioK opened 3 months ago

LuiggiTenorioK commented 3 months ago

In GitLab by @Lerriola on Jul 26, 2024, 10:48

_This idea of improvement arose while profiling the testing suite and identifying that the dict_replace_function performance (same for the list_replace_value) could be improved._

File where the function is found: https://earth.bsc.es/gitlab/es/autosubmit4-config-parser/-/blob/main/autosubmitconfigparser/config/configcommon.py#L562

The change would be to use memoization (adding a small "cache", so we don't go through the same values in the dict).

def dict_replace_value(self, d: dict, old: str, new: str, memo=None) -> dict:
    if memo is None:
        memo = {}

    # Check if we've already processed this dictionary
    if id(d) in memo:
        return memo[id(d)]

    x = {}
    for k, v in d.items():
        if isinstance(v, dict):
            v = self.dict_replace_value(v, old, new, memo)
        elif isinstance(v, list):
            v = self.list_replace_value(v, old, new, memo)
        elif isinstance(v, str):
            v = v.replace(old, new)
        x[k] = v

    # Remember this result
    memo[id(d)] = x
    return x

cc: @dbeltrankyl

LuiggiTenorioK commented 3 months ago

In GitLab by @dbeltrankyl on Aug 5, 2024, 10:31

Hello @Lerriola ,

Thanks for the suggestion. I have to check it, but I think this is unnecessary.

This function what it does is to replace a placeholder for another value. And the dict is the yaml structure, so it is recursive to go deeper in the dictionary.

Example:

Jobs:
 job:
   test: %test.a%
test:
 a: %test.b%
 b: "b"

What the code does:

1) iteration values

k =[jobs(dict),test(dict)]

v =[job(dict),a(str),b(str)]

2) iteration, parses jobs

k= [job(dict)] # note that K is V of previous iteration

v= [test(str)=%test.a%]

new_v = [test(str)=%test.b%]

3) iteration, parses test

k =[jobs(dict),test(dict)]

v =[job(dict),a(str)=%test.b%,b(str)="b"]

new_v = [job(dict),a="b",b="b"]

4) iteration, still on test

k =[jobs(dict),test(dict)]

v =[job(dict),a(str)="b",b(str)="b"]

new_v = "no changes"

End result:

Jobs:
 job:
   test: %test.b%
test:
 a: "b"
 b: "b"

So, as you can see ( or I tried to explain ), the values are only iterated once.


However, the dict_replace is, indeed, called multiple times with the original dictionary. In the example, you may notice that there is still a %placeholder%, so Autosubmit detects it and calls the whole function again to remove it.

So, the function to improve somehow is:

def substitute_dynamic_variables(self, parameters=None, max_deep=25, dict_keys_type=None, not_in_data="", in_the_end=False):

LuiggiTenorioK commented 3 months ago

In GitLab by @dbeltrankyl on Aug 5, 2024, 10:36

Also, stuff like having a %PLAceholder% that is never converted ( because the value is not found ) will cause to autosubmit to call this function 25 times. This is the first thing that should've improved and maybe this is the cause of the slowness on the testing suite

LuiggiTenorioK commented 3 months ago

In GitLab by @dbeltrankyl on Aug 5, 2024, 10:41

mentioned in issue testing_suite#65

LuiggiTenorioK commented 3 months ago

In GitLab by @dbeltrankyl on Aug 9, 2024, 12:15

Hello @Lerriola

I did a merge request to reduce the number of calls !7 ( not inside the function you mentioned but the calls that are calling it). It will reduce the delay of the testing suite, but I think it will not be enough only with that.

Also, since the AS versions require a specific version of the Autosubmitconfigparser to use this modification, you will need to wait until the beta of v4.1.10

LuiggiTenorioK commented 3 months ago

In GitLab by @Lerriola on Aug 9, 2024, 14:13

Hi @dbeltrankyl, thanks for all the help. I saw the merge request and saw that you basically removed the recursive calls there?

However, the dict_replace is, indeed, called multiple times with the original dictionary. In the example, you may notice that there is still a %placeholder%, so Autosubmit detects it and calls the whole function again to remove it.

So, the function to improve somehow is:

def substitute_dynamic_variables(self, parameters=None, max_deep=25, dict_keys_type=None, not_in_data="", in_the_end=False):

It is then this function substitute_dynamic_variables that there was some revisiting of the same values ? Sorry, I'm not really familiar with AS code and jumping in like this is a bit difficult.

But anyways, thanks for the improvement! I hope with this and other small nudges in the Tsuite or other sftwr can get the execution times down.

LuiggiTenorioK commented 4 weeks ago

So, the function to improve somehow is:

def substitute_dynamic_variables(self, parameters=None, max_deep=25, dict_keys_type=None, not_in_data="", in_the_end=False):

That's right! I think that it could be improved by accumulating the updates you're going to make to the dictionary and doing it in one pass.

Right now, for each N changes, the function goes through all the dictionary of size M. This leads to doing O(NM) operations.

If we can store all the changes and pass just once instead of reading the whole dictionary each time will be great. This reduces the operations to O(N+M) if the accumulated changes are stored in another dictionary.

LuiggiTenorioK commented 4 weeks ago

In GitLab by @dbeltrankyl on Oct 16, 2024, 13:21

Answered here https://earth.bsc.es/gitlab/es/autosubmit4-config-parser/-/merge_requests/11#note_313135