allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.48k stars, 643 forks

task.connect(dict) creates a read-only dictionary on Linux #872

Open Vadim2S opened 1 year ago

Vadim2S commented 1 year ago

Describe the bug

After calling task.connect(dictionary) I cannot change the dictionary's values. Linux only.

To reproduce

I use code like this:

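from pathlib import Path

from clearml import Dataset, Task
from yaml import unsafe_load  # assuming PyYAML is the YAML loader in use

task = Task.init(....)
...
def main(args):
    # Load config.yaml into a plain dict
    with open(args.config, "rb") as f:
        config = unsafe_load(f)

    if in_cloud:
        # Connect the dict to the task so it is tracked as task configuration
        config = task.connect(config)

        # Workaround for the config bug: a copy can be modified
        # config = config.copy()

        dset = Dataset.get(
            dataset_id=None,
            dataset_name=args.corpus_dir,
            only_completed=True,
            only_published=False,
        )
        corpus_dir = Path(dset.get_local_copy())
    else:
        corpus_dir = Path(args.corpus_dir)

    # This assignment is the one that does not take effect on Linux
    config['train_data_dir'] = str(corpus_dir.joinpath('training.csv'))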

Expected behaviour

In all cases, config['train_data_dir'] contains a path like clearml_cache_dataset_dir/training.csv

Actual behaviour

Linux (Ubuntu 20.04): config['train_data_dir'] still contains the old, unchanged value (error)

Windows 10: config['train_data_dir'] contains the new path, as expected

Environment

Related Discussion

A very strange Linux-only error. As a workaround, uncomment the 'config = config.copy()' line. The new config instance can be modified as expected.

jkhenning commented 1 year ago

Hi @Vadim2S,

What is in_cloud? When you call Task.connect(), is your code being run by a ClearML Agent?

Vadim2S commented 1 year ago

Sorry, my bad. I just copied my code without comments. Here are more details:

in_cloud is set to True whenever ClearML is used, i.e. always.

My workflow is as follows:

1) I try out new code on Windows. The code runs locally without a ClearML Agent, but uses ClearML's Task.init, and the Dataset is loaded from the ClearML server. The config dictionary after "config = task.connect(config)" is mutable.

2) I clone the task from 1), change the configs and send it to a Linux queue. The code runs on a remote computer under a ClearML Agent. The Dataset is loaded from the ClearML server. The config dictionary after "config = task.connect(config)" is READ-ONLY.

I.e. the minimal reproduction code is something like:

    config = unsafe_load(open(args.config, "rb"))  # config.yaml loaded here
    config = task.connect(config)
    # config = config.copy()
    # Not sure whether Dataset loading matters for this case
    config['val_name'] = 'new_val'
    print(config['val_name'])

Expected result is "new_val" output.

In case 1) I get the "new_val" output. In case 2) I get the "old_val" output, without any error.

If I uncomment the "config = config.copy()" line, I get the "new_val" output in case 2) too.
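To illustrate why the copy workaround helps, here is a minimal sketch in plain Python. It does not use ClearML: FrozenValuesDict is a hypothetical stand-in that only mimics the symptom (writes silently dropped), not how the SDK actually implements the connected dictionary. The point it shows is that dict.copy() on a dict subclass returns a plain dict, which is detached from whatever the subclass does on assignment.

```python
class FrozenValuesDict(dict):
    """Hypothetical stand-in for a connected config; NOT ClearML code."""

    def __setitem__(self, key, value):
        # Silently drop the write, mimicking the symptom reported above.
        pass


# Keyword initialisation bypasses __setitem__, so the initial value is stored.
connected = FrozenValuesDict(val_name="old_val")
connected["val_name"] = "new_val"   # no error, but nothing changes
print(connected["val_name"])        # old_val

detached = connected.copy()         # dict.copy() returns a plain dict
detached["val_name"] = "new_val"
print(type(detached).__name__, detached["val_name"])  # dict new_val
```

Note that copy() is shallow, so nested dictionaries inside the config would still be shared with the original object.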

jkhenning commented 1 year ago

@Vadim2S this is the intended behavior, and it does not actually depend on the operating system. It depends on the fact that you are running the first flow without an agent (i.e. in what we call "local" or "development" mode). In that case the ClearML SDK is designed to "record" what you do and build a reproducible environment on the server for your task. This includes what you set into that dictionary, even after you connect it to the task. In the second flow, you are running in what we call "remote" mode, i.e. using an agent to execute your task. In that case, the agent is designed to reproduce the environment (including whatever you set into the dictionary in the local run) and make sure that's what your code gets, which is why the dictionary is read-only. If you clone the task and change things in it before enqueuing, what your code gets in the remote run is whatever was stored on the server, including any changes you made (for example, to the connected configuration).
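The record-then-replay idea described above can be sketched as a toy model in plain Python. Everything here is hypothetical and for illustration only: connect, RecordingDict and server_store are made-up names, a plain dict stands in for the server, and none of this is the SDK's real implementation. The sketch also leaves out the read-only aspect and only shows where the values come from in each mode.

```python
class RecordingDict(dict):
    """Dict that reports every write to a callback (hypothetical helper)."""

    def __init__(self, on_write, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._on_write = on_write

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._on_write(key, value)  # "record" the change


def connect(config, server_store, remote):
    """Toy model of local vs. remote mode; not the ClearML API."""
    if not remote:
        # Local mode: push the values to the store and keep recording writes.
        server_store.update(config)
        return RecordingDict(server_store.__setitem__, config)
    # Remote mode: values stored on the "server" override the code's defaults.
    merged = dict(config)
    merged.update(server_store)
    return merged


store = {}
local_cfg = connect({"lr": 0.1}, store, remote=False)
local_cfg["lr"] = 0.5          # recorded even after connecting
store["lr"] = 0.01             # simulates editing the cloned task

remote_cfg = connect({"lr": 0.1}, store, remote=True)
print(remote_cfg["lr"])        # 0.01, the stored value wins
```

In this model the remote run never sees the 0.1 written in the code, which matches the description that the code gets whatever was stored on the server.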

Vadim2S commented 1 year ago

I presumed (from the documentation) that "the agent reproduces the environment" happens at the code line "config = task.connect(config)", and that is all. After that I can change the environment as I wish. Am I wrong?

I am OK with it if read-only is intended. But this raises two new bugs or suggestions, whichever you prefer to call them: 1) The behaviour must be consistent, i.e. always read-only or always mutable. 2) ClearML must raise an error when I attempt to modify my config.

I wasted a whole day of work running remote model training with the wrong dataset, because NOTHING indicated that my config was not changed the way it was in the locally tested code.
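For reference, suggestion 2) above is how Python's own read-only mapping view behaves. The sketch below uses types.MappingProxyType from the standard library; it is only an illustration of "fail loudly on write" and is not something ClearML is known to use for connected dictionaries.

```python
from types import MappingProxyType

config = {"val_name": "old_val"}
read_only = MappingProxyType(config)  # read-only view over the dict

error = None
try:
    read_only["val_name"] = "new_val"
except TypeError as exc:
    # The write is rejected with an exception instead of being ignored.
    error = exc
    print(f"write rejected: {exc}")

print(read_only["val_name"])  # old_val
```

With this behaviour the failed assignment would have surfaced immediately in the remote run instead of silently keeping the old value.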