allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

PatchJsonArgParse: some arguments are parsed but later deleted #1138

Closed jhaux closed 11 months ago

jhaux commented 11 months ago

Problem:

When using LightningCLI, i.e. jsonargparse, some arguments cannot be found in the task's Configuration section or when calling task.get_parameters(). This makes running tasks remotely impossible.

This seems to be caused by this line of code in PatchJsonArgParse._update_task_args: https://github.com/allegroai/clearml/blob/8a834af777d7c4d1541573158d627c9d39f5c7c5/clearml/binding/jsonargs_bind.py#L82C14-L82C14

Solution

After I delete this line, everything works fine and tasks can be run remotely, which is the expected behavior.

Reproducibility

See comment below: https://github.com/allegroai/clearml/issues/1138#issuecomment-1766109334

jhaux commented 11 months ago

@alex-burlacu-clear-ml, it seems you introduced the code back in June.

After digging into it a little bit, I feel it might have to do with multiple config files being stacked on top of each other, but I might be wrong.

See my comment below: I think the problem lies with too generously deleting config parameters. I do not fully understand the criteria for deleting parameters, so I cannot propose a solution as of now. What do you think?
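To make concrete what I mean by "too generously deleting", here is a purely illustrative sketch (this is NOT ClearML's actual code, just my suspicion): any parsed argument whose key also appears in a linked config file seems to get dropped from the tracked task parameters.

```python
# Purely illustrative; NOT ClearML's actual implementation.
# Suspected behaviour: parsed arguments whose keys also come from a
# linked config file are dropped from the task's tracked parameters.

parsed_args = {
    "fit.trainer.max_epochs": "10",  # supplied via the YAML config
    "fit.trainer.max_steps": "-1",   # jsonargparse default
}
config_keys = {"fit.trainer.max_epochs"}  # keys found in the config file

# Keep only arguments that did not originate from the config file
tracked = {k: v for k, v in parsed_args.items() if k not in config_keys}
print(tracked)  # only the non-config argument survives
```

If this is roughly what happens, it would explain why exactly the config-supplied values are the ones missing from task.get_parameters().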

jhaux commented 11 months ago

Concerning reproducibility: I ran the jsonargparse example examples/frameworks/jsonargparse/pytorch_lightning_cli.py and could reproduce the observed behaviour: arguments supplied via a config file are not captured.

Steps

1. Setup
# clone the clearml repo
python -m venv venv
source venv/bin/activate
pip install clearml
pip install "lightning[pytorch-extra]"

Output of pip freeze:

aiohttp==3.8.6
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==23.1.0
certifi==2023.7.22
charset-normalizer==3.3.0
clearml==1.13.1
contourpy==1.1.1
cycler==0.12.1
docstring-parser==0.15
filelock==3.12.4
fonttools==4.43.1
frozenlist==1.4.0
fsspec==2023.9.2
furl==2.1.3
hydra-core==1.3.2
idna==3.4
importlib-resources==6.1.0
Jinja2==3.1.2
jsonargparse==4.25.0
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
kiwisolver==1.4.5
lightning==2.1.0
lightning-utilities==0.9.0
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.0
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
networkx==3.1
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.2.140
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
orderedmultidict==1.0.1
packaging==23.2
pathlib2==2.3.7.post1
Pillow==10.1.0
protobuf==4.24.4
psutil==5.9.6
Pygments==2.16.1
PyJWT==2.4.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytorch-lightning==2.1.0
PyYAML==6.0.1
referencing==0.30.2
requests==2.31.0
rich==13.6.0
rpds-py==0.10.6
six==1.16.0
sympy==1.12
tensorboardX==2.6.2.2
torch==2.1.0
torchmetrics==1.2.0
tqdm==4.66.1
triton==2.1.0
typeshed-client==2.4.0
typing_extensions==4.8.0
urllib3==2.0.6
yarl==1.9.2
zipp==3.17.0
python --version
Python 3.9.12
2. Running the command
cd clearml/examples/frameworks/jsonargparse
python pytorch_lightning_cli.py fit -c pytorch_lightning_cli.yml

This runs locally as expected for 10 epochs, as set in pytorch_lightning_cli.yml, but if you check the task's parameters, the argument trainer.max_epochs is not set!

Output of print(json.dumps(task.get_parameters(), sort_keys=True, indent=2)):

{
  "Args/fit.ckpt_path": "",
  "Args/fit.config": "[{\"path\": \"src/pytorch_lightning_cli.yml\", \"mode\": \"fsr\", \"cwd\": null, \"skip_check\": false}]",
  "Args/fit.model.learning_rate": "0.02",
  "Args/fit.model.out_dim": "10",
  "Args/fit.seed_everything": "True",
  "Args/fit.trainer.accelerator": "auto",
  "Args/fit.trainer.accumulate_grad_batches": "1",
  "Args/fit.trainer.barebones": "False",
  "Args/fit.trainer.benchmark": "",
  "Args/fit.trainer.callbacks": "[{\"class_path\": \"lightning.pytorch.callbacks.LearningRateMonitor\", \"init_args\": {\"logging_interval\": \"epoch\", \"log_momentum\": false, \"log_weight_decay\": false}}, {\"class_path\": \"lightning.pytorch.callbacks.ModelCheckpoint\", \"init_args\": {\"dirpath\": null, \"filename\": \"best\", \"monitor\": \"train_loss\", \"verbose\": false, \"save_last\": false, \"save_top_k\": 1, \"save_weights_only\": false, \"mode\": \"min\", \"auto_insert_metric_name\": true, \"every_n_train_steps\": null, \"train_time_interval\": null, \"every_n_epochs\": null, \"save_on_train_epoch_end\": null, \"enable_version_counter\": true}}]",
  "Args/fit.trainer.check_val_every_n_epoch": "1",
  "Args/fit.trainer.default_root_dir": "",
  "Args/fit.trainer.detect_anomaly": "False",
  "Args/fit.trainer.deterministic": "",
  "Args/fit.trainer.devices": "auto",
  "Args/fit.trainer.enable_checkpointing": "",
  "Args/fit.trainer.enable_model_summary": "",
  "Args/fit.trainer.enable_progress_bar": "",
  "Args/fit.trainer.fast_dev_run": "False",
  "Args/fit.trainer.gradient_clip_algorithm": "",
  "Args/fit.trainer.gradient_clip_val": "",
  "Args/fit.trainer.inference_mode": "True",
  "Args/fit.trainer.limit_predict_batches": "",
  "Args/fit.trainer.limit_test_batches": "",
  "Args/fit.trainer.limit_train_batches": "",
  "Args/fit.trainer.limit_val_batches": "",
  "Args/fit.trainer.log_every_n_steps": "",
  "Args/fit.trainer.logger": "",
  "Args/fit.trainer.max_steps": "-1",
  "Args/fit.trainer.max_time": "",
  "Args/fit.trainer.min_epochs": "",
  "Args/fit.trainer.min_steps": "",
  "Args/fit.trainer.num_nodes": "1",
  "Args/fit.trainer.num_sanity_val_steps": "",
  "Args/fit.trainer.overfit_batches": "0.0",
  "Args/fit.trainer.plugins": "",
  "Args/fit.trainer.precision": "",
  "Args/fit.trainer.profiler": "",
  "Args/fit.trainer.reload_dataloaders_every_n_epochs": "0",
  "Args/fit.trainer.strategy": "auto",
  "Args/fit.trainer.sync_batchnorm": "False",
  "Args/fit.trainer.use_distributed_sampler": "True",
  "Args/fit.trainer.val_check_interval": "",
  "Args/subcommand": "fit"
}

Thus, when running the experiment remotely or cloning the task, the experiment runs for more than 10 epochs.

Note, however, that while trainer.max_epochs is missing, the callbacks, which are also set in the config, are tracked.

3. Running the command with no config supplied

To further validate that this is related to supplying a config file, I ran the command without one.

python src/pytorch_lightning_cli.py fit

Now the output of print(json.dumps(task.get_parameters(), sort_keys=True, indent=2)) contains the value for trainer.max_epochs:

{
  "Args/config": "",
  "Args/fit.ckpt_path": "",
  "Args/fit.config": "",
  "Args/fit.model.learning_rate": "0.02",
  "Args/fit.model.out_dim": "10",
  "Args/fit.seed_everything": "True",
  "Args/fit.trainer.accelerator": "auto",
  "Args/fit.trainer.accumulate_grad_batches": "1",
  "Args/fit.trainer.barebones": "False",
  "Args/fit.trainer.benchmark": "",
  "Args/fit.trainer.callbacks": "",
  "Args/fit.trainer.check_val_every_n_epoch": "1",
  "Args/fit.trainer.default_root_dir": "",
  "Args/fit.trainer.detect_anomaly": "False",
  "Args/fit.trainer.deterministic": "",
  "Args/fit.trainer.devices": "auto",
  "Args/fit.trainer.enable_checkpointing": "",
  "Args/fit.trainer.enable_model_summary": "",
  "Args/fit.trainer.enable_progress_bar": "",
  "Args/fit.trainer.fast_dev_run": "False",
  "Args/fit.trainer.gradient_clip_algorithm": "",
  "Args/fit.trainer.gradient_clip_val": "",
  "Args/fit.trainer.inference_mode": "True",
  "Args/fit.trainer.limit_predict_batches": "",
  "Args/fit.trainer.limit_test_batches": "",
  "Args/fit.trainer.limit_train_batches": "",
  "Args/fit.trainer.limit_val_batches": "",
  "Args/fit.trainer.log_every_n_steps": "",
  "Args/fit.trainer.logger": "",
  "Args/fit.trainer.max_epochs": "",
  "Args/fit.trainer.max_steps": "-1",
  "Args/fit.trainer.max_time": "",
  "Args/fit.trainer.min_epochs": "",
  "Args/fit.trainer.min_steps": "",
  "Args/fit.trainer.num_nodes": "1",
  "Args/fit.trainer.num_sanity_val_steps": "",
  "Args/fit.trainer.overfit_batches": "0.0",
  "Args/fit.trainer.plugins": "",
  "Args/fit.trainer.precision": "",
  "Args/fit.trainer.profiler": "",
  "Args/fit.trainer.reload_dataloaders_every_n_epochs": "0",
  "Args/fit.trainer.strategy": "auto",
  "Args/fit.trainer.sync_batchnorm": "False",
  "Args/fit.trainer.use_distributed_sampler": "True",
  "Args/fit.trainer.val_check_interval": "",
  "Args/subcommand": "fit"
}
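Rather than eyeballing the two dumps above, the difference can be computed programmatically. A small stdlib-only helper (not part of ClearML, just illustrative) that diffs two task.get_parameters() JSON dumps by key set and value:

```python
import json


def diff_params(dump_a: str, dump_b: str) -> dict:
    """Compare two task.get_parameters() JSON dumps: keys unique to each, and changed values."""
    a, b = json.loads(dump_a), json.loads(dump_b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


# Abridged example with the key that matters here:
with_config = '{"Args/fit.trainer.max_steps": "-1"}'
without_config = '{"Args/fit.trainer.max_steps": "-1", "Args/fit.trainer.max_epochs": ""}'
result = diff_params(with_config, without_config)
print(result["only_in_b"])
```

Running this on the full dumps confirms that Args/fit.trainer.max_epochs is the key present only in the no-config run.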

Modifications

To produce the above outputs (and because black automatically reformatted the file), I applied the following changes:

diff --git a/examples/frameworks/jsonargparse/pytorch_lightning_cli.py b/examples/frameworks/jsonargparse/pytorch_lightning_cli.py
index 31142b6..c1eaa71 100644
--- a/examples/frameworks/jsonargparse/pytorch_lightning_cli.py
+++ b/examples/frameworks/jsonargparse/pytorch_lightning_cli.py
@@ -1,14 +1,18 @@
+import json
+
 try:
     from lightning.pytorch.cli import LightningCLI
-    from lightning.pytorch.demos.boring_classes import DemoModel, BoringDataModule
+    from lightning.pytorch.demos.boring_classes import (BoringDataModule,
+                                                        DemoModel)
 except ImportError:
     import sys
+
     print("Module 'lightning' not installed (only available for Python 3.8+)")
     sys.exit(0)
 from clearml import Task

-
 if __name__ == "__main__":
     Task.add_requirements("requirements.txt")
-    Task.init(project_name="example", task_name="pytorch_lightning_jsonargparse")
+    task = Task.init(project_name="example", task_name="pytorch_lightning_jsonargparse")
     LightningCLI(DemoModel, BoringDataModule)
+    print(json.dumps(task.get_parameters(), indent=2, sort_keys=True))

eugen-ajechiloae-clearml commented 11 months ago

Hi @jhaux ! This is actually not a bug. The jsonargparse binding works as follows: if a config file is specified, the parameters that come from the config are removed from the UI (they won't appear under the Configuration section). When running remotely, those parameters are fetched from your config file instead. If you add them manually to the Configuration section (which you could try out), those values will be used instead, overwriting the values from the config. This way, changes to the config file are reflected in the execution, while parameters added manually in the UI still take precedence.
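The precedence described above can be illustrated with a plain dict merge (purely illustrative; the names are made up and this is not ClearML's code):

```python
# Purely illustrative of the precedence described above; NOT ClearML's code.
config_values = {"fit.trainer.max_epochs": "10"}  # read from the YAML config at runtime
ui_overrides = {"fit.trainer.max_epochs": "25"}   # added manually in the Configuration section
effective = {**config_values, **ui_overrides}     # UI values win over config values
print(effective["fit.trainer.max_epochs"])
```

If nothing is added manually in the UI, ui_overrides is empty and the config values are used unchanged, which is why the config-derived keys need not be tracked in the task parameters.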

Also, in order to run the task remotely, make sure the config file is actually available, for example by adding it to the same repository as your script, so that both are pulled when the repo is cloned.

jhaux commented 11 months ago

Oh! That makes a lot of sense! Thank you for pointing it out. I'll give it a try.

jhaux commented 11 months ago

I can confirm that this works as intended! Thanks for the help @eugen-ajechiloae-clearml!

This issue can be closed. Should I do that?

Do you think it would make sense to open a PR to add the following changes?

diff --git a/examples/frameworks/jsonargparse/pytorch_lightning_cli.py b/examples/frameworks/jsonargparse/pytorch_lightning_cli.py
index 31142b6..70a7108 100644
--- a/examples/frameworks/jsonargparse/pytorch_lightning_cli.py
+++ b/examples/frameworks/jsonargparse/pytorch_lightning_cli.py
@@ -1,14 +1,33 @@
+import json
+
 try:
     from lightning.pytorch.cli import LightningCLI
-    from lightning.pytorch.demos.boring_classes import DemoModel, BoringDataModule
+    from lightning.pytorch.demos.boring_classes import (BoringDataModule,
+                                                        DemoModel)
 except ImportError:
     import sys
+
     print("Module 'lightning' not installed (only available for Python 3.8+)")
     sys.exit(0)
 from clearml import Task

+class MyLightningCLI(LightningCLI):
+    def __init__(self, *args, **kwargs):
+        Task.add_requirements("requirements.txt")
+        self.task = Task.init(
+            project_name="example", task_name="pytorch_lightning_jsonargparse"
+        )
+        super().__init__(*args, **kwargs)
+
+    def before_instantiate_classes(self):
+        self.task.execute_remotely(queue_name="titanx_no_reporeqs", exit_process=True)
+
+        print(
+            "after\n\n",
+            json.dumps(self.task.get_parameters(), indent=2, sort_keys=True),
+        )
+
+
 if __name__ == "__main__":
-    Task.add_requirements("requirements.txt")
-    Task.init(project_name="example", task_name="pytorch_lightning_jsonargparse")
-    LightningCLI(DemoModel, BoringDataModule)
+    MyLightningCLI(DemoModel, BoringDataModule)
diff --git a/examples/frameworks/jsonargparse/requirements.txt b/examples/frameworks/jsonargparse/requirements.txt
index 6b54e31..a064354 100644
--- a/examples/frameworks/jsonargparse/requirements.txt
+++ b/examples/frameworks/jsonargparse/requirements.txt
@@ -1,3 +1,4 @@
+clearml
 jsonargparse
 pytorch_lightning
 torch

eugen-ajechiloae-clearml commented 11 months ago

Hi @jhaux ! I'd usually not want something like self.task.execute_remotely(queue_name="titanx_no_reporeqs", exit_process=True) in an example that is not explicitly made for remote execution, but you could open a PR with a new example instead; that would be much appreciated.