DUNE-DAQ / nanorc


10 seconds timeout for k8spm #257

Closed: plasorak closed this 3 months ago

plasorak commented 3 months ago

To test this, get the latest nightly and check out the production/v4 branch of listrev. Then:

$ cat >lr.json <<EOL
{
    "boot":{
        "k8s_image": "ghcr.io/dune-daq/alma9-run:develop",
        "process_manager": "k8s",
        "ers_impl":"cern",
        "opmon_impl":"cern",
        "use_connectivity_service": false,
        "start_connectivity_service":false
    }
}
EOL

$ listrev_gen -c lr.json lr
$ scale_listrev_app --num-apps 100 lr
$ nanorc --pm k8s://np04-srv-016:31000 lr session-name
# ... start the run etc.

It also passed the minimal_system_quick_test integration test.

plasorak commented 3 months ago

It was pointed out that this creates "listrev application bombs" across the whole cluster, so before starting the run, make sure to restrict the apps to nodes that won't interfere with data taking. Before executing scale_listrev_app, add the following to your boot.json:

{
    "apps": {
        "listrev-app-s-0": {
            "...",
            "node-selection": [
                {
                    "kubernetes.io/hostname": [
                        "np02-srv-001",
                        "np02-srv-003",
                        "np02-srv-004",
                        "np04-srv-011",
                        "np04-srv-012",
                        "np04-srv-013",
                        "np04-srv-015",
                        "np04-srv-018",
                        "np04-srv-019",
                        "np04-srv-024",
                        "np04-srv-031"
                    ],
                    "strict": true
                }
            ]
        }
    }
}

After running scale_listrev_app, make sure that boot.json contains this snippet for every app.
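Checking 100 apps by hand is error-prone, so the per-app check could be scripted. The following is only a minimal sketch, not part of nanorc or listrev: the helper names `add_node_selection` and `patch_boot_file` are hypothetical, and it assumes that boot.json keeps its apps under a top-level "apps" key, as in the snippet above. The host list is copied from that snippet and is np04-specific.

```python
import copy
import json
from pathlib import Path

# Host list copied from the snippet above. It is np04-specific and will
# likely change with data-taking conditions, so treat it as a placeholder.
NP04_TEST_HOSTS = [
    "np02-srv-001", "np02-srv-003", "np02-srv-004",
    "np04-srv-011", "np04-srv-012", "np04-srv-013",
    "np04-srv-015", "np04-srv-018", "np04-srv-019",
    "np04-srv-024", "np04-srv-031",
]


def add_node_selection(boot, hosts, strict=True):
    """Set the same node-selection block on every app of a boot config dict.

    Assumes the apps live under the top-level "apps" key (hypothetical
    helper, based only on the JSON structure shown in this thread).
    """
    selection = [{"kubernetes.io/hostname": list(hosts), "strict": strict}]
    for app in boot.get("apps", {}).values():
        # Deep copy so that the apps do not share one list object.
        app["node-selection"] = copy.deepcopy(selection)
    return boot


def patch_boot_file(path, hosts=NP04_TEST_HOSTS):
    """Read a boot.json file, patch every app, and write it back in place."""
    path = Path(path)
    boot = json.loads(path.read_text())
    add_node_selection(boot, hosts)
    path.write_text(json.dumps(boot, indent=4) + "\n")
```

Assuming the generated configuration directory is `lr` and boot.json sits directly inside it, this would be called as `patch_boot_file("lr/boot.json")` after scale_listrev_app has run. Passing the host list as an argument keeps the np04-specific names out of the logic, since the list is expected to change.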

TiagoTAlves commented 3 months ago

Should we just add it to the script? @plasorak

plasorak commented 3 months ago

I'd say no, as this snippet is np04-specific and the list will likely change with the data-taking conditions. It is meant for testing at np04 right now, and it will likely not be valid if and when we need to run these tests again later on.