chime-experiment / coco

A Config Controller
https://chime-coco.readthedocs.io/
GNU General Public License v3.0

coco status and GPU node config mismatch not detected. #228

Closed andrerenard closed 3 years ago

andrerenard commented 3 years ago

From the Slack messages on March 12th:

We've been having some issues with the Pulsar system after updating the gains on beam 11 (the 12th beam). I'm not sure if this is related, but when I run coco status I see this for beam 10:

          gain_dir: /invalid
          kotekan_update_endpoint: json

However when I go to one of the nodes and see what kotekan reports I get:

"gain_dir": "/mnt/frb-archiver/daily_gain_solutions/Latest_PSR",
"kotekan_update_endpoint": "json"
},

And if I look at the default in the kotekan science config I see:

      kotekan_update_endpoint: json
      gain_dir: /mnt/frb-archiver/daily_gain_solutions/Latest_PSR

So a) I don't know where the /invalid is coming from, and b) why is coco reporting that when the nodes appear to believe something else? Shouldn't that be triggering a desync event? Might be related to: https://github.com/kotekan/kotekan/issues/935

nritsche commented 3 years ago

> We've been having some issues with the Pulsar system after updating the gains on beam 11 (the 12th beam). I'm not sure if this is related, but when I run coco status I see this for beam 10:
>
>           gain_dir: /invalid
>           kotekan_update_endpoint: json

/invalid is the default value for gain_dir in chime_keep_gpu_warm.yaml.

In this coco config, both the keep-warm and the normal science-run configs are loaded and stored in coco's state, so that we can switch between them with dedicated endpoints. The coco status endpoint returns the entire internal state (among other things), so its output contains both the normal cluster config and the keep-warm config. The /invalid you saw comes from the keep-warm part, not from the config the nodes are running. I see how that can be confusing; maybe we should change or replace the status endpoint. It was a direct replacement of a kotekan_master endpoint, and no more thought was put into it.
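To illustrate why the status output can look like a mismatch when there is none, here is a minimal Python sketch. The state layout, the key names `science_run` and `keep_warm`, and the helper `find_mismatches` are all made up for this example and are not coco's actual schema or API. The point is only that a node's report should be compared against the active config, not against everything held in the state.

```python
# Hypothetical layout: coco's real state keys and nesting may differ.
state = {
    "science_run": {
        "gain_dir": "/mnt/frb-archiver/daily_gain_solutions/Latest_PSR",
        "kotekan_update_endpoint": "json",
    },
    "keep_warm": {
        "gain_dir": "/invalid",
        "kotekan_update_endpoint": "json",
    },
}


def find_mismatches(state, active, node_report):
    """Return {key: (expected, reported)} for every key where the node's
    reported value differs from the active config block.

    `active` names the block of `state` that the nodes are supposed to run
    (an assumption of this sketch, not a coco concept).
    """
    expected = state[active]
    all_keys = expected.keys() | node_report.keys()
    return {
        key: (expected.get(key), node_report.get(key))
        for key in all_keys
        if expected.get(key) != node_report.get(key)
    }


# Values as reported by kotekan on one of the nodes (taken from the issue).
node_report = {
    "gain_dir": "/mnt/frb-archiver/daily_gain_solutions/Latest_PSR",
    "kotekan_update_endpoint": "json",
}

# Compared against the science-run block, the node is in sync.
print(find_mismatches(state, "science_run", node_report))

# Compared against the keep-warm block, gain_dir looks like a mismatch.
print(find_mismatches(state, "keep_warm", node_report))
```

In this sketch the first call reports no differences, while the second flags gain_dir, which is the apparent mismatch you get from reading the keep-warm part of the status output as if it were the running config.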

If you do find that the running config kept by coco differs from what a node reports, please re-open.