no warning when cluster template and actual configuration are no longer in sync

sparkvilla commented 6 years ago

This is my config file:

[setup/gbids]
provider=ansible
data_groups=nfs-server,data-storage

[cluster/gbids]
setup=gbids
data_nodes=1
ssh_to=data

# this is cloud-specific info (using OpenStack for the example)
cloud=openstack
# interface
network_ids=c86b320c-9542-4032-a951-c8a068894cc2,281ab00c-121d-4398-9637-362050e885bd
security_group=default
# Ubuntu 18.04
image_id=b95cc56c-c200-44c7-a68c-091c981c6b8a
# `login` info is -in theory- image-specific
login=ubuntu

[cluster/gbids/data]
flavor=1cpu-4ram-hpc

I defined one node kind called data, and the command elasticluster start gbids produces the expected result.

I then update the config file to include a new registry node. Here is the relevant part of the config file:

[setup/gbids]
provider=ansible
data_groups=nfs-server,data-storage
registry_groups=docker-registry

[cluster/gbids]
setup=gbids
data_nodes=1
registry_nodes=1
ssh_to=data

[cluster/gbids/data]
flavor=1cpu-4ram-hpc

[cluster/gbids/registry]
flavor=1cpu-4ram-hpc

and I run the start command again: elasticluster start gbids

I get this error:

2018-09-19 08:28:08 master001 gc3.elasticluster[7635] ERROR Could not start cluster `gbids`: 'registry'
2018-09-19 08:28:08 master001 gc3.elasticluster[7635] ERROR Error: 'registry'
Traceback (most recent call last):
  File "/home/ubuntu/elasticluster/src/elasticluster/__main__.py", line 195, in main
    return self.params.func()
  File "/home/ubuntu/elasticluster/src/elasticluster/subcommands.py", line 85, in __call__
    return self.execute()
  File "/home/ubuntu/elasticluster/src/elasticluster/subcommands.py", line 213, in execute
    cluster.start(min_nodes, self.params.max_concurrent_requests)
  File "/home/ubuntu/elasticluster/src/elasticluster/cluster.py", line 468, in start
    self._check_cluster_size(self._compute_min_nodes(min_nodes))
  File "/home/ubuntu/elasticluster/src/elasticluster/cluster.py", line 663, in _check_cluster_size
    available = len(self.nodes[kind])
KeyError: 'registry'
Aborting because of errors: 'registry'.

So it seems the start command does not listen for config file updates and/or does not point the user in the right direction, i.e. that by running elasticluster stop gbids first and then elasticluster start gbids again, elasticluster would pick up the updated config file.

riccardomurri commented 6 years ago

> So it seems the start command does not listen for config file update

This is so by design. The [cluster/*] sections in configuration files are cluster templates; elasticluster start instantiates a template into an actual cluster, and the actual cluster configuration is saved into the "storage directory" (~/.elasticluster/storage/ by default).
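
To make the template/actual split concrete, here is a minimal sketch that reads the node kinds from a template with Python's standard configparser and compares them with a saved cluster. The helper template_node_counts and the saved_nodes snapshot are hypothetical stand-ins for illustration; this is not how ElastiCluster itself loads or stores clusters.

```python
import configparser

# Abridged copy of the updated template from this issue.
TEMPLATE = """
[setup/gbids]
provider=ansible
data_groups=nfs-server,data-storage
registry_groups=docker-registry

[cluster/gbids]
setup=gbids
data_nodes=1
registry_nodes=1
ssh_to=data
"""


def template_node_counts(config_text, cluster_name):
    """Return {kind: count} from the `*_nodes` keys of a [cluster/NAME] section.

    Hypothetical helper, not part of ElastiCluster.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["cluster/" + cluster_name]
    suffix = "_nodes"
    return {
        key[: -len(suffix)]: int(value)
        for key, value in section.items()
        if key.endswith(suffix)
    }


# Assumed snapshot of the cluster as it was saved on the first `start`:
# only the `data` kind existed at that time.
saved_nodes = {"data": ["data001"]}

wanted = template_node_counts(TEMPLATE, "gbids")
missing = sorted(set(wanted) - set(saved_nodes))
print(missing)  # node kinds present in the template but not in the saved cluster
```

With the template above, the only kind reported as missing is registry, which is the key that shows up in the KeyError.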

There is no ElastiCluster command to update an actual cluster config from the template, but it is not clear how such an update would work in many cases. Suppose you changed the flavor for the "data" nodes: what should ElastiCluster do? Create only new nodes with the new flavor? Rebuild all existing nodes with the new flavor? And what if you removed a node type? I am not sure I can see a way to "do the right thing" that's consistent with all use cases; suggestions are very welcome!
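
The cases listed above can at least be told apart mechanically, even if the right reaction to each one is unclear. The sketch below only classifies the differences between two {kind: flavor} mappings; diff_cluster is a hypothetical helper, and the flavor 2cpu-8ram-hpc is a made-up value used to show a changed flavor.

```python
def diff_cluster(template, actual):
    """Classify differences between two {kind: flavor} mappings.

    Hypothetical helper: it reports what changed and deliberately does not
    decide what to do about it.
    """
    added = sorted(set(template) - set(actual))
    removed = sorted(set(actual) - set(template))
    changed = sorted(
        kind
        for kind in set(template) & set(actual)
        if template[kind] != actual[kind]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Template after editing: new `registry` kind, and a (made-up) new flavor for `data`.
template = {"data": "2cpu-8ram-hpc", "registry": "1cpu-4ram-hpc"}
# Cluster as it was actually started.
actual = {"data": "1cpu-4ram-hpc"}

report = diff_cluster(template, actual)
print(report)
```

Each of the three buckets corresponds to one of the open questions: an added kind, a removed kind, or an existing kind whose flavor no longer matches.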

> do not point the user in the right direction, i.e. by running elasticluster stop gbids first, and then again elasticluster start gbids, elasticluster would pick the updated config file.

Indeed, error reporting is still very poor. I'll update the issue title to reflect this, which is the only direction in which we can move forward with the current design.
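
As an illustration of what better error reporting might look like, the sketch below replaces the bare KeyError with a message that names the missing node kind and the stop/start workaround described in this thread. It is loosely modeled on the `available = len(self.nodes[kind])` line in the traceback; ClusterOutOfSyncError and check_cluster_size are hypothetical names, and this is not ElastiCluster's actual code.

```python
class ClusterOutOfSyncError(Exception):
    """Hypothetical error: the template names a node kind the saved cluster lacks."""


def check_cluster_size(nodes, min_nodes, cluster_name):
    """Check that every required node kind has enough nodes.

    `nodes` maps kind -> list of existing nodes (the saved cluster);
    `min_nodes` maps kind -> minimum count (taken from the template).
    """
    for kind, required in min_nodes.items():
        if kind not in nodes:
            # Explain the mismatch instead of letting a bare KeyError escape.
            raise ClusterOutOfSyncError(
                "Node kind '{kind}' is defined in the cluster template but not in "
                "the saved configuration of cluster '{name}'. The template was "
                "probably edited after the cluster was started. To apply the new "
                "template, run 'elasticluster stop {name}' and then "
                "'elasticluster start {name}'.".format(kind=kind, name=cluster_name)
            )
        available = len(nodes[kind])
        if available < required:
            raise ValueError(
                "Cluster '{0}' has {1} '{2}' node(s), but at least {3} are "
                "required.".format(cluster_name, available, kind, required)
            )


try:
    check_cluster_size({"data": ["data001"]}, {"data": 1, "registry": 1}, "gbids")
except ClusterOutOfSyncError as err:
    message = str(err)
    print(message)
```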

sparkvilla commented 6 years ago

I see the point. I think updating the error message to point the user in the right direction would be of great help. Thanks.

no warning when cluster template and actual configuration are no longer in sync #580