Netflix / dynomite-manager

A sidecar to manage Dynomite clusters
https://github.com/Netflix/dynomite
Apache License 2.0

DM is not replacing an instance properly when an ASG kills a machine #72

Closed diegopacheco closed 7 years ago

diegopacheco commented 7 years ago

@ipapapa

I created a Dynomite-Manager / Dynomite cluster using the following versions:

- Redis: 3.0.7
- Dynomite: 0.5.8-5
- DM: v1.0.1 (from the tag in GitHub)

I created the cluster in AWS us-west-2 with 3 ASGs, one ASG per AZ, like this:

- node1 | us-west-2c
- node2 | us-west-2a
- node3 | us-west-2b

There is only one instance per AZ: MIN: 1, MAX: 1, DESIRED: 1.

It all works great: I have a 3-node cluster, all working fine. All 3 nodes use this token: 1383429731.

Now I go to the AWS EC2 console and TERMINATE node3, for example.

The ASG creates a new DM / Dynomite instance, but with THE OLD and WRONG IP, pointing to the old machine that just died.

DM LOG

2017-03-29 21:38:34 INFO  InstanceIdentity:175 - Single Account cluster                                                                                     
2017-03-29 21:38:49 INFO  InstanceIdentity:184 - Found dead instances: i-0923175c770e04f7b                                                                  
2017-03-29 21:38:49 INFO  InstanceIdentity:194 - Trying to grab slot 1383429731 with availability zone us-west-2a                                           
2017-03-29 21:38:49 WARN  DynomiteManagerConfiguration:591 - NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.                                       
2017-03-29 21:38:49 INFO  SystemUtils:54 - Calling URL API: http://169.254.169.254/latest/meta-data/placement/availability-zone returns: us-west-2a         
2017-03-29 21:38:49 INFO  InstanceDataDAOCassandra:119 - *** Creating New Instance Entry ***                                                                
2017-03-29 21:38:49 INFO  InstanceDataDAOCassandra:123 - Key already exists: 1pnotify-1p-d0perf-r0-v11-dm_1pnotify-1p-d0perf-r0-dm-us-west-2a-v11_1383429731
2017-03-29 21:38:49 WARN  DynomiteManagerConfiguration:591 - NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.                                       
2017-03-29 21:38:49 INFO  InstanceIdentity:132 - My token: 1383429731                                                                                       
2017-03-29 21:38:49 INFO  DynomiteManagerServer:106 - Initializing Dynomite Manager now ...                                                                 
2017-03-29 21:38:49 INFO  SystemUtils:54 - Calling URL API: http://169.254.169.254/latest/meta-data/placement/availability-zone returns: us-west-2a         
2017-03-29 21:38:49 WARN  DynomiteManagerConfiguration:591 - NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.                                       
2017-03-29 21:38:49 INFO  AWSMembership:339 - Fetch current permissions for vpc env of running instance                                                     
2017-03-29 21:38:49 WARN  DynomiteManagerConfiguration:591 - NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.                                       
2017-03-29 21:38:49 WARN  DynomiteManagerConfiguration:591 - NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.                                       
2017-03-29 21:38:49 INFO  DynomiteManagerServer:116 - Sleeping 159seconds -> a node is replaced or token is pregenerated.                                   
2017-03-29 21:41:28 WARN  DynomiteManagerConfiguration:591 - NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.                                       
2017-03-29 21:41:28 INFO  DynomiteManagerServer:131 - Running TuneTask and updating configuration.                                                          
2017-03-29 21:41:28 INFO  SystemUtils:54 - Calling URL API: http://169.254.169.254/latest/meta-data/placement/availability-zone returns: us-west-2a         
2017-03-29 21:41:28 WARN  DynomiteManagerConfiguration:591 - NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.                                       
2017-03-29 21:41:28 INFO  DynomiteStandardTuner:123 - YAML Dump:                                                                                            
2017-03-29 21:41:28 INFO  DynomiteStandardTuner:124 - dyn_o_mite:                                                                                           
  dyn_listen: 0.0.0.0:8101                                                                                                                                  
  data_store: 0                                                                                                                                             
  listen: 0.0.0.0:8102                                                                                                                                      
  dyn_seed_provider: florida_provider                                                                                                                       
  servers:                                                                                                                                                  
  - 127.0.0.1:22122:1                                                                                                                                       
  tokens: '1383429731'                                                                                                                                      
  auto_eject_hosts: true                                                                                                                                    
  rack: my-microservice-1p-d0perf-r0-dm-us-west-2a-v11                                                                                                             
  distribution: vnode                                                                                                                                       
  gos_interval: 10000                                                                                                                                       
  hash: murmur                                                                                                                                              
  preconnect: true                                                                                                                                          
  server_retry_timeout: 30000                                                                                                                               
  timeout: 5000                                                                                                                                             
  secure_server_option: datacenter                                                                                                                          
  datacenter: us-west-2                                                                                                                                     
  read_consistency: DC_ONE                                                                                                                                  
  write_consistency: DC_ONE                                                                                                                                 
  pem_key_file: /apps/dynomite/conf/dynomite.pem                                                                                                            
  dyn_seeds:                                                                                                                                                
  - node1:8101:my-microservice-1p-d0perf-r0-dm-us-west-2c-v11:us-west-2:1383429731                                     
  - node2:8101:my-microservice-1p-d0perf-r0-dm-us-west-2b-v11:us-west-2:1383429731
  - node3:8101:my-microservice-1p-d0perf-r0-dm-us-west-2a-v11:us-west-2:1383429731  

Where:

  - node3:8101:my-microservice-1p-d0perf-r0-dm-us-west-2a-v11:us-west-2:1383429731  

This entry is already wrong: it should point to the new instance's IP, but it still points to the old one.
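To make the seed entries easier to compare across nodes, here is a minimal sketch that splits one `dyn_seeds` line into its parts. The field order (host, port, rack, datacenter, token) is read off the YAML dump above; `SeedEntry` is a hypothetical helper and not part of Dynomite Manager.

```java
/** Hypothetical value class for one dyn_seeds entry; not part of Dynomite Manager. */
final class SeedEntry {
    final String host;
    final int port;
    final String rack;
    final String datacenter;
    final String token;

    private SeedEntry(String host, int port, String rack, String datacenter, String token) {
        this.host = host;
        this.port = port;
        this.rack = rack;
        this.datacenter = datacenter;
        this.token = token;
    }

    /** Parses "host:port:rack:datacenter:token" as it appears in the YAML dump. */
    static SeedEntry parse(String entry) {
        String[] parts = entry.trim().split(":");
        if (parts.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 colon-separated fields, got " + parts.length + ": " + entry);
        }
        return new SeedEntry(parts[0], Integer.parseInt(parts[1]), parts[2], parts[3], parts[4]);
    }

    public static void main(String[] args) {
        SeedEntry seed = SeedEntry.parse(
                "node3:8101:my-microservice-1p-d0perf-r0-dm-us-west-2a-v11:us-west-2:1383429731");
        // The host field is the one that should change after the ASG replaces the machine.
        System.out.println(seed.host + " rack=" + seed.rack + " token=" + seed.token);
    }
}
```

Comparing the `host` field before and after the ASG replacement shows directly whether the seed list picked up the new machine.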

Cheers, Diego Pacheco

ipapapa commented 7 years ago

btw, can you make the rack names something simpler, like rack: dyno_xxx--useast1e?

diegopacheco commented 7 years ago

@ipapapa I will do more debugging and get back to you.

diegopacheco commented 7 years ago

@ipapapa

Some debugging info

2a

2017-04-10 21:43:21 INFO  InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2a-v72_1383429731
2017-04-10 21:43:21 INFO  InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2a-v72, id: 1383429731

2c

2017-04-10 21:43:52 INFO  InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2c-v72_1383429731
2017-04-10 21:43:52 INFO  InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2c-v72, id: 1383429731
2017-04-10 21:43:52 INFO  InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2b-v72
2017-04-10 21:43:52 INFO  InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2a-v72

2b

2017-04-10 21:43:41 INFO  InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2b-v72_1383429731
2017-04-10 21:43:41 INFO  InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2b-v72, id: 1383429731
2017-04-10 21:43:41 INFO  InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2a-v72

KILL instance on AWS console (2b)

2b(new - instance)

2017-04-10 22:38:57 INFO  InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2b-v72_1383429731
2017-04-10 22:38:57 INFO  InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2b-v72, id: 1383429731
2017-04-10 22:38:57 INFO  InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2a-v72
2017-04-10 22:38:57 INFO  InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2b-v72

DM used the same token info as the previous instance and did not update the data in CASS, so the generated seeds are wrong.

if (getInstance(instance.getApp(), instance.getRack(), instance.getId()) != null)
    return;

So no lock was acquired and data was not updated in CASS.

diegopacheco commented 7 years ago

@ipapapa

I did more debugging.

So we HAVE:

- NETFLIX_APP: I think the problem could be here.
- RACK: looks right to me.
- SLOT: looks right to me.

I tried:

NETFLIX_APP == ASG_NAME+UUID
============================

2017-04-11 00:13:43 ERROR AWSMembership:242 - unable to get group-id for group-name=1pdecorator-1p-d0perf-v73-dm_59e9acaf-4979-41cf-9c18-3447aada23bc vpc-id=vpc-a8234ecd
2017-04-11 00:13:43 ERROR Task:86 - Could not execute the task because of The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: c28bddc9-2baf-4534-8df7-d2356ad33b4f)
com.amazonaws.AmazonServiceException: The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: c28bddc9-2baf-4534-8df7-d2356ad33b4f)
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1383)
        at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
        at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
        at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
        at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:11128)
        at com.amazonaws.services.ec2.AmazonEC2Client.authorizeSecurityGroupIngress(AmazonEC2Client.java:1019)
        at com.netflix.dynomitemanager.sidecore.aws.AWSMembership.addACL(AWSMembership.java:201)
        at com.netflix.dynomitemanager.sidecore.aws.UpdateSecuritySettings.execute(UpdateSecuritySettings.java:65)
        at com.netflix.dynomitemanager.sidecore.scheduler.Task.execute(Task.java:82)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:199)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:546)

NETFLIX_APP == ASG_NAME
=======================

2017-04-11 00:31:52 ERROR AWSMembership:242 - unable to get group-id for group-name=1pdecorator-1p-d0perf-dm-us-west-2b-v75 vpc-id=vpc-a8234ecd
2017-04-11 00:31:52 ERROR Task:86 - Could not execute the task because of The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: ac7f3d34-9014-44e5-b49b-ebd973301616)
com.amazonaws.AmazonServiceException: The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: ac7f3d34-9014-44e5-b49b-ebd973301616)

NETFLIX_APP == SG_NAME
======================

It works, but then IF an instance dies and the ASG replaces the old instance with a new one, we are back to the issue because the data is already in CASS.

Any thoughts @ipapapa ?

ipapapa commented 7 years ago

According to the changes @akbarahmed made a while ago in the configuration, you must not use NETFLIX_APP but DM_DYNOMITE_CLUSTER_NAME. You should have seen a WARN in your logs for this.

https://github.com/Netflix/dynomite-manager/blob/dev/dynomitemanager/src/main/java/com/netflix/dynomitemanager/defaultimpl/DynomiteManagerConfiguration.java#L586-L593
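As a rough illustration of the deprecation being described, the sketch below resolves the cluster name from two environment variables and warns when only the old one is set. This is not the actual `DynomiteManagerConfiguration` code; the class name `ClusterNameResolver` and the precedence (new variable wins over the old one) are assumptions made for the example. Only the two variable names and the warning text come from the logs in this thread.

```java
import java.util.Map;

/** Hypothetical helper; the precedence shown here is an assumption, not DM's real logic. */
final class ClusterNameResolver {
    static final String NEW_KEY = "DM_DYNOMITE_CLUSTER_NAME";
    static final String OLD_KEY = "NETFLIX_APP";

    /** Returns the cluster name, preferring NEW_KEY; returns null if neither variable is set. */
    static String resolve(Map<String, String> env) {
        String name = env.get(NEW_KEY);
        if (name != null && !name.isEmpty()) {
            return name;
        }
        String legacy = env.get(OLD_KEY);
        if (legacy != null && !legacy.isEmpty()) {
            // Same wording as the WARN line seen in the DM log.
            System.err.println("WARN NETFLIX_APP is deprecated. Use DM_DYNOMITE_CLUSTER_NAME.");
            return legacy;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("Cluster name: " + resolve(System.getenv()));
    }
}
```

Either variable yields the same cluster name in this sketch, so switching between them would not by itself change the row key stored in Cassandra.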

diegopacheco commented 7 years ago

On the logging: I made these changes in order to debug.

I hope that, with the code in hand, the log now makes more sense.

Let me know otherwise @ipapapa

    public AppsInstance getInstance(String app, String rack, int id) {
        logger.info("getInstance() app: {}, rack:{}, id: {} ", new Object[]{app,rack,id});
        Set<AppsInstance> set = getAllInstances(app);
        for (AppsInstance ins : set) {
            logger.info("getInstance() INS ID:{}, INS RACK:{} ", new Object[]{ins.getId(),ins.getRack()});
            if (ins.getId() == id && ins.getRack().equals(rack))
                return ins;
        }
        return null;
    }

    public void createInstanceEntry(AppsInstance instance) throws Exception {
        logger.info("*** Creating New Instance Entry ***");
        String key = getRowKey(instance);

        logger.info("KEY fronm CASS: {}",new Object[]{key});
        if (getInstance(instance.getApp(), instance.getRack(), instance.getId()) != null)
            return;

        getLock(instance);

        try {
            MutationBatch m = bootKeyspace.prepareMutationBatch();
            ColumnListMutation<String> clm = m.withRow(CF_TOKENS, key);
            clm.putColumn(CN_ID, Integer.toString(instance.getId()), null);
            clm.putColumn(CN_APPID, instance.getApp(), null);
            clm.putColumn(CN_AZ, instance.getZone(), null);
            clm.putColumn(CN_DC, config.getRack(), null);
            clm.putColumn(CN_INSTANCEID, instance.getInstanceId(), null);
            clm.putColumn(CN_HOSTNAME, instance.getHostName(), null);
            clm.putColumn(CN_EIP, instance.getHostIP(), null);
            clm.putColumn(CN_TOKEN, instance.getToken(), null);
            clm.putColumn(CN_LOCATION, instance.getDatacenter(), null);
            clm.putColumn(CN_UPDATETIME, TimeUUIDUtils.getUniqueTimeUUIDinMicros(), null);
            Map<String, Object> volumes = instance.getVolumes();
            if (volumes != null) {
                for (String path : volumes.keySet()) {
                    clm.putColumn(CN_VOLUME_PREFIX + "_" + path, volumes.get(path).toString(),
                            null);
                }
            }
            m.execute();
            logger.info(String.format("Key %s INSERTED on CASS", key));
        } catch (Exception e) {
            logger.info(e.getMessage());
        } finally {
            releaseLock(instance);
        }
    }

diegopacheco commented 7 years ago

@ipapapa @akbarahmed I also switched to DM_DYNOMITE_CLUSTER_NAME instead of NETFLIX_APP and got the same error.

ipapapa commented 7 years ago

@diegopacheco that is not helpful though. Did you see a warning when you used NETFLIX_APP or not? The logs should have all the information.

ipapapa commented 7 years ago

Check the following though:

  1. Why is the INS ID:1383429731 the token number? This is the token. The instance ID should be something like i-abcd....
  2. Why is the rack name so complicated? Could it be simpler?

The way you log is not useful for debugging, because you are not checking the if statement for exit, and the names that you provide are so long that they make it hard to debug. You probably need to use some pattern... I explained above some ideas for rack names: dyno_xxx--useast1e. It is pretty obvious: cluster_name + ASG.

diegopacheco commented 7 years ago

@ipapapa

I switched to DM_DYNOMITE_CLUSTER_NAME; it does not make a difference in this case.

Regarding the ID: if you check the method getInstance(String app, String rack, int id), id is an int, so it could never be a string like i-abcd...

diegopacheco commented 7 years ago

@ipapapa

Key inside Cassandra: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2b-v72_1383429731 where:

We have this pattern: DM_DYNOMITE_CLUSTER_NAME + RACK + SLOT

- myService-1p-d0perf-v72-dm (NETFLIX_APP/DM_DYNOMITE_CLUSTER_NAME == Security_Group_name)
- myService-1p-d0perf-dm-us-west-2b-v72 (rack == ASG_NAME)
- 1383429731 (token slot)
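The key pattern can be reproduced with a few lines of Java. This is only a sketch inferred from the key shown in this comment (three parts joined by underscores); `RowKeys.rowKey` is a hypothetical helper, and the real `getRowKey` in Dynomite Manager may be implemented differently.

```java
/** Hypothetical helper that rebuilds the key pattern seen in the logs. */
final class RowKeys {
    /** Joins cluster name, rack and slot with underscores. */
    static String rowKey(String clusterName, String rack, int slot) {
        return clusterName + "_" + rack + "_" + slot;
    }

    public static void main(String[] args) {
        String cluster = "myService-1p-d0perf-v72-dm";
        String rack = "myService-1p-d0perf-dm-us-west-2b-v72";
        int slot = 1383429731;

        // None of the three inputs depends on the EC2 instance, so the original
        // machine and its ASG replacement produce exactly the same key.
        String keyOfOriginal = rowKey(cluster, rack, slot);
        String keyOfReplacement = rowKey(cluster, rack, slot);
        System.out.println(keyOfOriginal);
        System.out.println("Same key after replacement: " + keyOfOriginal.equals(keyOfReplacement));
    }
}
```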

Values on CASS:

diegopacheco commented 7 years ago

@ipapapa

IMHO, and I could be wrong, IF the ID were "i-abcd....", then I believe the code would work. Maybe this bug was introduced in the code?

ipapapa commented 7 years ago

I do not think that is the problem, as the integers for ids are working fine internally, and the code is very similar to Priam, which has been widely used for Cassandra. After looking into the code, you are right: it is an integer.

diegopacheco commented 7 years ago

@ipapapa

IMHO we will always have this pattern: DM_DYNOMITE_CLUSTER_NAME + RACK + SLOT. So, for this scenario:

DM_DYNOMITE_CLUSTER_NAME = FIXED (Security Group) name. Never changes.

RACK = FIXED ASG (1 ASG per AZ) name. Never changes for the same AZ.

SLOT = Number. Could change, but since we always have 1 instance in this scenario, it will never change.

That code never gets a new token in Cassandra because of this combination. Any other ideas?
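The effect of that combination can be reproduced with a small in-memory model. This is a sketch under assumptions: a `HashMap` stands in for the Cassandra column family, `Entry` is a trimmed-down stand-in for `AppsInstance`, and it models only the existence check quoted earlier in the thread, not the rest of DM's replacement logic. `createOrReplace` is a hypothetical alternative shown for contrast; it is not necessarily what the fix for this issue does.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory model of the token table; all names here are hypothetical. */
final class TokenStoreModel {
    /** Trimmed-down stand-in for AppsInstance. */
    static final class Entry {
        final String app;
        final String rack;
        final int id;
        final String instanceId;
        final String hostName;

        Entry(String app, String rack, int id, String instanceId, String hostName) {
            this.app = app;
            this.rack = rack;
            this.id = id;
            this.instanceId = instanceId;
            this.hostName = hostName;
        }
    }

    private final Map<String, Entry> rows = new HashMap<>();

    private static String key(String app, String rack, int id) {
        return app + "_" + rack + "_" + id;
    }

    /** Mirrors the early return in createInstanceEntry: an existing row is never touched. */
    void createIfAbsent(Entry e) {
        String k = key(e.app, e.rack, e.id);
        if (rows.containsKey(k)) {
            return;
        }
        rows.put(k, e);
    }

    /** Hypothetical alternative: overwrite when a different EC2 instance claims the slot. */
    void createOrReplace(Entry e) {
        String k = key(e.app, e.rack, e.id);
        Entry existing = rows.get(k);
        if (existing != null && existing.instanceId.equals(e.instanceId)) {
            return;
        }
        rows.put(k, e);
    }

    /** Returns the stored host name for a slot, or null if there is no row. */
    String hostFor(String app, String rack, int id) {
        Entry e = rows.get(key(app, rack, id));
        return e == null ? null : e.hostName;
    }

    public static void main(String[] args) {
        TokenStoreModel store = new TokenStoreModel();
        String app = "myService-1p-d0perf-v72-dm";
        String rack = "myService-1p-d0perf-dm-us-west-2b-v72";
        int slot = 1383429731;

        store.createIfAbsent(new Entry(app, rack, slot, "i-old", "old-host"));
        // The ASG replacement arrives with the same app, rack and slot.
        store.createIfAbsent(new Entry(app, rack, slot, "i-new", "new-host"));
        System.out.println("After createIfAbsent:  " + store.hostFor(app, rack, slot));

        store.createOrReplace(new Entry(app, rack, slot, "i-new", "new-host"));
        System.out.println("After createOrReplace: " + store.hostFor(app, rack, slot));
    }
}
```

In this model the stale host survives the first call because the key already exists, which matches the stale seed entry in the YAML dump at the top of the thread.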

ipapapa commented 7 years ago

If you think that using a string for the instance id might fix the problem, please feel free to give it a try. I do not have any ideas on the issue but I am confident the logs should tell you the logic.

diegopacheco commented 7 years ago

@ipapapa

I did lots of debugging and this really looks like an issue to me. I think I fixed it; here is the CODE: https://github.com/diegopacheco/dynomite-manager-1/tree/dev-asg-instance-replace I will do more tests to be 100% sure it is fixed, but it looks okay to me now.

Cheers, Diego Pacheco

ipapapa commented 7 years ago

You are posting your whole tree... can you do a PR to see the diffs or send me the diffs

diegopacheco commented 7 years ago

@ipapapa

Sure I will do 2 PRs to FIX this issue.

Here is the first one: https://github.com/Netflix/dynomite-manager/pull/73

diegopacheco commented 7 years ago

@ipapapa

Submitted the last part of the FIX for this issue here: https://github.com/Netflix/dynomite-manager/pull/74

ipapapa commented 7 years ago

I guess this has been addressed based on the above.