Closed diegopacheco closed 7 years ago
btw, can you make the rack names something simpler like rack: dyno_xxx--useast1e
@ipapapa I will do more debug and come back to you.
@ipapapa
Some debugging info
2a
2017-04-10 21:43:21 INFO InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2a-v72_1383429731
2017-04-10 21:43:21 INFO InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2a-v72, id: 1383429731
2c
2017-04-10 21:43:52 INFO InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2c-v72_1383429731
2017-04-10 21:43:52 INFO InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2c-v72, id: 1383429731
2017-04-10 21:43:52 INFO InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2b-v72
2017-04-10 21:43:52 INFO InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2a-v72
2b
2017-04-10 21:43:41 INFO InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2b-v72_1383429731
2017-04-10 21:43:41 INFO InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2b-v72, id: 1383429731
2017-04-10 21:43:41 INFO InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2a-v72
KILL instance on AWS console (2b)
2b(new - instance)
2017-04-10 22:38:57 INFO InstanceDataDAOCassandra:122 - KEY fronm CASS: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2b-v72_1383429731
2017-04-10 22:38:57 INFO InstanceDataDAOCassandra:240 - getInstance() app: myService-1p-d0perf-v72-dm, rack:myService-1p-d0perf-dm-us-west-2b-v72, id: 1383429731
2017-04-10 22:38:57 INFO InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2a-v72
2017-04-10 22:38:57 INFO InstanceDataDAOCassandra:243 - getInstance() INS ID:1383429731, INS RACK:myService-1p-d0perf-dm-us-west-2b-v72
DM reused the same token info as the previous instance and did not update the data in CASS, so the generated seeds are wrong.
if (getInstance(instance.getApp(), instance.getRack(), instance.getId()) != null)
return;
So no lock was acquired and data was not updated in CASS.
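To make the effect of that guard concrete, here is a minimal sketch. It assumes a hypothetical in-memory map in place of the Cassandra token table; the class name `EarlyReturnDemo`, the short app/rack names, and the IPs are made up for illustration and are not part of Dynomite-Manager.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EarlyReturnDemo {
    // Hypothetical stand-in for the token table: (app, rack, id) -> host IP.
    private final Map<List<Object>, String> table = new HashMap<>();

    /** Returns true if a row was written, false if the guard returned early. */
    public boolean createInstanceEntry(String app, String rack, int id, String hostIp) {
        List<Object> slot = Arrays.asList(app, rack, id);
        if (table.containsKey(slot)) {
            return false; // same shape as the guard above: no lock, no write
        }
        table.put(slot, hostIp);
        return true;
    }

    public String storedIp(String app, String rack, int id) {
        return table.get(Arrays.asList(app, rack, id));
    }

    public static void main(String[] args) {
        EarlyReturnDemo dao = new EarlyReturnDemo();
        // Original node registers first (IP is a made-up example).
        dao.createInstanceEntry("app", "rack-2b", 1383429731, "10.0.0.1");
        // The ASG replacement comes up with the same app, rack and slot.
        boolean written = dao.createInstanceEntry("app", "rack-2b", 1383429731, "10.0.0.2");
        System.out.println("written=" + written
                + ", stored IP=" + dao.storedIp("app", "rack-2b", 1383429731));
        // prints: written=false, stored IP=10.0.0.1
    }
}
```

The replacement node never overwrites the row, so anything derived from the table (such as the seed list) still points at the terminated machine.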
@ipapapa
I did more debugging.
So we HAVE:
NETFLIX_APP - I think the problem could be here.
RACK - looks right to me.
SLOT - looks right to me.
I tried:
NETFLIX_APP == ASG_NAME+UUID
============================
2017-04-11 00:13:43 ERROR AWSMembership:242 - unable to get group-id for group-name=1pdecorator-1p-d0perf-v73-dm_59e9acaf-4979-41cf-9c18-3447aada23bc vpc-id=vpc-a8234ecd
2017-04-11 00:13:43 ERROR Task:86 - Could not execute the task because of The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: c28bddc9-2baf-4534-8df7-d2356ad33b4f)
com.amazonaws.AmazonServiceException: The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: c28bddc9-2baf-4534-8df7-d2356ad33b4f)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1383)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:11128)
at com.amazonaws.services.ec2.AmazonEC2Client.authorizeSecurityGroupIngress(AmazonEC2Client.java:1019)
at com.netflix.dynomitemanager.sidecore.aws.AWSMembership.addACL(AWSMembership.java:201)
at com.netflix.dynomitemanager.sidecore.aws.UpdateSecuritySettings.execute(UpdateSecuritySettings.java:65)
at com.netflix.dynomitemanager.sidecore.scheduler.Task.execute(Task.java:82)
at org.quartz.core.JobRunShell.run(JobRunShell.java:199)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:546)
NETFLIX_APP == ASG_NAME
=======================
2017-04-11 00:31:52 ERROR AWSMembership:242 - unable to get group-id for group-name=1pdecorator-1p-d0perf-dm-us-west-2b-v75 vpc-id=vpc-a8234ecd
2017-04-11 00:31:52 ERROR Task:86 - Could not execute the task because of The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: ac7f3d34-9014-44e5-b49b-ebd973301616)
com.amazonaws.AmazonServiceException: The request must contain the parameter groupName or groupId (Service: AmazonEC2; Status Code: 400; Error Code: MissingParameter; Request ID: ac7f3d34-9014-44e5-b49b-ebd973301616)
NETFLIX_APP == SG_NAME
======================
It works, but then if an instance dies and the ASG replaces the old instance with a new one, we are back to the issue because the data is already in CASS.
Any thoughts @ipapapa ?
According to the changes @akbarahmed made a while ago in the configuration, you must not use NETFLIX_APP but DM_DYNOMITE_CLUSTER_NAME. You should have seen a WARN in your logs for this.
On the logging: I made these changes in order to debug.
I hope that with the code in hand the log now makes more sense.
Let me know otherwise @ipapapa
public AppsInstance getInstance(String app, String rack, int id) {
    logger.info("getInstance() app: {}, rack:{}, id: {} ", new Object[]{app, rack, id});
    Set<AppsInstance> set = getAllInstances(app);
    for (AppsInstance ins : set) {
        logger.info("getInstance() INS ID:{}, INS RACK:{} ", new Object[]{ins.getId(), ins.getRack()});
        // Match on token id and rack only
        if (ins.getId() == id && ins.getRack().equals(rack))
            return ins;
    }
    return null;
}
public void createInstanceEntry(AppsInstance instance) throws Exception {
    logger.info("*** Creating New Instance Entry ***");
    String key = getRowKey(instance);
    logger.info("KEY fronm CASS: {}", new Object[]{key});
    // Early exit: an entry for this app, rack and id already exists
    if (getInstance(instance.getApp(), instance.getRack(), instance.getId()) != null)
        return;

    getLock(instance);
    try {
        MutationBatch m = bootKeyspace.prepareMutationBatch();
        ColumnListMutation<String> clm = m.withRow(CF_TOKENS, key);
        clm.putColumn(CN_ID, Integer.toString(instance.getId()), null);
        clm.putColumn(CN_APPID, instance.getApp(), null);
        clm.putColumn(CN_AZ, instance.getZone(), null);
        clm.putColumn(CN_DC, config.getRack(), null);
        clm.putColumn(CN_INSTANCEID, instance.getInstanceId(), null);
        clm.putColumn(CN_HOSTNAME, instance.getHostName(), null);
        clm.putColumn(CN_EIP, instance.getHostIP(), null);
        clm.putColumn(CN_TOKEN, instance.getToken(), null);
        clm.putColumn(CN_LOCATION, instance.getDatacenter(), null);
        clm.putColumn(CN_UPDATETIME, TimeUUIDUtils.getUniqueTimeUUIDinMicros(), null);
        Map<String, Object> volumes = instance.getVolumes();
        if (volumes != null) {
            for (String path : volumes.keySet()) {
                clm.putColumn(CN_VOLUME_PREFIX + "_" + path, volumes.get(path).toString(), null);
            }
        }
        m.execute();
        logger.info(String.format("Key %s INSERTED on CASS", key));
    } catch (Exception e) {
        logger.info(e.getMessage());
    } finally {
        releaseLock(instance);
    }
}
@ipapapa @akbarahmed I also switched to DM_DYNOMITE_CLUSTER_NAME instead of NETFLIX_APP and got the same error.
@diegopacheco that is not helpful though. Did you see a warning when you used NETFLIX_APP or not? The logs should have all the information.
Check the following though: INS ID:1383429731 is the token number, not the instance ID. The instance ID should be something like i-abcd...
The way you log is not useful for debugging because you are not logging whether the if statement exits, and the names that you use are so long that they make it hard to debug. You probably need some pattern... I explained above some ideas for rack names: dyno_xxx--useast1e. It is pretty obvious: cluster_name + ASG
@ipapapa
I switched to DM_DYNOMITE_CLUSTER_NAME; it does not make a difference for this case.
Regarding the ID: if you check the method getInstance(String app, String rack, int id), id is an int, so it could never be a string like i-abcd...
@ipapapa
Key inside Cassandra: myService-1p-d0perf-v72-dm_myService-1p-d0perf-dm-us-west-2b-v72_1383429731 where:
We have this pattern: DM_DYNOMITE_CLUSTER_NAME + RACK + SLOT
myService-1p-d0perf-v72-dm (NETFLIX_APP/DM_DYNOMITE_CLUSTER_NAME == Security_Group_name)
myService-1p-d0perf-dm-us-west-2b-v72 (rack == ASG_NAME)
1383429731 (token slot)
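The pattern above can be sketched as a small helper. This is an illustration only: `RowKeyDemo` and `rowKey` are hypothetical names, and the underscore separator is inferred from the key shown in the log lines, not taken from the real getRowKey implementation.

```java
public class RowKeyDemo {
    // Assumed composition: cluster name, rack and token slot joined with "_".
    static String rowKey(String clusterName, String rack, int slot) {
        return clusterName + "_" + rack + "_" + slot;
    }

    public static void main(String[] args) {
        String cluster = "myService-1p-d0perf-v72-dm";          // security group name
        String rack = "myService-1p-d0perf-dm-us-west-2b-v72";  // ASG name
        int slot = 1383429731;                                   // token slot

        String before = rowKey(cluster, rack, slot); // original 2b instance
        String after = rowKey(cluster, rack, slot);  // ASG replacement in 2b

        System.out.println(before);
        // None of the three inputs depends on the EC2 instance, so the key is identical.
        System.out.println("same key after replacement: " + before.equals(after));
    }
}
```

Because nothing in the key identifies the machine itself, a replacement instance in the same ASG always maps to the row that already exists.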
Values on CASS:
@ipapapa
IMHO, and I could be wrong, but if the ID was "i-abcd...." then I believe the code would work. Maybe this bug was introduced in the code?
I do not think that is the problem, as the integers for ids are working fine internally, and the code is very similar to Priam, which has been widely used for Cassandra. After looking into the code, you are right, it is an integer.
@ipapapa
IMHO we will always have this pattern: DM_DYNOMITE_CLUSTER_NAME + RACK + SLOT. So for this scenario:
DM_DYNOMITE_CLUSTER_NAME = FIXED (Security Group) name, Never Changes.
RACK = FIXED ASG(1 ASG per AZ) name, Never Changes for the same AZ.
SLOT = Number Could change but since we always have 1 instance for this scenario - it will never change.
That code is never getting a new token in Cassandra because of this combination. Any other ideas?
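One possible direction, sketched under assumptions: keep the slot key as it is, but also compare the EC2 instance id stored in the row, and overwrite when it differs. This is not necessarily what the real code or any later fix does; `ReplaceAwareRegistry`, `register`, and the sample ids and IPs are hypothetical, and a plain map stands in for Cassandra.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplaceAwareRegistry {
    // Hypothetical row contents: which machine currently owns the slot.
    static final class Entry {
        final String instanceId;
        final String hostIp;

        Entry(String instanceId, String hostIp) {
            this.instanceId = instanceId;
            this.hostIp = hostIp;
        }
    }

    private final Map<String, Entry> table = new HashMap<>();

    /** Returns true if the row was written or overwritten, false if left untouched. */
    public boolean register(String slotKey, String instanceId, String hostIp) {
        Entry existing = table.get(slotKey);
        if (existing != null && existing.instanceId.equals(instanceId)) {
            return false; // same machine is already registered for this slot
        }
        // New slot, or the slot is held by a different (replaced) machine: write it.
        table.put(slotKey, new Entry(instanceId, hostIp));
        return true;
    }

    public String hostIp(String slotKey) {
        Entry e = table.get(slotKey);
        return e == null ? null : e.hostIp;
    }

    public static void main(String[] args) {
        ReplaceAwareRegistry registry = new ReplaceAwareRegistry();
        String slot = "cluster_rack-2b_1383429731";
        registry.register(slot, "i-old", "10.0.0.1"); // original node
        registry.register(slot, "i-new", "10.0.0.2"); // ASG replacement
        System.out.println(registry.hostIp(slot)); // prints: 10.0.0.2
    }
}
```

In a real deployment the overwrite would still need the same locking as the insert path, which this sketch leaves out.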
If you think that using a string for the instance id might fix the problem, please feel free to give it a try. I do not have any ideas on the issue but I am confident the logs should tell you the logic.
@ipapapa
I did a lot of debugging and this really looks like an issue to me. I think I fixed it; here is the code: https://github.com/diegopacheco/dynomite-manager-1/tree/dev-asg-instance-replace I will do more tests to be 100% sure it is fixed, but it looks okay to me now.
Cheers, Diego Pacheco
You are posting your whole tree... can you do a PR to see the diffs or send me the diffs
@ipapapa
Sure I will do 2 PRs to FIX this issue.
Here is the first one: https://github.com/Netflix/dynomite-manager/pull/73
@ipapapa
Submitted the last part of the FIX for this issue here: https://github.com/Netflix/dynomite-manager/pull/74
I guess this has been addressed based on the above.
@ipapapa
I created a Dynomite-Manager / Dynomite cluster using the following versions:
Redis: 3.0.7
Dynomite: 0.5.8-5
DM: v1.0.1 (from the tag in GitHub)
I created the cluster in AWS us-west-2 with 3 ASGs, one ASG per AZ, like this:
-node1 | us-west-2c
-node2 | us-west-2a
-node3 | us-west-2b
There is only one instance per AZ: MIN: 1, MAX: 1, DESIRED: 1.
It all works great: I have a 3 node cluster, all working fine. All 3 nodes use this token: 1383429731.
Now I go to the AWS EC2 console and TERMINATE node 3, for instance.
The ASG creates a new DM / Dynomite instance, but with the OLD and WRONG IP, pointing to the old machine that just died.
DM LOG
Where:
Is wrong already and this should be pointing to the new IP but is not.
Cheers, Diego Pacheco