Closed huizhilu closed 3 years ago
@kaisun2000 One more flaky test run on GitHub :) If you have time, feel free to tackle it.
LOG 2742
2020-10-08T00:48:47.6150751Z new config:ZnRecord=TestCluster_0, { FAULT_ZONE_TYPE=Host, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, SimpleField1=Value1, SimpleField2=Value2, TOPOLOGY=/Rack/Sub-Rack/Host/Instance, TOPOLOGY_AWARE_ENABLED=true, newField=newValue } { MapField1={key1=value1, key2=value2}, MapField2={key3=value1, key4=value2}, newMap={newkey1=newvalue1, newkey2=newvalue2} } { INSTANCE_CAPACITY_KEYS=[], ListField1=[Value1, Value2, Value3], ListField2=[Value2, Value1, Value3], newList=[newValue1, newValue2] }, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0} 2020-10-08T00:48:47.6152886Z End test :testAddConfigFields 2020-10-08T00:48:47.6153403Z Start test :testUpdateConfigFields 2020-10-08T00:48:47.6185510Z new config:ZnRecord=TestCluster_0, {FAULT_ZONE_TYPE=Host, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, SimpleField1=Value1, SimpleField2=Value2, TOPOLOGY=/Rack/Sub-Rack/Host/Instance, TOPOLOGY_AWARE_ENABLED=true, newField=newValue}{MapField1={key1=value1, key2=value2}, MapField2={key3=value1, key4=value2}, newMap={newkey1=newvalue1, newkey2=newvalue2}}{INSTANCE_CAPACITY_KEYS=[], ListField1=[Value1, Value2, Value3], ListField2=[Value2, Value1, Value3], newList=[newValue1, newValue2]}, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0}
This is from debugger
Start test :testUpdateConfigFields new config:ZnRecord=TestCluster_0, { FAULT_ZONE_TYPE=helixZoneId, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, SimpleField1=Value1, SimpleField2=Value2, newField=newValue } { MapField1={key1=value1, key2=value2}, MapField2={key3=value1, key4=value2}, newMap={newkey1=newvalue1, newkey2=newvalue2} } { ListField1=[Value1, Value2, Value3], ListField2=[Value2, Value1, Value3], newList=[newValue1, newValue2] }, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0}
Need to trace the origin of INSTANCE_CAPACITY_KEYS in github. Is it related to cloud config?
The code path from static analysis seems to be:
computeBestPossibleAssignment -->
globalRebalance -->
ResourceChangeDetector.updateSnapshots -->
ResourceChangeSnapshot(ResourceControllerDataProvider dataProvider,
boolean ignoreNonTopologyChange) -->
ClusterConfigTrimmer.getInstance().trimProperty(dataProvider.getClusterConfig()) -->
HelixPropertyTrimmer.doTrim -->
copyNonTrimmableInfo(originalZNRecord, trimmedZNRecord, getNonTrimmableFields(originalProperty),
false) -->
protected Map<FieldType, Set
private static final Map<FieldType, Set
Note
private void copyNonTrimmableInfo(ZNRecord originalZNRecord, ZNRecord trimmedZNRecord,
Map<FieldType, Set<String>> nonTrimmableFields, boolean trimValue) {
for (Map.Entry<FieldType, Set<String>> fieldEntry : nonTrimmableFields.entrySet()) {
FieldType fieldType = fieldEntry.getKey();
Set<String> fieldKeySet = fieldEntry.getValue();
if (null == fieldKeySet || fieldKeySet.isEmpty()) {
continue;
}
switch (fieldType) {
case SIMPLE_FIELD:
fieldKeySet.stream().forEach(fieldKey -> {
if (originalZNRecord.getSimpleFields().containsKey(fieldKey)) {
trimmedZNRecord.getSimpleFields().putIfAbsent(fieldKey,
trimValue ? null : originalZNRecord.getSimpleField(fieldKey));
}
});
break;
case LIST_FIELD:
fieldKeySet.stream().forEach(fieldKey -> {
if (originalZNRecord.getListFields().containsKey(fieldKey)) {
trimmedZNRecord.getListFields().putIfAbsent(fieldKey,
trimValue ? **Collections.EMPTY_LIST** : originalZNRecord.getListField(fieldKey));
}
});
break;
This would add an empty INSTANCE_CAPACITY_KEYS=[]
All in all, once waged rebalancer wrote global base line, it may add this list field. INSTANCE_CAPACITY_KEYS. This seems to be recent change from @jiajunwang?
The test is constantly failing on master now. The direct reason is the same, there is an empty INSTANCE_CAPACITY_KEYS=[],
list so we get a IndexOutOfBoundsException error.
My theory is that, this empty list could be added by TestInstancesAccessor.testValidateWeightForAllInstances() or TestPerInstanceAccessor.testValidateWeightForInstance() since these two tests and TestClusterAccessor.testUpdateConfigFields() uses the same cluster "TestCluster_0".
In the test log where testUpdateConfigFields is failing, the two tests that added this empty list always happen before this testUpdateConfigFields.
My solution is simply changing the cluster for testUpdateConfigFields to another test cluster and use "TestCluster_0" only for WAGED related tests.
@Test(dependsOnMethods = "checkUpdateFails")
public void testValidateWeightForInstance()
throws IOException {
System.out.println("Start test :" + TestHelper.getTestMethodName());
// Empty out ClusterConfig's weight key setting and InstanceConfig's capacity maps for testing
ClusterConfig clusterConfig = _configAccessor.getClusterConfig(CLUSTER_NAME);
clusterConfig.getRecord()
.setListField(ClusterConfig.ClusterConfigProperty.INSTANCE_CAPACITY_KEYS.name(),
new ArrayList<>());
_configAccessor.setClusterConfig(CLUSTER_NAME, clusterConfig);
Close test unstable tickets since we have an automatic tracking mechanism #1757 now for tracking the most recent test issues.