Closed anriban closed 7 years ago
Are we confident it’s a bug and not a transient configuration issue?
I suspect it is a bug. I tried 5 times today, with requests generated both with Flukes and AHAB, on two racks, and have been able to consistently observe the same behavior.
This request succeeded on RCI and BBN racks.
I will check the PSC-OSF rack case.
I tried it with UMass - SL, and still see the same issue, with the VM at UMass failing similarly. I still think it has to do with the wrong instance type being passed to the handler for the failing VM. If the VM image can't be launched with m1.large (the wrong type being passed), the VM fails. I believe the VM image used in the RCI-BBN test was a different one that worked with the m1.large instance type at RCI.
Just checking: is this related to #41?
No, I don't think this is related to #41. Everything is bound here.
The controller is printing out this error message, only for the single VM:
2017-04-20 10:08:57,160 [qtp837902375-93484 - /orca/xmlrpc] ERROR ndl.logger - ReservationConverter:No constraints on VM request!
This comes from ReservationConver
This behavior shows up in unit tests, so I should be able to find the bug.
I'm not sure if 5aac125ea0476f7d7cd7bf5a6691cb27ee249f8f is the right way to fix this problem. I'm not convinced we should be using existing_ce at all in this case.
I probably need to make an MP Modify test case, to make sure we don't break those. Earlier commit 74165cee0a62eff2c480e247d9807d8dc257f3d1 is relevant to this.
@ibaldin @YufengXin thoughts?
I'm not sure I know how to make a modify request for a multipoint request.
I started with Anirban's original request. For the modify, I tried adding a fourth node, to a new domain. I tried wiring the Node directly to the existing VLAN in the middle. This resulted in the exception: Path has 3 (odd number) of endpoints.
Next, I tried adding a new broadcast domain on the modify, and wired up all 4 nodes to this new broadcast domain. This resulted in a different error Unable to satisfy request due to 99 (Exception in finding common label:java.lang.Exception: Passed in static label is not in the available labelset:static=-1;set=.
What should the original and modify requests look like to test adding a new inter-rack MP link in a modify?
Hi, Alan,
Modifying an inter-rack MP request doesn’t work because the departure drive AM policy doesn’t support it. However, the controller used to work. I’ll need to look into it.
-Yufeng
(Reply to hinchliff's comment of Apr 24, 2017, 9:33 AM.)
Attached is a multi-point request, which is failing. The request was created and submitted using Flukes.
The request rdf file is at http://geni-images.renci.org/images/anirban/tmp/flukes-req-mp.rdf
Of the three VMs, two were assigned to OSF rack and the third was assigned to PSC rack. The VM assigned to the PSC rack fails. Here's the manifest snapshot.
The manifest rdf file is at http://geni-images.renci.org/images/anirban/tmp/flukes-manifest-mp.rdf
I have reversed the domains, and the same thing happens: the lone VM fails. Also, the VM fails very fast, which might indicate that nova is bailing out very early. There is nothing wrong with the image. The same image works fine on the same sites when all the VMs are assigned to a single site.
One possible reason is that the VM handler gets the wrong instance type (m1.large) for some reason, even though XO Large is assigned in the request. The VMs that succeed get the right instance type.
INFO | jvm 1 | 2017/02/15 22:25:25 | join:
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2 HANDLER: JOIN on 02/15/2017 22:25:25 UTC
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] Cloud Type: nova-essex
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] invoking image proxy http://psc-hn.exogeni.net:11081/axis2/services/IMAGEPROXY to download and install image http://geni-images.renci.org/images/anirban/adamant/genovariant-0.12/genovariant-0.12.xml... (may take some time)
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] Status from image proxy is SUCCESS
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] New EMI from image proxy is 5d8d4006-d7fd-438f-a360-21036acd125d
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] New EKI from image proxy is 1f7201cb-172e-4679-89fb-e67dae26aafc
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] New ERI from image proxy is b34e7766-a5d9-4f25-b92e-8a67ef5edc6e
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] User did not specify instance type, using default m1.large
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] creating Euca instance http://geni-orca.renci.org/owl/9f44aa1e-002b-41fc-b953-df420d4cafab#Master using emi 5d8d4006-d7fd-438f-a360-21036acd125d ...(may take some time)
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] /etc/orca/am+broker-12080//packages/pkg/87DBB2BE-7977-4463-A0CB-F6043735DEF9/scripts//nova-essex-start
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] UNIT_HOSTNAME_URL=http://geni-orca.renci.org/owl/9f44aa1e-002b-41fc-b953-df420d4cafab#Master
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_HOME=/etc/orca/am+broker-12080//packages/pkg/87DBB2BE-7977-4463-A0CB-F6043735DEF9/tools/
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EUCA_KEY_DIR=/etc/orca/am+broker-12080/ec2
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_LOG_DIR=/var/log/orca
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_LOG_FILE=handler-vm.log
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_LOG_LEVEL=debug
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EUCA_GROUP=default
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] NEUCA_INI=${neuca.ini.file}
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] AMI_NAME=5d8d4006-d7fd-438f-a360-21036acd125d
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] AKI_NAME=1f7201cb-172e-4679-89fb-e67dae26aafc
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] ARI_NAME=b34e7766-a5d9-4f25-b92e-8a67ef5edc6e
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_INSTANCE_TYPE=m1.large
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_SSH_KEY=geni-orca
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_USE_PUBLIC_ADDRESSING=true
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_PING_RETRIES=60
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_SSH_RETRIES=60
INFO | jvm 1 | 2017/02/15 22:25:25 |
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_SSH_TIMEOUT=${ec2.ssh.timeout}
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_STARTUP_RETRIES=10
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_CONNECTION_TIMEOUT=60
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_REQUEST_TIMEOUT=120
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_SITE_PROPERTIES=/etc/orca/am+broker-12080/config/ec2.site.properties
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] before create instance
INFO | jvm 1 | 2017/02/15 22:25:48 | [exec] Result: 1
INFO | jvm 1 | 2017/02/15 22:25:48 | [echo] after create instance: exit code 1, Cannot get console log
INFO | jvm 1 | 2017/02/15 22:25:48 | [echo] unable to create instance: exit code 1, Cannot get console log
INFO | jvm 1 | 2017/02/15 22:25:48 | [echo] console-log: Cannot get console log
INFO | jvm 1 | 2017/02/15 22:25:48 | [echo] join exit code: 1
INFO | jvm 1 | 2017/02/15 22:25:48 |
INFO | jvm 1 | 2017/02/15 22:25:48 | BUILD SUCCESSFUL
INFO | jvm 1 | 2017/02/15 22:25:48 | Total time: 23 seconds
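To compare failing and succeeding VMs across racks, it may help to pull the relevant fields out of handler logs like the one above. The following is a rough Python sketch, not part of ORCA: the function name summarize_join and the regular expression are assumptions based only on the log format shown in this report. It records the instance type passed to the handler, whether the handler fell back to its default, and the join exit code.

```python
import re

# Prefix of each wrapper log entry, as seen in the log above:
#   "INFO | jvm 1 | 2017/02/15 22:25:25 | <message>"
# Assumption: every entry starts with this prefix; \s also matches newlines,
# so entries that were wrapped or pasted onto one line are still split.
PREFIX = re.compile(
    r"INFO\s*\|\s*jvm\s+1\s*\|\s*\d{4}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2}\s*\|"
)


def summarize_join(log_text):
    """Extract instance type, default fallback, and join exit code (hypothetical helper)."""
    summary = {"instance_type": None, "defaulted": False, "join_exit_code": None}
    for raw in PREFIX.split(log_text):
        msg = raw.strip()
        if msg.startswith("[echo]"):
            msg = msg[len("[echo]"):].strip()
        if msg.startswith("EC2_INSTANCE_TYPE="):
            summary["instance_type"] = msg.split("=", 1)[1]
        elif msg.startswith("User did not specify instance type"):
            # The handler did not receive a type and fell back to its default.
            summary["defaulted"] = True
        elif msg.startswith("join exit code:"):
            summary["join_exit_code"] = int(msg.split(":", 1)[1])
    return summary


# A few entries copied from the failing VM's log above.
SAMPLE = """\
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] User did not specify instance type, using default m1.large
INFO | jvm 1 | 2017/02/15 22:25:25 | [echo] EC2_INSTANCE_TYPE=m1.large
INFO | jvm 1 | 2017/02/15 22:25:48 | [echo] join exit code: 1
"""

print(summarize_join(SAMPLE))
```

Running this over the logs of the VMs that succeed and the one that fails should show whether "defaulted" is set only for the lone VM, which is what the instance-type hypothesis predicts.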
@YufengXin @paul-ruth