Modify doesn't respect core constraints

ibaldin commented 7 years ago

As reported by @mcevik0

Cause TicketReview closure on Modify request

When the slice is modified via "Increase NodeGroup Size" and more VMs are requested than the configured maximum number of cores allows, the reservations are not closed.

Create a request with two NodeGroups (each has 1 XOXlarge VM). NodeGroup0 --- BroadcastLink --- NodeGroup1
Submit
Poll manifest
Modify slice by adding more VMs to NodeGroup1. "Increase Node Group Size" with 20 XOXlarge VMs.
Submit changes
Poll manifest.

Slice is active with 22 (2+20) XOXlarge VMs requested through rack-controller. Rack has 52 cores delegated to ndl-broker, 52 cores for wvn-broker. No more than 13 XOXlarge VMs should have been requested.
Same happens when ExoSM is used.
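
The limit of 13 quoted above follows from simple division. Here is a minimal sketch of that arithmetic, assuming an XOXlarge VM is accounted as 4 cores (a value inferred from 52 cores / 13 VMs in this report, not confirmed here). All names are illustrative.

```python
# Cores delegated to the broker, as stated in the report.
DELEGATED_CORES = 52
# Assumption: cores per XOXlarge VM, inferred from 52 cores / 13 VMs.
CORES_PER_XOXLARGE = 4


def max_vms(delegated_cores: int, cores_per_vm: int) -> int:
    """Largest number of VMs whose total core count fits in the delegation."""
    return delegated_cores // cores_per_vm


requested_vms = 2 + 20  # two initial VMs plus the 20 added by the modify
limit = max_vms(DELEGATED_CORES, CORES_PER_XOXLARGE)
excess = requested_vms - limit
print(f"limit={limit} requested={requested_vms} excess={excess}")
```

Under that assumption the modify asks for 9 more VMs than the delegation can hold.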

[Screenshot, 2017-04-18 at 11:11:30]

### Initial
pequod:show>show available for wvn-sm actor wvn-broker
Resources available to wvn-sm from wvn-broker
    Resource wvnNet.vlan = 10
    Resource wvnvmsite.vlan = 50
    Resource wvnvmsite.vm = 52

### Final
pequod:show>show available for wvn-sm actor wvn-broker
Resources available to wvn-sm from wvn-broker
    Resource wvnNet.vlan = 10
    Resource wvnvmsite.vlan = 49
    Resource wvnvmsite.vm = 30

pequod:show>show reservationProperties for current actor wvn-sm type config filter "ec2.instance.type"
Reservation a696929e-0825-4217-bd22-bd662271f2ce:
a696929e-0825-4217-bd22-bd662271f2ce
CONFIG:
    unit.ec2.instance.type = xo.xlarge
...
Reservation d06a891c-a6f8-4f02-a8d5-1f89057cb4bf:
d06a891c-a6f8-4f02-a8d5-1f89057cb4bf
CONFIG:
    unit.ec2.instance.type = xo.xlarge

Total: 22 reservations
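
The Initial and Final snapshots can be compared mechanically. The sketch below parses the `show available` lines quoted above and computes how many units of each resource were consumed. The parsing pattern is written only for the lines shown here, and the helper names are hypothetical.

```python
import re

INITIAL = """\
    Resource wvnNet.vlan = 10
    Resource wvnvmsite.vlan = 50
    Resource wvnvmsite.vm = 52
"""

FINAL = """\
    Resource wvnNet.vlan = 10
    Resource wvnvmsite.vlan = 49
    Resource wvnvmsite.vm = 30
"""

# Matches lines of the form "Resource <name> = <count>".
LINE = re.compile(r"^\s*Resource\s+(\S+)\s*=\s*(\d+)\s*$")


def parse_available(text: str) -> dict:
    """Map resource name to available count."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def consumed(before: dict, after: dict) -> dict:
    """Units used up between two snapshots, for resources present in both."""
    return {name: before[name] - after[name] for name in before if name in after}


delta = consumed(parse_available(INITIAL), parse_available(FINAL))
print(delta)
```

The `wvnvmsite.vm` count drops by 22, the same as the number of reservations. That is consistent with each VM being charged as one unit rather than by its core count, although the output itself does not state this.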

First identified in RENCI-NRIG/exogeni#125

hinchliff commented 7 years ago

I can't seem to get the Modify to work for this sort of request.

Is this what the "Increase Node Group Size..." option is supposed to look like? [Screenshot: increase_group_size]

If I put 20 in that box, and then Submit Changes (and Poll/Query Manifest), my slice doesn't look like it has changed at all. It is still just two VMs connected by a broadcast link.

Could someone give me a Request and a Modify RDF?

ibaldin commented 7 years ago

You need to start with a node group

ibaldin commented 7 years ago

Actually I see you did already. Try starting with nodegroups of size > 1

I suspect there may be a bug related to converting a nodegroup into a single node.

hinchliff commented 7 years ago

I still get the same thing, starting with Node Groups of size 2. (I think Mert's original was with Node Groups of size 1).

ibaldin commented 7 years ago

It may be broken then

hinchliff commented 7 years ago

Modify might be failing on simple VM additions too.

This might be the exception:

controller.log.6:2017-05-19 15:05:41,945 [qtp1574943246-34 - /orca/xmlrpc] ERROR controller.OrcaXmlrpcHandler - getSliceManifest(): converter unable to get manifest: java.lang.IllegalArgumentException: Model is a null pointer

controller.log.6:2017-05-19 15:05:41,946 [qtp1574943246-34 - /orca/xmlrpc] ERROR controller.OrcaXmlrpcHandler - getSliceManifest(): Exception encountered: OrcaControllerException: ERROR: Failed due to exception: java.lang.IllegalArgumentException: Model is a null pointer

controller.log.6:2017-05-19 15:05:41,946 [qtp1574943246-34 - /orca/xmlrpc] ERROR controller.OrcaXmlrpcHandler - sliceStatus(): ControllerException: OrcaControllerException: ERROR: Exception encountered: orca.controllers.OrcaControllerException: OrcaControllerException: ERROR: Failed due to exception: java.lang.IllegalArgumentException: Model is a null pointer

hinchliff commented 7 years ago

OK, a modify request that is a simple VM addition still works. The above errors seem to be unrelated, and previous modify requests were only failing because of a problem local to UFL.

Will investigate the Increase Node Group Size bug next week.

hinchliff commented 7 years ago

The modify is not occurring because IP Address information seems to be missing, causing this NPE:

java.lang.NullPointerException
    at orca.embed.cloudembed.MappingHandler.getIPRange(MappingHandler.java:199)
    at orca.embed.cloudembed.controller.ModifyHandler.addElements(ModifyHandler.java:459)
    at orca.embed.cloudembed.controller.ModifyHandler.modifySlice(ModifyHandler.java:139)
    at orca.embed.workflow.RequestWorkflow.modify(RequestWorkflow.java:249)
    at orca.controllers.xmlrpc.OrcaXmlrpcHandler.modifySlice(OrcaXmlrpcHandler.java:594)

hinchliff commented 7 years ago

Will return to investigate this ticket, related to TicketReview, after #137 is resolved.

hinchliff commented 7 years ago

There seem to be a couple of things going on with this ticket.

There is one relatively easy fix, which is to make sure that the Controller adds the core constraints / requested resources (e.g. Num CPU) to the request. This allows the SM to verify the actual availability of resources, producing expected results.
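
As an illustration of why the missing core count matters, here is a minimal sketch of an availability check. It is not ORCA code: `fits`, its parameters, and the fallback of one unit per VM when no core count is supplied are all assumptions made for this example, as is the figure of 4 cores per XOXlarge (inferred from 52 cores / 13 VMs).

```python
from typing import Optional


def fits(available_cores: int, num_vms: int,
         cores_per_vm: Optional[int] = None) -> bool:
    """Return True if the request fits in the available pool.

    Assumption: when the request carries no core count, each VM is
    charged as a single unit, which makes the check too permissive.
    """
    charge_per_vm = cores_per_vm if cores_per_vm is not None else 1
    return num_vms * charge_per_vm <= available_cores


# 22 VMs against a 52-core delegation, without and with the core count.
without_cores = fits(52, 22)
with_cores = fits(52, 22, cores_per_vm=4)
print(without_cores, with_cores)
```

With the core count attached, the same 22-VM request is rejected, while a 13-VM request would still pass.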

A secondary issue is that it seems like the Controller doesn't verify available resources at all for this type of Modify request (NodeGroup Increases). We should probably fix that? This is possibly slightly more difficult, because I don't immediately see where to plug that in. And, in a sense, we have acknowledged that the Controller is not always going to have accurate counts of available resources, so maybe it's not that bad that the request will make it to the SM before it fails?

Thoughts?

hinchliff commented 7 years ago

@YufengXin thoughts on whether the Controller should be checking the available resource count for NodeGroup increases? Or should we just let the SM handle it? (See previous comment)

RENCI-NRIG / orca5

Modify doesn't respect core constraints #122
