TDB issue on controller recovery

mcevik0 commented 4 years ago

When I tried to stateful-restart the controller, TDB was not synched and model state was lost on ExoSM. Relevant observations are below:

One rack loses public connection, but slice operations take place against this rack (new slices etc)
Reservations are in ticketed,redeeming state and are stuck. They cannot be deleted through pequod.
Depending on the conditions above (or other separate reasons), high cpu and memory resources are used by the controller JVM. Response to stop/start of JVM is affected from timeouts. (Wrapper may have options to set these timeouts)
Manually cleaning ticketed reservations on SM (Ticketed reservation cleanup) relieves high resource consumption and ExoSM response for requests.
On certain attempts, stateful-restart with recovery caused the TDB sync and loss of model state which is the main problem.

While trying to debug the issue, I had seen the log statement below during the problem (I'm not sure how relevant it is). Also, entire controller.log is attached at the end.

INFO   | jvm 1    | 2020/03/11 19:26:06 | com.hp.hpl.jena.tdb.base.file.FileException: ObjectFileStorage.read[nodes](74312)[filesize=115410][file.size()=115410]: Impossibly large object : 492780077 bytes > filesize-(loc+SizeOfInt)=41094
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.base.objectfile.ObjectFileStorage.read(ObjectFileStorage.java:346)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.lib.NodeLib.fetchDecode(NodeLib.java:79)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.readNodeFromTable(NodeTableNative.java:178)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._retrieveNodeByNodeId(NodeTableNative.java:103)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getNodeForNodeId(NodeTableNative.java:74)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:103)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:74)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:55)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.lib.TupleLib.triple(TupleLib.java:137)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.lib.TupleLib.triple(TupleLib.java:114)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.lib.TupleLib.access$000(TupleLib.java:45)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.lib.TupleLib$3.convert(TupleLib.java:76)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.tdb.lib.TupleLib$3.convert(TupleLib.java:72)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at org.apache.jena.atlas.iterator.Iter$4.next(Iter.java:322)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at org.apache.jena.atlas.iterator.Iter$4.next(Iter.java:322)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at org.apache.jena.atlas.iterator.Iter$4.next(Iter.java:322)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at org.apache.jena.atlas.iterator.Iter.next(Iter.java:920)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.util.iterator.WrappedIterator.next(WrappedIterator.java:94)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.util.iterator.Map1Iterator.next(Map1Iterator.java:45)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.util.iterator.NiceIterator.asSet(NiceIterator.java:206)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.util.iterator.NiceIterator.toSet(NiceIterator.java:197)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.graph.GraphUtil.listPredicates(GraphUtil.java:89)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.rdf.model.impl.ModelCom.listPredicates(ModelCom.java:961)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.rdf.model.impl.ModelCom.listNameSpaces(ModelCom.java:974)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.addNameSpaces(BaseXMLWriter.java:221)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.setupNamespaces(BaseXMLWriter.java:479)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:461)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:450)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.rdf.model.impl.ModelCom.write(ModelCom.java:347)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at com.hp.hpl.jena.ontology.impl.OntModelImpl.write(OntModelImpl.java:2637)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at orca.controllers.xmlrpc.ReservationConverter.getManifest(ReservationConverter.java:1228)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at orca.controllers.xmlrpc.XmlrpcHandlerHelper.getSliceManifest(XmlrpcHandlerHelper.java:895)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at orca.controllers.xmlrpc.pubsub.PublishManager$PublisherTask.doPublish(PublishManager.java:332)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at orca.controllers.xmlrpc.pubsub.PublishManager$PublisherTask.run(PublishManager.java:195)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
INFO   | jvm 1    | 2020/03/11 19:26:06 |   at java.lang.Thread.run(Thread.java:748)

Controller log that shows loss of model state: controller-stdout.log.txt

With a quick search for previous issues, there are some other TDB issues, they may or may not relevant. #29 #25

kthare10 commented 4 years ago

Based on discussion following actions were planed:

Increase timeout for TDB sync on shutdown
Cleanup for redeeming reservations at SM

Cleanup for [Ticketed,Redeeming] reservations

Reproduced this on rocky-hn
Current cleanup as listed here doesn't work as reservations are not in Closed, Failed or ClosedWait state.

Solution: Closing the reservations before removing them causes the cleanup to be successful

show deadslices for sm-to-be-cleaned-up
set current slices
show reservations for current actor sm-to-be-cleaned-up
set current reservations
manage remove slice current actor sm-to-be-cleaned-up
manage close reservation current actor sm-to-be-cleaned-up
manage remove reservation current actor sm-to-be-cleaned-up
manage remove slice current actor broker-to-which-sm-talks
manage close reservation current actor broker-to-which-sm-talks
manage remove reservation current actor broker-to-which-sm-talks

ibaldin commented 4 years ago

OK, that's good. @mcevik0 will this work for you?

mcevik0 commented 4 years ago

There is a live example of such stuck ticketed reservations on ExoSM. I tried the steps above to remove this reservation and posted my output, but it seems that reservation cannot be removed.

@kthare10 - Would it be possible to take a look at that reservation and test the solution on it? I tried to be careful with the commands below, it can be good to take a look at them.

Reservation

pequod>show reservations for all actor exo-sm state ticketed
50a02169-0cc6-4969-92de-87318308a75b    exo-sm
Slice: 41888103-305b-4dc6-9870-722eee855cbc
1    slvmsite.vm    [ ticketed, active]    
Notices: Reservation 50a02169-0cc6-4969-92de-87318308a75b (Slice urn:publicid:IDN+wall2.ilabt.iminds.be:fed4fire+slice+gfZ5iNOfj0vhb) is in state [Ticketed,Redeeming]
Start: Wed Mar 18 19:57:57 EDT 2020    End: Wed Mar 18 20:57:49 EDT 2020    Requested end: Wed Mar 18 20:57:49 EDT 2020

Close reservation

pequod>manage close reservation current actor exo-sm
Closed reservation 50a02169-0cc6-4969-92de-87318308a75b on exo-sm with result true

Reservation stays


pequod>show reservations for all actor exo-sm state ticketed
50a02169-0cc6-4969-92de-87318308a75b    exo-sm
Slice: 41888103-305b-4dc6-9870-722eee855cbc
1    slvmsite.vm    [ ticketed, active]    
Notices: Reservation 50a02169-0cc6-4969-92de-87318308a75b (Slice urn:publicid:IDN+wall2.ilabt.iminds.be:fed4fire+slice+gfZ5iNOfj0vhb) is in state [Ticketed,Redeeming]
Start: Wed Mar 18 19:57:57 EDT 2020    End: Wed Mar 18 20:57:49 EDT 2020    Requested end: Wed Mar 18 20:57:49 EDT 2020

Total: 1 reservations


- Use the steps above

pequod>manage remove slice 41888103-305b-4dc6-9870-722eee855cbc actor exo-sm Removed slice 41888103-305b-4dc6-9870-722eee855cbc on exo-sm with result true

pequod>manage close reservation 50a02169-0cc6-4969-92de-87318308a75b actor exo-sm Closed reservation 50a02169-0cc6-4969-92de-87318308a75b on exo-sm with result true

pequod>manage remove reservation 50a02169-0cc6-4969-92de-87318308a75b actor exo-sm Removed reservation 50a02169-0cc6-4969-92de-87318308a75b on exo-sm with result false

pequod>manage remove slice 41888103-305b-4dc6-9870-722eee855cbc actor ndl-broker Removed slice 41888103-305b-4dc6-9870-722eee855cbc on ndl-broker with result true

pequod>manage close reservation 50a02169-0cc6-4969-92de-87318308a75b actor ndl-broker Closed reservation 50a02169-0cc6-4969-92de-87318308a75b on ndl-broker with result false

pequod>manage remove reservation 50a02169-0cc6-4969-92de-87318308a75b actor ndl-broker Removed reservation 50a02169-0cc6-4969-92de-87318308a75b on ndl-broker with result true


- Reservation stays

pequod>show reservations for all actor exo-sm state ticketed 50a02169-0cc6-4969-92de-87318308a75b exo-sm Slice: 41888103-305b-4dc6-9870-722eee855cbc 1 slvmsite.vm [ ticketed, active]
Notices: Reservation 50a02169-0cc6-4969-92de-87318308a75b (Slice urn:publicid:IDN+wall2.ilabt.iminds.be:fed4fire+slice+gfZ5iNOfj0vhb) is in state [Ticketed,Redeeming] Start: Wed Mar 18 19:57:57 EDT 2020 End: Wed Mar 18 20:57:49 EDT 2020 Requested end: Wed Mar 18 20:57:49 EDT 2020

Total: 1 reservations

kthare10 commented 4 years ago

@mcevik0 - ticketed active reservation takes a bit too close. It will actually trigger leave for the corresponding resource before going to close state. You may have to wait a bit before triggering the remove.

ibaldin commented 4 years ago

So this one is different - this is active. I thought we were talking about ticketed/redeeming

kthare10 commented 4 years ago

I did test ticketed/active scenario on rocky-hn as well, and ticket moved to closed state after relinquishing the resource. This behavior is as per the implementation as depicted below.

        case ReservationStates.Ticketed:
            if (pending != ReservationStates.Redeeming) {
                transition("close", ReservationStates.Closed, pending);
                // tell the broker we do not need the resources anymore
                doRelinquish();
            } else {
                // the redeem is in progress.
                // delay the close until the redeem comes back
                logger.info("Received close for a redeeming reservation. Deferring close until redeem completes.");
                closedDuringRedeem = true;
            }
            break;

However, in the example above, Ticketed,Active reservation only exists on exo-sm. It is present neither at ndl-broker or sl-vm-am. sl-hn logs indicate the corresponding VM has already been released based on calendar expiry. But the ticket doesn't seem to be moving to closed state and hence can not be removed. I will look more to see how this case can be handled.

kthare10 commented 4 years ago

Closing a redeeming state is prohibited by the implementation in ReservationClient.java which handles SM reservations. Implementation waits for the redeem to complete and then performs the close, but in the scenario reported above, the redeem never completes. On tests for this scenario, on rocky-hn, the reservations do cleanup after the expiry automatically, but that doesn't seem to be the case on Exo-SM as indicated in the example above.

I propose to add a check on close handling in Ticketed state, if the pending state is Redeeming and term has expired, reservation can be closed. This would be specifically to cleanup stuck reservations. New code depicted in if(term != null) block below.

        case ReservationStates.Ticketed:
            if (pending != ReservationStates.Redeeming) {
                transition("close", ReservationStates.Closed, pending);
                // tell the broker we do not need the resources anymore
                doRelinquish();
            } else {
                if(term != null) {
                    Date now = new Date();
                    if(term.expired(now)) {
                        logger.debug("Ticket has expired, closing a redeeming reservations");
                        transition("close", ReservationStates.Closed, pending);
                    }
                }
                // the redeem is in progress.
                // delay the close until the redeem comes back
                logger.info("Received close for a redeeming reservation. Deferring close until redeem completes.");
                closedDuringRedeem = true;
            }
            break;

kthare10 commented 4 years ago

The wrapper shutdown time can be increased by specifying the following parameter in wrapper.con(/opt/orca-controller/conf/wrapper.conf) file. Default value for this parameter is 30 seconds. I propose to increase it to 300 seconds (5 minutes)

wrapper.shutdown.timeout=30

RENCI-NRIG / orca5

TDB issue on controller recovery #264

Cleanup for [Ticketed,Redeeming] reservations