Rappsilber-Laboratory / XiSearch

Apache License 2.0

User Configurable Watchdog Parameters #110

Closed xLinkKnight closed 3 months ago

xLinkKnight commented 3 months ago

Can the hardcoded parameters that control when the watchdog terminates the search threads be made user-editable?

Specifically in main/java/rappsilber/applications/SimpleXiProcess.java

Code snippet:

        // set up a watchdog that kills off the search if no change happens for a long time
        if (m_config.retrieveObject("WATCHDOG", true)) {
            final long watchdoginterval = 60000;
            watchdog = new Timer("Watchdog", true);
            TimerTask watchdogTask = new TimerTask() {

                int maxCountDown=30;
                {
                    try {
                        maxCountDown=m_config.retrieveObject("WATCHDOG", 30);
                    } catch(Exception e){};
                }
                int tickCountDown=maxCountDown;
                long lastProcessesd=0;
                int checkGC = 10;
                boolean first = true;
                @Override
                public void run() {
                    try {

                        long proc = getProcessedSpectra();
                        if (lastProcessesd !=proc) {
                            lastProcessesd=proc;
                            tickCountDown=maxCountDown;
                            sendPing();
                        } else {
                            // if we are on the first one double the countdown time
                            if ((proc > 0 &&tickCountDown--==0) || (tickCountDown<-maxCountDown)) {
                                Logger.getLogger(this.getClass().getName()).log(Level.SEVERE, "\n"
                                        + "================================\n"
                                        + "==       Watch Dog Kill       ==\n"
                                        + "==        Stacktraces         ==\n"
                                        + "================================\n");

                                Util.logStackTraces(Level.SEVERE);
                                Logger.getLogger(this.getClass().getName()).log(Level.SEVERE, "\n"
                                        + "================================\n"
                                        + "== stacktraces finished ==\n"
                                        + "================================");
                                Logger.getLogger(this.getClass().getName()).log(Level.SEVERE, "Long time no change - assuming something is wrong -> exiting");
                                System.exit(1000);
                            } else {
                                if (first) {
                                    first = false;
                                    return;
                                }
                                System.out.println("****WATCHDOG**** countdown " + tickCountDown);
                                if (tickCountDown%5 == 0) {
                                    Logger.getLogger(this.getClass().getName()).log(Level.WARNING, "Long time no change - count down to kill : " + tickCountDown + " minutes");
                                }
                                // we haven't given up yet so lets ping that we are still alive
                                sendPing();
                            }
                        }       
                        if (--checkGC==0) {
                            checkGC();
                            checkGC=10;
                        }
                    } catch (Exception e) {
                        Logger.getLogger(this.getClass().getName()).log(Level.WARNING,"Error im watchdog : ", e);
                    }
                }

                /**
                 * starts the ping in its own thread so as not to interfere with the watchdog
                 */
                public void sendPing() {
                    // ping the world to say we are still alive
    //                Runnable runnablePing = new Runnable() {
    //                    public void run() {
    //                        m_output.ping();
    //                    }
    //                };
    //                Thread t = new Thread(runnablePing, "ping");
    //                t.setDaemon(true);
    //                t.start();
                }
            };
            watchdog.scheduleAtFixedRate(watchdogTask, 10, watchdoginterval);
        }

We're running into an issue where xiSEARCH exits because the 30-minute watchdog timer expires. Our searches cover 200-2000 protein entries. We don't appear to be resource limited: we've inspected our thread and RAM utilization, and we aren't approaching our memory limit, having set the -Xmx flag to make full use of system memory.

The only thing we've noticed is that the search progresses normally and then CPU utilization plummets after some arbitrary time. For example, invoking a search with 128 threads on our 256-thread system correctly loads the system to ~50%. After about a day, utilization drops to 1-5% with no indication in the log of what changed. No threads were closed. We're using version 1.7.6.7.

We see the following:

--------------------------
--- Thread stack-trace ---
--------------------------
--- 3449 : Search_211
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
java.util.concurrent.ArrayBlockingQueue.offer(Unknown Source)
rappsilber.ms.dataAccess.output.BufferedResultWriter.writeResult(BufferedResultWriter.java:196)
rappsilber.ms.dataAccess.output.AbstractStackedResultWriter.innerWriteResult(AbstractStackedResultWriter.java:38)
rappsilber.ms.dataAccess.output.MinimumRequirementsFilter.writeResult(MinimumRequirementsFilter.java:72)
rappsilber.applications.SimpleXiProcess.outputScanMatches(SimpleXiProcess.java:2006)
rappsilber.applications.SimpleXiProcessMultipleCandidates.process(SimpleXiProcessMultipleCandidates.java:580)
rappsilber.applications.SimpleXiProcess$SearchRunner.run(SimpleXiProcess.java:208)
java.lang.Thread.run(Unknown Source)

--------------------------
--- Thread stack-trace ---
--------------------------
--- 3450 : BufferedResultWriter_batchforward3450
rappsilber.ms.dataAccess.output.BufferedResultWriter.batchWriteResult(BufferedResultWriter.java:230)
rappsilber.ms.dataAccess.output.BufferedResultWriter.processQueueBatch(BufferedResultWriter.java:394)
rappsilber.ms.dataAccess.output.BufferedResultWriter.access$000(BufferedResultWriter.java:48)
rappsilber.ms.dataAccess.output.BufferedResultWriter$1.run(BufferedResultWriter.java:155)
java.lang.Thread.run(Unknown Source)

--------------------------
--- Thread stack-trace ---
--------------------------
--- 3470 : Watchdog
--- DAEMON-THREAD
java.lang.Thread.getStackTrace(Unknown Source)
rappsilber.utils.Util.getStackTraces(Util.java:643)
rappsilber.utils.Util.logStackTraces(Util.java:621)
rappsilber.utils.Util.logStackTraces(Util.java:617)
rappsilber.applications.SimpleXiProcess$1.run(SimpleXiProcess.java:1166)
java.util.TimerThread.mainLoop(Unknown Source)
java.util.TimerThread.run(Unknown Source)

Jul 21, 2024 12:57:29 PM rappsilber.applications.SimpleXiProcess$1 run
SEVERE:
================================
== stacktraces finished ==
================================
Jul 21, 2024 12:57:29 PM rappsilber.applications.SimpleXiProcess$1 run
SEVERE: Long time no change - assuming something is wrong -> exiting
Press any key to continue . . .

I've experimented with the BufferInput and BufferOutput parameters to try to mitigate the issue, but I'm not sure that was the right thing to do.
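For context on the trace above: Search_211 is parked inside a timed ArrayBlockingQueue.offer called from BufferedResultWriter.writeResult, which is what a producer looks like when a bounded queue is full and nothing is draining it quickly enough. The sketch below only reproduces that general java.util.concurrent behaviour. The class name, queue capacity, and timeout are hypothetical and are not taken from xiSEARCH, and it makes no claim about why the writer here fell behind.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedQueueStall {
    /**
     * Tries to enqueue `count` items using a timed offer and returns how many
     * were accepted. With no consumer, every offer after the queue fills up
     * parks for `timeoutMillis` and then returns false.
     */
    public static int offerAll(ArrayBlockingQueue<Integer> queue, int count, long timeoutMillis) {
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            try {
                // Timed offer: waits up to the timeout for free space.
                if (queue.offer(i, timeoutMillis, TimeUnit.MILLISECONDS)) {
                    accepted++;
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop producing.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Hypothetical capacity of 2 and no consumer thread at all.
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);
        int accepted = offerAll(queue, 5, 50);
        System.out.println("accepted " + accepted + " of 5; queue size " + queue.size());
    }
}
```

A thread dump taken while the later offers are waiting would show the producer parked in awaitNanos under ArrayBlockingQueue.offer, the same frames as in the Search_211 trace.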

grandrea commented 3 months ago

Yes, WATCHDOG is editable in the config, and changing it may be necessary for large searches. See https://github.com/Rappsilber-Laboratory/xisearch?tab=readme-ov-file#search-settings . Simply add WATCHDOG:10000 to your config file to increase the timer to 10,000 seconds. I should improve clarity on what that means.
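To make it easier to reason about what a given WATCHDOG value does, here is a minimal sketch that replays only the countdown expression from the snippet quoted in the issue. It assumes that snippet is the code that actually runs (a task scheduled every watchdoginterval = 60000 ms, counting down from the WATCHDOG value). Other xiSEARCH versions may interpret the value differently, so treat the unit here as a property of the quoted snippet only. The class and method names are hypothetical.

```java
public class WatchdogCountdown {
    // Tick length copied from the quoted snippet (watchdoginterval = 60000 ms).
    public static final long TICK_MILLIS = 60_000L;

    /**
     * Replays the kill condition from the quoted snippet for a search whose
     * processed-spectra counter stays stuck at `processed`.
     * Returns the stalled tick on which the kill branch is taken,
     * or -1 if it is not reached within `maxTicks`.
     */
    public static int stalledTicksUntilKill(int maxCountDown, long processed, int maxTicks) {
        int tickCountDown = maxCountDown;
        for (int tick = 1; tick <= maxTicks; tick++) {
            // Same expression as in the snippet. && short-circuits, so the
            // decrement only happens when processed > 0.
            if ((processed > 0 && tickCountDown-- == 0) || (tickCountDown < -maxCountDown)) {
                return tick;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int watchdog : new int[] {30, 10_000}) {
            int ticks = stalledTicksUntilKill(watchdog, 1, 100_000);
            long minutes = ticks * TICK_MILLIS / 60_000L;
            System.out.println("WATCHDOG=" + watchdog + ": kill on stalled tick "
                    + ticks + " (about " + minutes + " min without progress)");
        }
    }
}
```

Under these assumptions the default of 30 reaches the kill branch on the 31st consecutive tick without progress, which matches the roughly 30-minute timeout described in the issue. In this replay a counter stuck at zero never reaches the kill branch, because the decrement is skipped by the short-circuit.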

xLinkKnight commented 3 months ago

Perfect! I missed it because I only reviewed the BasicConfig file, looking for commented-out settings.