ControlSystemStudio / cs-studio

Control System Studio is an Eclipse-based collections of tools to monitor and operate large scale control systems, such as the ones in the accelerator community.
https://controlsystemstudio.org/
Eclipse Public License 1.0
113 stars 96 forks source link

RDBArchiveReader - Postgres connection & multi-threading issue #1929

Closed rosarp closed 7 years ago

rosarp commented 8 years ago

Plotting a PV over a long period of time will end up in making a request to the BEAUTY RDB which will take many minutes - the progress view will show the background operation.

During the execution of the query, it is possible to change the time request for the plot by changing start/end time or by using the panning tool. In this case, the previous request will be aborted and a new one sent.

But it happens that interacting with the plot properties view without changing the time scale will end unexpectedly the request with the following exception:

2016-10-13 07:20:35.861 WARNING [Thread 151]
org.csstudio.trends.databrowser2.ui.Controller$1 (archiveFetchFailed) - No
archived data for UTIL-S15-0000:15KV_ACT_POWER
org.postgresql.util.PSQLException: ERROR: current transaction is aborted,
commands ignored until end of transaction block
    at
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2198)
    at
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1927)
    at
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
    at
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:561)
    at
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:405)
    at
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:285)
    at
org.csstudio.archive.reader.rdb.RDBArchiveReader.getStatusValues(RDBArchiveReader.java:166)
    at
org.csstudio.archive.reader.rdb.RDBArchiveReader.<init>(RDBArchiveReader.java:134)
    at
org.csstudio.archive.reader.rdb.RDBArchiveReader.<init>(RDBArchiveReader.java:85)
    at
org.csstudio.archive.reader.rdb.RDBArchiveReaderFactory.getArchiveReader(RDBArchiveReaderFactory.java:48)
    at
org.csstudio.archive.reader.ArchiveRepository.getArchiveReader(ArchiveRepository.java:111)
    at
org.csstudio.trends.databrowser2.archive.ArchiveFetchJob$WorkerThread.run(ArchiveFetchJob.java:124)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

From then on, the RDB connection has to be re-initialised and the simple way to do that is to restart cs-studio.

Solution: I found two issues here.

  1. When Postres connection is canceled, it needs to be cleaned up and created again.
  2. Worker thread needs little more time to properly cancel and cleanup reader before new ArchiveFetchJob is created by Controller.

To resolve this, I suggest below changes.

diff --git a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
index d853af7..31e2fa5 100644
--- a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
+++ b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
@@ -504,6 +504,9 @@
                     // does NOT stop an ongoing Oracle query!
                     // Only this seems to do it:
                     statement.cancel();
+                    // When connection is canceled
+                    // it gets broken and needs to re-establish again
+                    ConnectionCache.clean();
                 }
                 catch (Exception ex)
                 {
@@ -519,7 +522,6 @@
     public void close()
     {
         cancel();
-        ConnectionCache.release(rdb);
     }

     @Override
diff --git a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
index 1ad33a1..039ff0e 100644
--- a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
+++ b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
@@ -702,6 +702,6 @@
             job = new ArchiveFetchJob(pv_item, start, end, archive_fetch_listener);
             archive_fetch_jobs.add(job);
         }
-        job.schedule();
+        job.schedule(1000);
     }
 }

These changes will be required at least by Postgres Db. With other Db, if this issue is not reproducible, then it might have performance impact.

P.S. : I am working ITER CSS.

kasemir commented 8 years ago

When Postres connection is canceled, it needs to be cleaned up and created again.

We're not cancelling the Connection. We're cancelling a Statement.

Are you saying that the Postgres implementation is broken: Cancelling a Statement will render the complete Connection disfunctional? This is the JDBC Statement#cancel javadoc:

    /**
     * Cancels this <code>Statement</code> object if both the DBMS and
     * driver support aborting an SQL statement.
     * This method can be used by one thread to cancel a statement that
     * is being executed by another thread.
     *
     * @exception SQLException if a database access error occurs or
     * this method is called on a closed <code>Statement</code>
     * @exception SQLFeatureNotSupportedException if the JDBC driver does not support
     * this method
     */
    void cancel() throws SQLException;

Based on the documentation, a specific implementation (like Postgres) may not support cancellation, but that would be indicated by a SQLFeatureNotSupportedException.

I think what's really happening is indicated by the error

PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block

I don't think we call setAutoCommit(false) on that connection, but for some reason Postgres seems to consider the cancelled statement a situation where the connection is in the middle of an incomplete transaction.

So maybe try this in RDBArchiveReader#cancel:

                    # Cancel ongoing statement because reader is closed,
                    # another request was started, we no longer care about this old request
                    statement.cancel();
                    # For Postgres: Close the ongoing transaction
                    rdb.getConnection().rollback();
kasemir commented 8 years ago

I can only try with Oracle, but when we call cancel, autoCommit is on, we don't turn it off, so in principle there are no transactions. Cancelling a statement should just cancel a statement, it doesn't leave any transaction open that needs to be committed or rolled back. Still, if Postgres for some reason considers a cancelled Statement an open transaction, adding rdb.getConnection().rollback() to close that transaction doesn't bother Oracle, and if that fixes the Postgres, we can add it.

rosarp commented 7 years ago

Thanks Kay for quick help. I agree, by default we have setAutoCommit(true). But for StoredProcedureValueIterator & RawSampleIterator, we have it setAutoCommit(false) while execution. First I thought, only statements marked as cancellable will need rollback. But, it didn't work. Also, when autocommit is off then only connection needs to be rolledback.

Along with this, it also needs delay of 1000 ms in new scheduled ArchiveFetchJob after previous cancel statement.

I have attached below my modified code for review.

diff --git a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
index d853af7..6ba7341 100644
--- a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
+++ b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
@@ -10,6 +10,7 @@
 import java.sql.Connection;
 import java.sql.PreparedStatement;
 import java.sql.ResultSet;
+import java.sql.SQLException;
 import java.sql.Statement;
 import java.time.Duration;
 import java.time.Instant;
@@ -511,6 +512,28 @@
                         "Attempt to cancel statement", ex); //$NON-NLS-1$
                 }
             }
+            // For Postgres: Rollback the ongoing transaction
+            try
+            {
+                Connection conn = rdb.getConnection();
+                synchronized (conn)
+                {
+                    if (conn != null && conn.getAutoCommit() == false) {
+                        conn.rollback();
+                        Logger.getLogger(Activator.ID).log(Level.INFO,
+                                "Rollback successful"); //$NON-NLS-1$
+                    }
+                }
+            }
+            catch (SQLException e)
+            {
+                Logger.getLogger(Activator.ID).log(Level.WARNING,
+                        "Attempt to rollback failed with SQLException", e); //$NON-NLS-1$
+            } catch (Exception e)
+            {
+                Logger.getLogger(Activator.ID).log(Level.WARNING,
+                        "Attempt to rollback failed with Exception", e); //$NON-NLS-1$
+            }
         }
     }

@@ -519,7 +542,6 @@
     public void close()
     {
         cancel();
-        ConnectionCache.release(rdb);
     }

     @Override
diff --git a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
index 1ad33a1..039ff0e 100644
--- a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
+++ b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
@@ -702,6 +702,6 @@
             job = new ArchiveFetchJob(pv_item, start, end, archive_fetch_listener);
             archive_fetch_jobs.add(job);
         }
-        job.schedule();
+        job.schedule(1000);
     }
 }
kasemir commented 7 years ago

OK, thanks for finding the reason for the autocommit being turned off. So for Postgres, either the StoredProcedureValueIterator or the RawSampleIterator can setAutoCommit(false), and in case of cancellation a) that's not reverted - bug b) the connection requires a rollback

Why do you synchronize on the 'conn'? That connection is held by just that instance of the RDBArchiveReader. If there is indeed a synchronization issue within the RDBArchiveReader, every access to the 'conn' would need to synchronize. I don't think that's the case, so we shouldn't just sprinkle a single 'synchronize (conn)' in there for no reason.

The 'job.schedule(1000)' might seem to help you right now, but it's the wrong solution. Why do you wait 1 second? For all but your Postgres setup, this simply makes all archive access slower by 1 second. For your Postgres setup, it might "work" right now, but next week you'll find that there are cases where you need to wait 1250ms, so then we make it 2000ms just-in-case? Please try to find the actual underlying problem. Maybe the problem is that the connection was set to autoCommit=false, then cancelled, and while you now add the rollback, the connection is left in the autoCommit=false state? So in the rollback, also restore autoCommit=true, allowing re-use of that connection as soon as possible without need to delay everything by 1 second.

rosarp commented 7 years ago

Thanks Kay. I understood, & agree that delay in job schedule is not going to help.

So, I investigated further and found this happening.

There are two connection calls made from WorkerThread.

  1. getArchiveReader
  2. getRawValues or getOptimizedValues

Now sequence of events is as follows.

  1. Controller creates Job [ArchiveFetchJob A-1]
  2. A-1 submits WorkerThread [WT-1] to run.
  3. WT-1 runs getArchiveReader
  4. WT-1 runs getRawValues or getOptimizedValues [long running methods]
  5. Meanwhile user clicks on graph and it triggers same query again after 1sec delay.
  6. Since only one such job should run! Controller triggers cancel on all current ArchiveFetchJobs i.e. A-1 in this case.
  7. Controller creates Job [ArchiveFetchJob A-2]
  8. A-2 submits WorkerThread [WT-2] to run.
  9. WT-2 runs getArchiveReader
  10. WT-2 fails because either current transaction is not completed & setAutocommit(false)
  11. A-1 monitor wakes up and finds isCanceling == true
  12. A-1 calls cancel on W-1
  13. W-1 exits with cancelled by user exception.

Most of the cases, 10 has cancelled == false & 13 has cancelled == true. But since both fails, no result is shown.

This all happens just because now Postgres needs autocommit to be off during transaction. Conflicting at step 10 to fail.

Now, I have modified the code and attached here, Which helps to recover second job from failing, since first one is correctly cancelled and will end. It takes 1 second to cancel WT-1 and recover the state. This is noticeable in progress window.

diff --git a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
index d853af7..4f9aeec 100644
--- a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
+++ b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
@@ -10,13 +10,13 @@
 import java.sql.Connection;
 import java.sql.PreparedStatement;
 import java.sql.ResultSet;
+import java.sql.SQLException;
 import java.sql.Statement;
 import java.time.Duration;
 import java.time.Instant;
 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.logging.Level;
-import java.util.logging.Logger;

 import org.csstudio.archive.rdb.RDBArchivePreferences;
 import org.csstudio.archive.reader.ArchiveInfo;
@@ -25,6 +25,7 @@
 import org.csstudio.archive.reader.ValueIterator;
 import org.csstudio.archive.vtype.TimestampHelper;
 import org.csstudio.platform.utility.rdb.RDBUtil.Dialect;
+import org.csstudio.archive.reader.rdb.Activator;
 import org.diirt.util.time.TimeDuration;
 import org.diirt.vtype.AlarmSeverity;

@@ -41,6 +42,9 @@

     /** Oracle error code "error occurred at recursive SQL level ...: */
     final private static String ORACLE_RECURSIVE_ERROR = "ORA-00604"; //$NON-NLS-1$
+
+    /** Poll period in millisecs */
+    private static final int POLL_PERIOD_MS = 1000;

     final private boolean use_array_blob;

@@ -155,18 +159,39 @@
     private HashMap<Integer, String> getStatusValues() throws Exception
     {
         final HashMap<Integer, String> stati = new HashMap<Integer, String>();
-        try
-        (
-            final Statement statement = rdb.getConnection().createStatement();
-        )
-        {
-            if (timeout > 0)
-                statement.setQueryTimeout(timeout);
-            statement.setFetchSize(100);
-            final ResultSet result = statement.executeQuery(sql.sel_stati);
-            while (result.next())
-                stati.put(result.getInt(1), result.getString(2));
-            return stati;
+        int i=0;
+        while(true) {
+            try
+            (
+                final Statement statement = rdb.getConnection().createStatement();
+            )
+            {
+                if (timeout > 0)
+                    statement.setQueryTimeout(timeout);
+                statement.setFetchSize(100);
+                final ResultSet result = statement.executeQuery(sql.sel_stati);
+                while (result.next())
+                    stati.put(result.getInt(1), result.getString(2));
+                return stati;
+            }
+            catch (SQLException ex)
+            {
+                if(i++ < 3)
+                {
+                    Thread.sleep(POLL_PERIOD_MS);
+                    if(! rdb.getConnection().getAutoCommit())
+                    {
+                        rdb.getConnection().rollback();
+                        rdb.getConnection().setAutoCommit(true);
+                    }
+                    Activator.getLogger().log(Level.WARNING, "Retrying getStatusValues " + i);
+                    continue;
+                }
+                else
+                {
+                    throw ex;
+                }
+            }
         }
     }

@@ -507,10 +532,28 @@
                 }
                 catch (Exception ex)
                 {
-                    Logger.getLogger(Activator.ID).log(Level.WARNING,
-                        "Attempt to cancel statement", ex); //$NON-NLS-1$
+                    Activator.getLogger().log(Level.WARNING, "Attempt to cancel statement", ex); //$NON-NLS-1$
                 }
             }
+            try
+            {
+                Connection connection = rdb.getConnection();
+                if (connection != null && connection.getAutoCommit() == false)
+                {
+                    connection.rollback();
+                    connection.setAutoCommit(true);
+                }
+            }
+            catch (SQLException ex)
+            {
+                Activator.getLogger().log(Level.WARNING,
+                        "Attempt to rollback failed with SQLException", ex); //$NON-NLS-1$
+            }
+            catch (Exception ex)
+            {
+                Activator.getLogger().log(Level.WARNING,
+                    "Attempt to cancel statement failed with Exception", ex); //$NON-NLS-1$
+            }
         }
     }

diff --git a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java
index 34fe64c..06eb46c 100644
--- a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java
+++ b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java
@@ -7,6 +7,7 @@
  ******************************************************************************/
 package org.csstudio.trends.databrowser2.archive;

+import java.sql.SQLException;
 import java.time.Instant;
 import java.util.ArrayList;
 import java.util.List;
@@ -151,6 +152,12 @@
                         break;
                     value_iter.close();
                 }
+                catch (SQLException ex)
+                {
+                    // user cancelled current transaction exception suggests canceling of transaction
+                    Activator.getLogger().log(Level.INFO, "Cancelled status 6 " + cancelled);
+                    Activator.getLogger().log(Level.INFO, "SQLException in worker thread: " + ex.getMessage());
+                }
                 catch (Exception ex)
                 {   // Tell listener unless it's the result of a 'cancel'?
                     if (! cancelled)
kasemir commented 7 years ago

Thank you very much for looking into the underlying reason!

I'm still not too happy with having to try several times in getStatusValues().

Seems a better solution would be to improve the canceling, step 6 in your list. If the cancel performs a proper cleanup, which might require waiting until all canceled jobs exit, there should be no need for workarounds in the following retrieval jobs.

I'll look into that.

kasemir commented 7 years ago

I'm updating the ArchiveFetchJob to exit sooner once the worker thread is done, eliminating the 1 second delay in there.

Also updating the Controller to wait until a cancelled ArchiveFetchJob actually finishes, so the connection (auto-commit or not) is then properly reset.

kasemir commented 7 years ago

Can you try the changes from the pull request?

In my tests, they result in ongoing ArchiveFetchJobs to properly end before starting new ones for an item, cancelled or not.

There might still be problems with Postgres seeing statements in a bad state. We need to fix that by adding for example try .. finally to the RDBArchiveReader, restoring the auto-commit mode or rolling back transactions so that the next ArchiveFetchJob can then succeed.

rosarp commented 7 years ago

I checked the changes on postgres.

As you depicted, it needs try...catch to clean the connections broken by two scenarios.

  1. New job on same PV causing previous job to cancel
  2. User cancelling the Job.

I have modified RDBArchiveReader to add rollback & setAutommit(true) and created a pull request. Please review.

kasemir commented 7 years ago

Does it now overall work for you with Postgres?

rosarp commented 7 years ago

Yes Kay. It works for me with Postgres for below mentioned scenarios.

  1. New job on same PV causing previous job to cancel
  2. User cancelling the Job.
kasemir commented 7 years ago

OK, closing the issue