Closed rosarp closed 7 years ago
When Postres connection is canceled, it needs to be cleaned up and created again.
We're not cancelling the Connection. We're cancelling a Statement.
Are you saying that the Postgres implementation is broken: Cancelling a Statement will render the complete Connection disfunctional? This is the JDBC Statement#cancel javadoc:
/**
* Cancels this <code>Statement</code> object if both the DBMS and
* driver support aborting an SQL statement.
* This method can be used by one thread to cancel a statement that
* is being executed by another thread.
*
* @exception SQLException if a database access error occurs or
* this method is called on a closed <code>Statement</code>
* @exception SQLFeatureNotSupportedException if the JDBC driver does not support
* this method
*/
void cancel() throws SQLException;
Based on the documentation, a specific implementation (like Postgres) may not support cancellation, but that would be indicated by a SQLFeatureNotSupportedException
.
I think what's really happening is indicated by the error
PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block
I don't think we call setAutoCommit(false)
on that connection, but for some reason Postgres seems to consider the cancelled statement a situation where the connection is in the middle of an incomplete transaction.
So maybe try this in RDBArchiveReader#cancel:
# Cancel ongoing statement because reader is closed,
# another request was started, we no longer care about this old request
statement.cancel();
# For Postgres: Close the ongoing transaction
rdb.getConnection().rollback();
I can only try with Oracle, but when we call cancel, autoCommit is on, we don't turn it off, so in principle there are no transactions. Cancelling a statement should just cancel a statement, it doesn't leave any transaction open that needs to be committed or rolled back.
Still, if Postgres for some reason considers a cancelled Statement an open transaction, adding rdb.getConnection().rollback()
to close that transaction doesn't bother Oracle, and if that fixes the Postgres, we can add it.
Thanks Kay for quick help. I agree, by default we have setAutoCommit(true). But for StoredProcedureValueIterator & RawSampleIterator, we have it setAutoCommit(false) while execution. First I thought, only statements marked as cancellable will need rollback. But, it didn't work. Also, when autocommit is off then only connection needs to be rolledback.
Along with this, it also needs delay of 1000 ms in new scheduled ArchiveFetchJob after previous cancel statement.
I have attached below my modified code for review.
diff --git a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
index d853af7..6ba7341 100644
--- a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
+++ b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
@@ -10,6 +10,7 @@
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
+import java.sql.SQLException;
import java.sql.Statement;
import java.time.Duration;
import java.time.Instant;
@@ -511,6 +512,28 @@
"Attempt to cancel statement", ex); //$NON-NLS-1$
}
}
+ // For Postgres: Rollback the ongoing transaction
+ try
+ {
+ Connection conn = rdb.getConnection();
+ synchronized (conn)
+ {
+ if (conn != null && conn.getAutoCommit() == false) {
+ conn.rollback();
+ Logger.getLogger(Activator.ID).log(Level.INFO,
+ "Rollback successful"); //$NON-NLS-1$
+ }
+ }
+ }
+ catch (SQLException e)
+ {
+ Logger.getLogger(Activator.ID).log(Level.WARNING,
+ "Attempt to rollback failed with SQLException", e); //$NON-NLS-1$
+ } catch (Exception e)
+ {
+ Logger.getLogger(Activator.ID).log(Level.WARNING,
+ "Attempt to rollback failed with Exception", e); //$NON-NLS-1$
+ }
}
}
@@ -519,7 +542,6 @@
public void close()
{
cancel();
- ConnectionCache.release(rdb);
}
@Override
diff --git a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
index 1ad33a1..039ff0e 100644
--- a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
+++ b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/ui/Controller.java
@@ -702,6 +702,6 @@
job = new ArchiveFetchJob(pv_item, start, end, archive_fetch_listener);
archive_fetch_jobs.add(job);
}
- job.schedule();
+ job.schedule(1000);
}
}
OK, thanks for finding the reason for the autocommit being turned off. So for Postgres, either the StoredProcedureValueIterator or the RawSampleIterator can setAutoCommit(false), and in case of cancellation a) that's not reverted - bug b) the connection requires a rollback
Why do you synchronize on the 'conn'? That connection is held by just that instance of the RDBArchiveReader. If there is indeed a synchronization issue within the RDBArchiveReader, every access to the 'conn' would need to synchronize. I don't think that's the case, so we shouldn't just sprinkle a single 'synchronize (conn)' in there for no reason.
The 'job.schedule(1000)' might seem to help you right now, but it's the wrong solution. Why do you wait 1 second? For all but your Postgres setup, this simply makes all archive access slower by 1 second. For your Postgres setup, it might "work" right now, but next week you'll find that there are cases where you need to wait 1250ms, so then we make it 2000ms just-in-case? Please try to find the actual underlying problem. Maybe the problem is that the connection was set to autoCommit=false, then cancelled, and while you now add the rollback, the connection is left in the autoCommit=false state? So in the rollback, also restore autoCommit=true, allowing re-use of that connection as soon as possible without need to delay everything by 1 second.
Thanks Kay. I understood, & agree that delay in job schedule is not going to help.
So, I investigated further and found this happening.
There are two connection calls made from WorkerThread.
Now sequence of events is as follows.
Most of the cases, 10 has cancelled == false & 13 has cancelled == true. But since both fails, no result is shown.
This all happens just because now Postgres needs autocommit to be off during transaction. Conflicting at step 10 to fail.
Now, I have modified the code and attached here, Which helps to recover second job from failing, since first one is correctly cancelled and will end. It takes 1 second to cancel WT-1 and recover the state. This is noticeable in progress window.
diff --git a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
index d853af7..4f9aeec 100644
--- a/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
+++ b/applications/archive/archive-plugins/org.csstudio.archive.reader.rdb/src/org/csstudio/archive/reader/rdb/RDBArchiveReader.java
@@ -10,13 +10,13 @@
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
+import java.sql.SQLException;
import java.sql.Statement;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.logging.Level;
-import java.util.logging.Logger;
import org.csstudio.archive.rdb.RDBArchivePreferences;
import org.csstudio.archive.reader.ArchiveInfo;
@@ -25,6 +25,7 @@
import org.csstudio.archive.reader.ValueIterator;
import org.csstudio.archive.vtype.TimestampHelper;
import org.csstudio.platform.utility.rdb.RDBUtil.Dialect;
+import org.csstudio.archive.reader.rdb.Activator;
import org.diirt.util.time.TimeDuration;
import org.diirt.vtype.AlarmSeverity;
@@ -41,6 +42,9 @@
/** Oracle error code "error occurred at recursive SQL level ...: */
final private static String ORACLE_RECURSIVE_ERROR = "ORA-00604"; //$NON-NLS-1$
+
+ /** Poll period in millisecs */
+ private static final int POLL_PERIOD_MS = 1000;
final private boolean use_array_blob;
@@ -155,18 +159,39 @@
private HashMap<Integer, String> getStatusValues() throws Exception
{
final HashMap<Integer, String> stati = new HashMap<Integer, String>();
- try
- (
- final Statement statement = rdb.getConnection().createStatement();
- )
- {
- if (timeout > 0)
- statement.setQueryTimeout(timeout);
- statement.setFetchSize(100);
- final ResultSet result = statement.executeQuery(sql.sel_stati);
- while (result.next())
- stati.put(result.getInt(1), result.getString(2));
- return stati;
+ int i=0;
+ while(true) {
+ try
+ (
+ final Statement statement = rdb.getConnection().createStatement();
+ )
+ {
+ if (timeout > 0)
+ statement.setQueryTimeout(timeout);
+ statement.setFetchSize(100);
+ final ResultSet result = statement.executeQuery(sql.sel_stati);
+ while (result.next())
+ stati.put(result.getInt(1), result.getString(2));
+ return stati;
+ }
+ catch (SQLException ex)
+ {
+ if(i++ < 3)
+ {
+ Thread.sleep(POLL_PERIOD_MS);
+ if(! rdb.getConnection().getAutoCommit())
+ {
+ rdb.getConnection().rollback();
+ rdb.getConnection().setAutoCommit(true);
+ }
+ Activator.getLogger().log(Level.WARNING, "Retrying getStatusValues " + i);
+ continue;
+ }
+ else
+ {
+ throw ex;
+ }
+ }
}
}
@@ -507,10 +532,28 @@
}
catch (Exception ex)
{
- Logger.getLogger(Activator.ID).log(Level.WARNING,
- "Attempt to cancel statement", ex); //$NON-NLS-1$
+ Activator.getLogger().log(Level.WARNING, "Attempt to cancel statement", ex); //$NON-NLS-1$
}
}
+ try
+ {
+ Connection connection = rdb.getConnection();
+ if (connection != null && connection.getAutoCommit() == false)
+ {
+ connection.rollback();
+ connection.setAutoCommit(true);
+ }
+ }
+ catch (SQLException ex)
+ {
+ Activator.getLogger().log(Level.WARNING,
+ "Attempt to rollback failed with SQLException", ex); //$NON-NLS-1$
+ }
+ catch (Exception ex)
+ {
+ Activator.getLogger().log(Level.WARNING,
+ "Attempt to cancel statement failed with Exception", ex); //$NON-NLS-1$
+ }
}
}
diff --git a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java
index 34fe64c..06eb46c 100644
--- a/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java
+++ b/applications/databrowser/databrowser-plugins/org.csstudio.trends.databrowser2/src/org/csstudio/trends/databrowser2/archive/ArchiveFetchJob.java
@@ -7,6 +7,7 @@
******************************************************************************/
package org.csstudio.trends.databrowser2.archive;
+import java.sql.SQLException;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
@@ -151,6 +152,12 @@
break;
value_iter.close();
}
+ catch (SQLException ex)
+ {
+ // user cancelled current transaction exception suggests canceling of transaction
+ Activator.getLogger().log(Level.INFO, "Cancelled status 6 " + cancelled);
+ Activator.getLogger().log(Level.INFO, "SQLException in worker thread: " + ex.getMessage());
+ }
catch (Exception ex)
{ // Tell listener unless it's the result of a 'cancel'?
if (! cancelled)
Thank you very much for looking into the underlying reason!
I'm still not too happy with having to try several times in getStatusValues().
Seems a better solution would be to improve the canceling, step 6 in your list. If the cancel performs a proper cleanup, which might require waiting until all canceled jobs exit, there should be no need for workarounds in the following retrieval jobs.
I'll look into that.
I'm updating the ArchiveFetchJob to exit sooner once the worker thread is done, eliminating the 1 second delay in there.
Also updating the Controller to wait until a cancelled ArchiveFetchJob actually finishes, so the connection (auto-commit or not) is then properly reset.
Can you try the changes from the pull request?
In my tests, they result in ongoing ArchiveFetchJobs to properly end before starting new ones for an item, cancelled or not.
There might still be problems with Postgres seeing statements in a bad state. We need to fix that by adding for example try .. finally
to the RDBArchiveReader
, restoring the auto-commit mode or rolling back transactions so that the next ArchiveFetchJob can then succeed.
I checked the changes on postgres.
As you depicted, it needs try...catch to clean the connections broken by two scenarios.
I have modified RDBArchiveReader to add rollback & setAutommit(true) and created a pull request. Please review.
Does it now overall work for you with Postgres?
Yes Kay. It works for me with Postgres for below mentioned scenarios.
OK, closing the issue
Plotting a PV over a long period of time will end up in making a request to the BEAUTY RDB which will take many minutes - the progress view will show the background operation.
During the execution of the query, it is possible to change the time request for the plot by changing start/end time or by using the panning tool. In this case, the previous request will be aborted and a new one sent.
But it happens that interacting with the plot properties view without changing the time scale will end unexpectedly the request with the following exception:
From then on, the RDB connection has to be re-initialised and the simple way to do that is to restart cs-studio.
Solution: I found two issues here.
To resolve this, I suggest below changes.
These changes will be required at least by Postgres Db. With other Db, if this issue is not reproducible, then it might have performance impact.
P.S. : I am working ITER CSS.