dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
292 stars 136 forks source link

stack trace in PinManager (8.2.20) #7128

Open DmitryLitvintsev opened 1 year ago

DmitryLitvintsev commented 1 year ago

When dCache was been restarted after upgrade to 8.2.20 the following stack trace seen (it is not an ongoing issue - I surmise these are pins held over since before restart. But isn't PinManager designed to handle restarts)

8 Apr 2023 08:14:40 (PinManager) [:814126755:814126856 SrmManager PinManagerPin] Uncaught exception in thread PinManager-38
org.springframework.dao.CannotSerializeTransactionException: PreparedStatementCallback; SQL [UPDATE pins SET state = ?,request_id = ? WHERE id = ?]; ERROR: could not serialize access due to concurrent update; nested exception is org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:275)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72)
        at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1443)
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:633)
        at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:862)
        at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:917)
        at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:927)
        at org.dcache.pinmanager.JdbcDao.update(JdbcDao.java:119)
        at org.dcache.pinmanager.LoggingDao.update(LoggingDao.java:329)
        at org.dcache.pinmanager.PinDao.update(PinDao.java:70)
        at org.dcache.pinmanager.PinRequestProcessor.createTask_aroundBody0(PinRequestProcessor.java:523)
        at org.dcache.pinmanager.PinRequestProcessor$AjcClosure1.run(PinRequestProcessor.java:1)
        at org.springframework.transaction.aspectj.AbstractTransactionAspect.ajc$around$org_springframework_transaction_aspectj_AbstractTransactionAspect$1$2a73e96cproceed(AbstractTransactionAspect.aj:67)
        at org.springframework.transaction.aspectj.AbstractTransactionAspect$AbstractTransactionAspect$1.proceedWithInvocation(AbstractTransactionAspect.aj:73)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:367)
        at org.springframework.transaction.aspectj.AbstractTransactionAspect.ajc$around$org_springframework_transaction_aspectj_AbstractTransactionAspect$1$2a73e96c(AbstractTransactionAspect.aj:71)
        at org.dcache.pinmanager.PinRequestProcessor.createTask(PinRequestProcessor.java:508)
        at org.dcache.pinmanager.PinRequestProcessor.messageArrived(PinRequestProcessor.java:177)
        at jdk.internal.reflect.GeneratedMethodAccessor225.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.dcache.cells.CellMessageDispatcher$ShortReceiver.deliver(CellMessageDispatcher.java:272)
        at org.dcache.cells.CellMessageDispatcher.call(CellMessageDispatcher.java:188)
        at org.dcache.cells.AbstractCell.messageArrived(AbstractCell.java:302)
        at dmg.cells.nucleus.CellAdapter.messageArrived(CellAdapter.java:856)
        at dmg.cells.nucleus.CellNucleus$DeliverMessageTask.run(CellNucleus.java:1274)
        at org.dcache.util.BoundedExecutor$Worker.run(BoundedExecutor.java:247)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$2(CellNucleus.java:727)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166)
        at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:134)
        at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
        at org.springframework.jdbc.core.JdbcTemplate.lambda$update$0(JdbcTemplate.java:867)
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:617)
        ... 27 common frames omitted
lemora commented 1 year ago

But isn't PinManager designed to handle restarts

PinManager will not resume processing pinning requests that were created but not finished before the PinManager shut down, if that is what you are asking. These kinds of requests should be deleted by PinManager after it restarts once their short lifetime expires (temporarily set during pin creation).

In any case, these should be transient errors due to a high number of parallel requests to the pins database table at a certain point. Maybe PinManager was busy with lots of reissued requests that were stored in SrmManager (which does persist requests over restarts and will retry them), many new requests or maintenance tasks -- hard to say without more information.

PinManager, or in this case either SrmManager or the client issuing the srm requests should retry them if they are new pin requests. If they are indeed old ones that were attempted to be flagged for removal (setting them to READY_TO_UNPIN), then PinManager will try again.

DmitryLitvintsev commented 1 year ago

This then goes back to my questioning - why do we even need PinManager if it does not recover pin processing and we relies that client will eventually retry. But that is a larger discussion. You think the stack trace should be fixed though?

lemora commented 1 year ago

Yes, the stack trace should be fixed. I'm looking into it.