Closed tthvo closed 2 months ago
Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.
~Discovery works fine with podman while checks are done against database~ Though, I have run into these 2 issues that I need some advice on:
Ahh I was about to ask! Thanks I will try that!
Ahh, I am seeing an odd violation during startup:
ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception
compose-cryostat-1 | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key"
compose-cryostat-1 | Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]
Looks good to me!! It was very helpful! I tried with these steps to test this PR:
```bash
# Initial setup
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV
podman pod create --replace --name cryostat-pod
podman run \
    --name jmxquarkus \
    --pod cryostat-pod \
    --label io.cryostat.discovery="true" \
    --label io.cryostat.jmxPort="51423" \
    --env QUARKUS_HTTP_PORT=10012 \
    --rm -d quay.io/roberttoyonaga/jmx:jmxquarkus@sha256:b067f29faa91312d20d43c55d194a2e076de7d0d094da3d43ee7d2b2b5a6f100
podman run \
    --name vertx-fib-demo-0 \
    --env HTTP_PORT=8079 \
    --env JMX_PORT=9089 \
    --env START_DELAY=60 \
    --pod cryostat-pod \
    --label io.cryostat.discovery="true" \
    --label io.cryostat.jmxHost="vertx-fib-demo-0" \
    --label io.cryostat.jmxPort="9089" \
    --rm -d quay.io/andrewazores/vertx-fib-demo:0.13.1

# Kill containers and restart
# The discovery tree should remain the same
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

# Now, kill the containers again
# Remove the pod
podman pod rm cryostat-pod -f

# Start again
# The pod and its containers should be pruned
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV
```
Ahh, I am seeing an odd violation during startup:
ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception compose-cryostat-1 | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key" compose-cryostat-1 | Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]
Okay, this should be fixed now thanks to the synchronized keyword. Not sure what happened there, but it was something related to threading during the transaction. Otherwise, I just have the above questions: https://github.com/cryostatio/cryostat3/pull/423#issuecomment-2080168900
There are some conflicts in connectUrl between podman and JDP during the compose smoketest, for example jfr-datasource. I think it happens on main too.
Yea, that was semi-intentional back when I first put all that together just so I could test that Cryostat/the database would properly reject the definitions due to the duplicate URLs. I think by now that's already well-proven, so we can change it to do something different - give the containers different hostnames to be referenced by, or just turn off one of the discovery mechanisms in the smoketest by default so they don't overlap/collide.
The approach I used here (same as KubeAPIDiscovery) is to run a single transaction for all discovered targets. If one fails on startup (i.e. a duplicate key), the entire transaction will be aborted and the realm will remain empty. Is this desirable? Or should we use one transaction per target (what about performance)?
It's a good question. I think from a user perspective it would be much more preferable that it works a bit more slowly but is able to discover as many of the targets as possible, vs being very fast but if it runs into any errors then leaving me with no discovered targets at all.
Anyway, if we do determine there are performance issues due to database accesses and the associated latency, Quarkus/Hibernate/JPA has a lot of powerful caching capabilities:
https://quarkus.io/guides/hibernate-orm#caching
so it seems like some judicious application of the @Cacheable annotation would probably help solve that, if we need it to be solved.
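The per-target-transaction alternative discussed above could be sketched in plain Java, with an in-memory map standing in for the database's unique constraint on connectUrl (all names here are illustrative, not Cryostat's actual API): each target gets its own unit of work, so one duplicate-key failure does not empty the whole realm.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerTargetPersist {
    // Stands in for the database table with its unique connectUrl constraint.
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // One "transaction" per target: a failure is caught and isolated,
    // so the remaining targets are still persisted.
    public int persistAll(List<String> connectUrls) {
        int persisted = 0;
        for (String url : connectUrls) {
            try {
                persistOne(url);
                persisted++;
            } catch (IllegalStateException duplicate) {
                // Log and continue: the rest of the realm is still discovered.
                System.err.println("skipping duplicate: " + url);
            }
        }
        return persisted;
    }

    private void persistOne(String url) {
        if (store.putIfAbsent(url, url) != null) {
            throw new IllegalStateException("duplicate key: " + url);
        }
    }

    public static void main(String[] args) {
        PerTargetPersist p = new PerTargetPersist();
        int n = p.persistAll(List.of(
                "service:jmx:rmi:///jndi/rmi://a:9091/jmxrmi",
                "service:jmx:rmi:///jndi/rmi://a:9091/jmxrmi", // duplicate
                "service:jmx:rmi:///jndi/rmi://b:9091/jmxrmi"));
        System.out.println(n); // the two distinct targets survive the duplicate
    }
}
```

The batch-wide transaction behaves like the opposite: the first `IllegalStateException` would propagate and none of the three entries would be kept.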
Okay this should be fixed now thanks to the synchronized keyword. Not sure what happens there but something related to threading during transaction. Otherwise, I just have the above questions: https://github.com/cryostatio/cryostat3/pull/423#issuecomment-2080168900
This means that if there are many worker threads invoked by the informer, all those threads will get blocked waiting for the lock to perform their database transaction, which will be relatively slow because they're waiting for database I/O.
I suggest another approach: have the callback (invoked by worker threads?) just add the event to a blocking queue for processing, and have a dedicated thread that takes items from the queue as they become available and processes them in a transaction. This way there is only one thread that is spending a lot of its time waiting for database I/O before moving on to process the next event and waiting for the database again, and the other threads are free to do other work.
It would be a similar pattern to how I wrote the WebSocket MessagingServer:
"event" (notification to be sent): https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L144
task queue processor: https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L95
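The queue-plus-dedicated-thread pattern can be sketched in plain Java (illustrative names; the real implementation would look more like the MessagingServer linked above): producers enqueue events cheaply, and a single worker drains the queue and does the slow database work in order.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuedEventProcessor {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> processed = new CopyOnWriteArrayList<>();
    private final Thread worker = new Thread(this::drain, "event-processor");

    QueuedEventProcessor() {
        worker.setDaemon(true); // let the JVM exit in this demo
    }

    void start() {
        worker.start();
    }

    // Called from informer worker threads: cheap, never waits on database I/O.
    void onEvent(String evt) {
        queue.add(evt);
    }

    private void drain() {
        try {
            while (true) {
                String evt = queue.take(); // blocks until an event arrives
                // The real code would open a transaction here and persist/prune.
                processed.add(evt);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        QueuedEventProcessor p = new QueuedEventProcessor();
        p.start();
        p.onEvent("FOUND container-a");
        p.onEvent("LOST container-b");
        Thread.sleep(500); // crude wait for the worker to drain the queue
        System.out.println(p.processed);
    }
}
```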
It is probably also possible to achieve this using the EventBus alone without an explicit ExecutorService, with something like this pseudocode:

```java
class ContainerDiscovery {
    void onStart() {
        informer.listen(evt -> bus.publish(ContainerDiscovery.class.getName(), evt));
    }

    @ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
    @Transactional
    void onEvent(ContainerEvent evt) {
        ... // process the event, add/remove from database
    }
}
```
Sounds good to me! Just need to look into it a little deeper, so I am marking this as a draft for now. I suppose we also want to apply these changes to the kubeAPI discovery as well?
I think that makes sense.
/build_test
Workflow started at 5/2/2024, 12:57:34 AM. View Actions Run.
No OpenAPI schema changes detected.
No GraphQL schema changes detected.
CI build and push: All tests pass ✅ (JDK17) https://github.com/cryostatio/cryostat3/actions/runs/8918817387
```java
@ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
@Transactional
void onEvent(ContainerEvent evt) {
    ... // process the event, add/remove from database
}
```
Unfortunately, for this case, we still need ordered = true, as it ensures the pod node is persisted only once. Otherwise, multiple discovery FOUND event messages will persist the pod node multiple times, which causes the same issue as #420. I guess its performance is now similar to the synchronized keyword?
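A toy illustration of why serialized consumption matters here (plain Java, hypothetical names): persisting the shared pod node is a non-atomic check-then-insert, which is only safe when events are handled strictly one at a time. Run concurrently, both checks could pass before either insert, persisting the pod node twice.

```java
import java.util.ArrayList;
import java.util.List;

public class OrderedConsumption {
    private final List<String> podNodes = new ArrayList<>();

    // The non-atomic check-then-insert that is only safe when serialized.
    void onFoundEvent(String podName) {
        if (!podNodes.contains(podName)) {
            podNodes.add(podName);
        }
    }

    public static void main(String[] args) {
        OrderedConsumption c = new OrderedConsumption();
        // ordered = true means these handlers run strictly one after another,
        // so the second FOUND event sees the node created by the first:
        c.onFoundEvent("cryostat-pod");
        c.onFoundEvent("cryostat-pod");
        System.out.println(c.podNodes.size()); // pod node persisted once
    }
}
```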
i.e. if you don't specify value = "pkg.Class" in the @ConsumeEvent annotation, it defaults to the class name. Then you can just publish to the class name.
Phew, took me a while to realize that the @ConsumeEvent annotation must be specified on the sub-class, so I had to do a workaround to allow each sub-class to specify its own address. This also helps with separating the podman and docker discovery events.
I think this should be working now? The changes are such that:
The handler that persists/prunes the target is invoked in a separate transaction, responding to a message received from the event bus, thus satisfying the requirement here:
It's a good question. I think from a user perspective it would be much more preferable that it works a bit more slowly but is able to discover as many of the targets as possible, vs being very fast but if it runs into any errors then leaving me with no discovered targets at all.
But, it seems that when using bash smoketest.bash -OV, there are initially no LOST messages emitted from ContainerDiscovery, yet the target list appears empty at first. Not sure what other mechanisms are pruning targets at startup...
Another thing I notice with this eventBus approach is that the same message might be emitted multiple times. I think it's because eventBus.publish is called within a transaction, so if the current one is retried, the same message will be sent again. Though, this PR also handles that :D
Ah actually, I am running into a bit of a problem when targets are being deleted. However, deletion on start-up works just fine. Any ideas? Maybe the List<DiscoveryNode> children needs to be eagerly initialized?
ERROR: Uncaught exception received by Vert.x
org.hibernate.LazyInitializationException: Unable to perform requested lazy initialization [io.cryostat.discovery.DiscoveryNode.children] - session is closed and settings disallow loading outside the Session
at org.hibernate.bytecode.enhance.spi.interceptor.EnhancementHelper.throwLazyInitializationException(EnhancementHelper.java:275)
at org.hibernate.bytecode.enhance.spi.interceptor.EnhancementHelper.performWork(EnhancementHelper.java:174)
at org.hibernate.bytecode.enhance.spi.interceptor.EnhancementAsProxyLazinessInterceptor.handleRead(EnhancementAsProxyLazinessInterceptor.java:102)
at org.hibernate.bytecode.enhance.spi.interceptor.AbstractInterceptor.readObject(AbstractInterceptor.java:152)
at io.cryostat.discovery.DiscoveryNode.$$_hibernate_read_children(DiscoveryNode.java)
at io.cryostat.discovery.DiscoveryNode.getChildren(DiscoveryNode.java)
at io.cryostat.discovery.ContainerDiscovery.updateDiscoveryTree(ContainerDiscovery.java:489)
at io.cryostat.discovery.PodmanDiscovery.handleContainerEvent(ContainerDiscovery.java:109)
at io.cryostat.discovery.PodmanDiscovery_Subclass.handleContainerEvent$$superforward(Unknown Source)
at io.cryostat.discovery.PodmanDiscovery_Subclass$$function$$1.apply(Unknown Source)
at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:73)
at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:62)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:136)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:107)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.doIntercept(TransactionalInterceptorRequired.java:38)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.intercept(TransactionalInterceptorBase.java:61)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.intercept(TransactionalInterceptorRequired.java:32)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired_Bean.intercept(Unknown Source)
at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:27)
at io.cryostat.discovery.PodmanDiscovery_Subclass.handleContainerEvent(Unknown Source)
at io.cryostat.discovery.PodmanDiscovery_ClientProxy.handleContainerEvent(Unknown Source)
at io.cryostat.discovery.PodmanDiscovery_VertxInvoker_handleContainerEvent_151fd884ccd88ca4e8217a3be771267e67b76ed9.invokeBean(Unknown Source)
at io.quarkus.vertx.runtime.EventConsumerInvoker.invoke(EventConsumerInvoker.java:41)
at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:116)
at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:112)
at io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:180)
at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:277)
at io.vertx.core.impl.ContextBase.lambda$internalExecuteBlocking$2(ContextBase.java:199)
at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:77)
at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)
at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840)
I guess its performance now is similar to synchronized keyword?
In terms of only allowing one event to be processed at a time, yes I think so. But I think it is still leaving more threads from the worker pool free to perform other work, rather than waiting for the lock to free so they can do the work, so it should perform better in particular if there are multiple CPUs available to actually do things in parallel.
Another thing I notice with this eventBus approach is that the same message might be emitted multiple times. I think its because eventBus.publish is called within a transaction so if current one is retried, the same message will be sent again.
That's interesting. What causes the transaction to have to be retried... ? That might be more telling as a root cause of some underlying problem. Are transactions frequently failing because of some constraint violations?
But, seems like when using bash smoketest.bash -OV, there are initially no LOST message emitted from ContainerDiscovery, but the target list appears empty at first. Not sure what other mechanisms are pruning targets at first...
I think this is probably because of this smoketest configuration:
The usual behaviour of the smoketest is to start from scratch completely, so this line actually resets the database even if there is data left behind in the storage volume.
https://quarkus.io/guides/hibernate-orm#dev-mode
Maybe a patch like this will help:
```diff
diff --git a/compose/db.yml b/compose/db.yml
index 8c5fc19..aea89af 100644
--- a/compose/db.yml
+++ b/compose/db.yml
@@ -2,7 +2,7 @@ version: "3"
 services:
   cryostat:
     environment:
-      QUARKUS_HIBERNATE_ORM_DATABASE_GENERATION: drop-and-create
+      QUARKUS_HIBERNATE_ORM_DATABASE_GENERATION: ${DATABASE_GENERATION:-drop-and-create}
       QUARKUS_DATASOURCE_USERNAME: cryostat3
       QUARKUS_DATASOURCE_PASSWORD: cryostat3
       QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://db:5432/cryostat3
diff --git a/smoketest.bash b/smoketest.bash
index 2c1e694..03c22b8 100755
--- a/smoketest.bash
+++ b/smoketest.bash
@@ -65,6 +65,7 @@ while getopts "hs:prGtOVXcbn" opt; do
             ;;
         V)
             KEEP_VOLUMES=true
+            DATABASE_GENERATION=update
             ;;
         X)
             FILES+=("${DIR}/compose/db-viewer.yml")
@@ -111,6 +112,7 @@ fi
 export CRYOSTAT_HTTP_HOST
 export CRYOSTAT_HTTP_PORT
 export GRAFANA_DASHBOARD_EXT_URL
+export DATABASE_GENERATION
 s3Manifest="${DIR}/compose/s3-${s3}.yml"
 STORAGE_PORT="$(yq '.services.*.expose[0]' "${s3Manifest}" | grep -v null)"
```
Ah actually, I am running into a bit of a problem when targets are being deleted. However, deletion on start-up works just fine. Any ideas? Maybe the List<DiscoveryNode> children needs to be eagerly initialized?
The children field is already eagerly initialized at the application layer, but this looks like the relation at the database layer, and the lazy fetch is the problem. I don't think we want to eagerly fetch, because not every time we query for a particular DiscoveryNode do we also want to fetch all its child nodes - especially because this would cause a recursive eager query, so you would have to pull the node and its whole subtree out of the database every single time.
Maybe there is a missing @Transactional/QuarkusTransaction.joiningExisting()? Or maybe the thread that is actually running during the delete operation is not the same expected thread that was initially within the transaction context?
https://www.linkedin.com/pulse/hibernate-lazy-initialization-exception-spring-sara-alhosin/
That's interesting. What causes the transaction to have to be retried... ? That might be more telling as a root cause of some underlying problem. Are transactions frequently failing because of some constraint violations?
I think there are some failures in operations with active recordings as I see in logs...
Ah actually, I am running into a bit of a problem when targets are being deleted. However, deletion on start-up works just fine. Any ideas? Maybe the List<DiscoveryNode> children needs to be eagerly initialized?
The children field is already eagerly initialized at the application layer, but this looks like the relation at the database layer and the lazy fetch is the problem. I don't think we want to eagerly fetch, because not every time we query for a particular DiscoveryNode do we also want to fetch all its child nodes - especially because this would cause a recursive eager query, so you would have to pull the node and its whole subtree out of the database every single time. Maybe there is a missing @Transactional/QuarkusTransaction.joiningExisting()? Or maybe the thread that is actually running during the delete operation is not the same expected thread that was initially within the transaction context? https://www.linkedin.com/pulse/hibernate-lazy-initialization-exception-spring-sara-alhosin/
Right, that makes sense! I can try them out. It seems that this issue also occurs in #428. So likely the way I set it up...
I think there are some failures in operations with active recordings as I see in logs...
Are they transaction failures? I guess they don't have to be, they could be exceptions that get thrown during the context of a transaction and force a rollback...
Maybe that means that within these discovery classes we shouldn't use joiningExisting() and instead use requiringNew()? I wouldn't think that these discovery events and the other active recording operations are happening on the same threads or in the same transaction contexts, but I don't know enough to say they definitely aren't. But I also don't think that will have the same nice "nested operation flattening" semantics, so we would need to be careful about that.
Maybe that means that within these discovery classes we shouldn't use joiningExisting() and instead use requiringNew()? I wouldn't think that these discovery events and the other active recording operations are happening on the same threads or in the same transaction contexts, but I don't know enough to say they definitely aren't. But I also don't think that will have the same nice "nested operation flattening" semantics, so we would need to be careful about that
I was wondering about that too. So should we use requiringNew()? I read in the docs that it will suspend any existing transactions and continue them when the new one is committed: https://quarkus.io/guides/transaction.
Are they transaction failures? I guess they don't have to be, they could be exceptions that get thrown during the context of a transaction and force a rollback...
The exceptions below are pretty common.
WARN: ARJUNA012125: TwoPhaseCoordinator.beforeCompletion - failed for SynchronizationImple< 0:ffff0a590103:993d:6633f303:23, org.hibernate.resource.transaction.backend.jta.internal.synchronization.RegisteredSynchronization@60d60c7 >
org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "ukr8nr64n7i34ipp019xrbbbyeh"
Detail: Key (target_id, remoteid)=(1, 1) already exists.] [insert into ActiveRecording (continuous,duration,external,maxAge,maxSize,metadata,name,remoteId,startTime,state,target_id,toDisk,id) values (?,?,?,?,?,?,?,?,?,?,?,?,?)]
at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:95)
at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:56)
at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:108)
at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:278)
at org.hibernate.engine.jdbc.mutation.internal.AbstractMutationExecutor.performNonBatchedMutation(AbstractMutationExecutor.java:107)
at org.hibernate.engine.jdbc.mutation.internal.MutationExecutorSingleNonBatched.performNonBatchedOperations(MutationExecutorSingleNonBatched.java:40)
at org.hibernate.engine.jdbc.mutation.internal.AbstractMutationExecutor.execute(AbstractMutationExecutor.java:52)
at org.hibernate.persister.entity.mutation.InsertCoordinator.doStaticInserts(InsertCoordinator.java:171)
at org.hibernate.persister.entity.mutation.InsertCoordinator.coordinateInsert(InsertCoordinator.java:112)
at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2790)
at org.hibernate.action.internal.EntityInsertAction.execute(EntityInsertAction.java:102)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:635)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:502)
at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:364)
at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:39)
at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:127)
at org.hibernate.internal.SessionImpl.doFlush(SessionImpl.java:1412)
at org.hibernate.internal.SessionImpl.managedFlush(SessionImpl.java:485)
at org.hibernate.internal.SessionImpl.flushBeforeTransactionCompletion(SessionImpl.java:2301)
at org.hibernate.internal.SessionImpl.beforeTransactionCompletion(SessionImpl.java:1966)
at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.beforeTransactionCompletion(JdbcCoordinatorImpl.java:439)
at org.hibernate.resource.transaction.backend.jta.internal.JtaTransactionCoordinatorImpl.beforeCompletion(JtaTransactionCoordinatorImpl.java:336)
at org.hibernate.resource.transaction.backend.jta.internal.synchronization.SynchronizationCallbackCoordinatorNonTrackingImpl.beforeCompletion(SynchronizationCallbackCoordinatorNonTrackingImpl.java:47)
at org.hibernate.resource.transaction.backend.jta.internal.synchronization.RegisteredSynchronization.beforeCompletion(RegisteredSynchronization.java:37)
at com.arjuna.ats.internal.jta.resources.arjunacore.SynchronizationImple.beforeCompletion(SynchronizationImple.java:76)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.beforeCompletion(TwoPhaseCoordinator.java:360)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:91)
at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
at io.quarkus.narayana.jta.runtime.NotifyingTransactionManager.commit(NotifyingTransactionManager.java:70)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.endTransaction(TransactionalInterceptorBase.java:406)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:175)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:107)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.doIntercept(TransactionalInterceptorRequired.java:38)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.intercept(TransactionalInterceptorBase.java:61)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.intercept(TransactionalInterceptorRequired.java:32)
at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired_Bean.intercept(Unknown Source)
at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:27)
at io.cryostat.targets.Target_Listener_Subclass.onMessage(Unknown Source)
at io.cryostat.targets.Target_Listener_ClientProxy.onMessage(Unknown Source)
at io.cryostat.targets.Target_Listener_VertxInvoker_onMessage_b83c115f51840ecd28be411c7e7d10c81b79ab6d.invokeBean(Unknown Source)
at io.quarkus.vertx.runtime.EventConsumerInvoker.invoke(EventConsumerInvoker.java:41)
at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:116)
at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:112)
at io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:180)
at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:277)
at io.vertx.core.impl.ContextBase.lambda$internalExecuteBlocking$2(ContextBase.java:199)
at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)
at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1512)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "ukr8nr64n7i34ipp019xrbbbyeh"
Detail: Key (target_id, remoteid)=(1, 1) already exists.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2713)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2401)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:152)
at io.agroal.pool.wrapper.PreparedStatementWrapper.executeUpdate(PreparedStatementWrapper.java:88)
at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:275)
... 53 more
I was wondering about that too. So should we use requireNew? I read in the docs that it will suspend any existing transactions and continue them when the new one is committed
Worth trying, I think. I would just be careful about anywhere that will create nested transactions since this will not flatten them into a shared context anymore, so there will be additional overhead. On the plus side, it should also help ensure that individual transaction failures (individual nodes not being persistable) are isolated from others - ex. a target node (Podman container) failing to persist should not block its parent (Podman Realm) node from being persisted.
The exceptions below are pretty common.
Are these happening in the background or are you performing some particular task when you see them? And do they occur on main or just this PR?
Worth trying, I think. I would just be careful about anywhere that will create nested transactions since this will not flatten them into a shared context anymore, so there will be additional overhead. On the plus side, it should also help ensure that individual transaction failures (individual nodes not being persistable) are isolated from others - ex. a target node (Podman container) failing to persist should not block its parent (Podman Realm) node from being persisted.
Sounds good! I am trying now...
Are these happening in the background or are you performing some particular task when you see them? And do they occur on main or just this PR?
These occur in the background though. And yeah, I just ran off the latest image on main and see the same thing.
These occur in the background though. And yeah, I just ran off the latest image on main and see the same thing.
Ahh, is this when you are doing smoketest.bash -OV still? If so then I think that makes sense - ActiveRecordings are supposed to have a unique composite key formed by <targetId, recording.remoteId>. When Cryostat discovers a target it tries to connect to it to scan what active recordings it already has, which could have come from e.g. -XX:StartFlightRecording.
In this case, the targetId is probably going to end up the same, and the remoteId is also going to end up the same since the target JVM just numbers those sequentially: the first recording started always has ID 1, and the next time we stop and restart the smoketest, the JVM will again begin with ID 1...
Hmm. I think there are probably more things like this that are broken for similar reasons.
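The collision described above can be illustrated with a plain-Java stand-in for the unique constraint (hypothetical names, not the real entity model): ActiveRecording is unique on (targetId, remoteId), and a restarted target JVM hands out remoteIds starting from 1 again, so re-discovering the same target re-inserts the same pair.

```java
import java.util.HashSet;
import java.util.Set;

public class CompositeKeyCollision {
    // Stands in for the composite unique constraint on ActiveRecording.
    record Key(long targetId, long remoteId) {}

    public static void main(String[] args) {
        Set<Key> activeRecordings = new HashSet<>();
        // First smoketest run: target 1 reports a recording with remoteId 1.
        System.out.println(activeRecordings.add(new Key(1, 1)));
        // After a restart that keeps the database: the JVM numbers its
        // recordings from 1 again, so the same (1, 1) pair is re-inserted.
        System.out.println(activeRecordings.add(new Key(1, 1)));
    }
}
```

The second `add` returning false is the in-memory analogue of the duplicate key violation in the logs.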
Ahh is this when you are doing smoketest.bash -OV still? If so then I think that makes sense - ActiveRecordings are supposed to have a unique composite key formed by the <targetId, recording.remoteId>. When Cryostat discovers a target it tries to connect to it to scan what active recordings it already has, which could have come from ex. -XX:StartFlightRecording.
Oh yes, I have been running with bash smoketest.bash -OV. Now that explains why.
Maybe that means that within these discovery classes we shouldn't use joiningExisting() and instead use requiringNew()? I wouldn't think that these discovery events and the other active recording operations are happening on the same threads or in the same transaction contexts, but I don't know enough to say they definitely aren't. But I also don't think that will have the same nice "nested operation flattening" semantics, so we would need to be careful about that
Ah, by the way, creating a new transaction for the handler does not yet solve the issue with org.hibernate.LazyInitializationException...
Oh yes I have been running with bash smoketest.bash -OV. Now that explains why.
That's good to know then. I guess the database handling is not yet mature enough to actually be able to reuse a database from a previous run, so it should still be set to drop-and-create. Unfortunately this does mean that things like stored credentials and automated rules will also get lost every time the Cryostat container is restarted, so it would be good to get this fixed before release.
Had to retry a couple of times to ensure it's not just a fluke. I tried with bash smoketest.bash (removing all volumes before that) and am still seeing the issue. The target_id seems to be Cryostat itself (service:jmx:rmi:///jndi/rmi://localhost:0/jmxrmi).
ERROR: ERROR: duplicate key value violates unique constraint "ukr8nr64n7i34ipp019xrbbbyeh"
Detail: Key (target_id, remoteid)=(2, 1) already exists.
Could you try either editing the cryostat.yml or setting environment variables to disable the JDP discovery? This might also happen because Cryostat discovers itself both by JDP and by ContainerDiscovery.
Oh yes I have been running with bash smoketest.bash -OV. Now that explains why.
That's good to know then. I guess the database handling is not yet mature enough to actually be able to reuse a database from a previous run, so it should still be set to drop-and-create. Unfortunately this does mean that things like stored credentials and automated rules will also get lost every time the Cryostat container is restarted, so it would be good to get this fixed before release.
Hmm, that being said, I have just tried with just bash smoketest.bash and am still seeing the same errors. So it might be more than just the existing volume?
^ would be #71
Could you try either editing the cryostat.yml or setting environment variables to disable the JDP discovery? This might also happen because Cryostat discovers itself both by JDP and by ContainerDiscovery.
~Actually, I forgot to mention that I also tried with CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash and still saw the error. I have tried quite a few times now. Though, at times, this error does not occur (very much intermittent). Related to #415?~
Never mind! You are right! It only happened with both JDP and ContainerDiscovery enabled. After running a couple more times, I am not seeing it anymore! Not sure what happened: it could just be a typo in the env var, or I got lost when switching around. Sorry haha :D
EDITED: Can also confirm that the k8s deployment sees no such error.
Okay, that issue aside, I think I fixed the transaction problem? The detection and pruning work fine now :D
It was a rather silly mistake. Because I fetched the target in the previous transaction (handleObservedContainers) and passed it in the event message (ContainerDiscoveryEvent), by the time of deletion in the event consumer that target was already outdated. Fetching it again using the connectUrl solved the issue, as in the commit :D
Though, discovery is noticeably slower than before. I guess that should be fine, as we discussed above :D
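The fix described here could be sketched as follows in plain Java, with a map standing in for the Target table (all names are illustrative): instead of carrying an entity loaded in an earlier transaction inside the event message, where it goes stale, the consumer carries only the connectUrl and re-fetches within its own transaction before deleting.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class RefetchOnDelete {
    // Stands in for the Target table, keyed by connectUrl.
    private final Map<String, String> targets = new ConcurrentHashMap<>();

    void persist(String connectUrl, String alias) {
        targets.put(connectUrl, alias);
    }

    // Event consumer: re-fetch by connectUrl inside the current "transaction"
    // rather than trusting an entity loaded in some earlier transaction.
    boolean handleLost(String connectUrl) {
        Optional<String> fresh = Optional.ofNullable(targets.get(connectUrl));
        return fresh.map(alias -> targets.remove(connectUrl) != null)
                    .orElse(false); // already gone: nothing to prune
    }

    public static void main(String[] args) {
        RefetchOnDelete d = new RefetchOnDelete();
        d.persist("service:jmx:rmi:///jndi/rmi://a:9091/jmxrmi", "vertx-fib-demo");
        System.out.println(d.handleLost("service:jmx:rmi:///jndi/rmi://a:9091/jmxrmi"));
        System.out.println(d.handleLost("service:jmx:rmi:///jndi/rmi://a:9091/jmxrmi"));
    }
}
```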
/build_test
Workflow started at 5/3/2024, 12:26:09 PM. View Actions Run.
No OpenAPI schema changes detected.
No GraphQL schema changes detected.
CI build and push: All tests pass ✅ (JDK17) https://github.com/cryostatio/cryostat3/actions/runs/8941969954
/build_test
Workflow started at 5/3/2024, 2:37:27 PM. View Actions Run.
No GraphQL schema changes detected.
No OpenAPI schema changes detected.
CI build and push: All tests pass ✅ (JDK17) https://github.com/cryostatio/cryostat3/actions/runs/8943413897
Related to #420
Description of the change:
Motivation for the change:
See https://github.com/cryostatio/cryostat3/pull/420#issuecomment-2079879609. This fixes the problem where, when Cryostat is restarted, observed targets cause duplicate key violations and removed targets remain in the database (stale).
How to manually test