cryostatio / cryostat

Secure JDK Flight Recorder management for containerized JVMs
https://cryostat.io

fix(discovery): observed containers should be checked with persisted nodes #423

Closed tthvo closed 2 months ago

tthvo commented 2 months ago

Welcome to Cryostat3! 👋

Before contributing, make sure you have:

To recreate commits with GPG signature: git fetch upstream && git rebase --force --gpg-sign upstream/main


Related to #420

Description of the change:

Motivation for the change:

See https://github.com/cryostatio/cryostat3/pull/420#issuecomment-2079879609. This fixes the problem where, when Cryostat is restarted, previously observed targets cause duplicate key violations and targets removed while Cryostat was down remain stale in the database.

How to manually test

mvn clean package -DskipTests=true
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -O
andrewazores commented 2 months ago

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

tthvo commented 2 months ago

~Discovery works fine with podman while checks are done against the database~ Though, I have run into these 2 issues that I need some advice on:

tthvo commented 2 months ago

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

Ahh I was about to ask! Thanks I will try that!

tthvo commented 2 months ago

Ahh, I am seeing an odd violation during startup:

ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception
compose-cryostat-1        | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key"
compose-cryostat-1        |   Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]
tthvo commented 2 months ago

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

Ahh I was about to ask! Thanks I will try that!

Looks good to me!! It was very helpful! I tried with these steps to test this PR:

# Initial setup
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

podman pod create --replace --name cryostat-pod

podman run \
        --name jmxquarkus \
        --pod cryostat-pod \
        --label io.cryostat.discovery="true" \
        --label io.cryostat.jmxPort="51423" \
        --env QUARKUS_HTTP_PORT=10012 \
        --rm -d quay.io/roberttoyonaga/jmx:jmxquarkus@sha256:b067f29faa91312d20d43c55d194a2e076de7d0d094da3d43ee7d2b2b5a6f100

podman run \
        --name vertx-fib-demo-0 \
        --env HTTP_PORT=8079 \
        --env JMX_PORT=9089 \
        --env START_DELAY=60 \
        --pod cryostat-pod \
        --label io.cryostat.discovery="true" \
        --label io.cryostat.jmxHost="vertx-fib-demo-0" \
        --label io.cryostat.jmxPort="9089" \
        --rm -d quay.io/andrewazores/vertx-fib-demo:0.13.1

# Kill containers and restart
# The discovery tree should remain the same
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

# Now, kill the containers again
# Remove the pod
podman pod rm cryostat-pod -f

# Start again
# The pod and its containers should be pruned
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

Ahh, I am seeing an odd violation during startup:

ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception
compose-cryostat-1        | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key"
compose-cryostat-1        |   Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]

Okay, this should be fixed now thanks to the synchronized keyword. I am not sure what happened there, but it seems related to threading during the transaction. Otherwise, I just have the above questions: https://github.com/cryostatio/cryostat3/pull/423#issuecomment-2080168900

andrewazores commented 2 months ago

There are some conflicts in connectUrl between podman and JDP during the compose smoketest, for example with jfr-datasource. I think this happens on main too.

Yea, that was semi-intentional back when I first put all that together just so I could test that Cryostat/the database would properly reject the definitions due to the duplicate URLs. I think by now that's already well-proven, so we can change it to do something different - give the containers different hostnames to be referenced by, or just turn off one of the discovery mechanisms in the smoketest by default so they don't overlap/collide.

The approach I used here (same as KubeAPIDiscovery) is to run a single transaction for all discovered targets. If one fails on startup (i.e. a duplicate key), the entire transaction is aborted and the realm remains empty. Is this desirable? Or should we use one transaction per target (what about performance)?

It's a good question. I think from a user perspective it would be much more preferable that it works a bit more slowly but is able to discover as many of the targets as possible, vs being very fast but if it runs into any errors then leaving me with no discovered targets at all.

Anyway, if we do determine there are performance issues due to database accesses and the associated latency, Quarkus/Hibernate/JPA has a lot of powerful caching capabilities:

https://quarkus.io/guides/hibernate-orm#caching

so it seems like some judicious application of the @Cacheable annotation would probably help solve that, if we need it to be solved.
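As a hedged illustration of that caching hint (the entity name and fields here are hypothetical, not Cryostat's actual schema), marking an entity cacheable per the linked Quarkus guide might look like:

```java
import jakarta.persistence.Cacheable;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

// Hypothetical entity showing the Hibernate second-level cache hint from the
// guide above: repeated lookups of the same row can then be served from the
// cache instead of hitting Postgres each time.
@Entity
@Cacheable
public class CachedNode {
    @Id
    public Long id;

    public String name;
}
```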

andrewazores commented 2 months ago

Okay this should be fixed now thanks to the synchronized keyword. Not sure what happens there but something related to threading during transaction. Otherwise, I just have the above questions: https://github.com/cryostatio/cryostat3/pull/423#issuecomment-2080168900

This means that if there are many worker threads invoked by the informer, all those threads will get blocked waiting for the lock to perform their database transaction, which will be relatively slow because they're waiting for database I/O.

I suggest another approach: have the callback (invoked by worker threads?) just add the event to a blocking queue for processing, and have a dedicated thread that takes items from the queue as they become available and processes them in a transaction. This way there is only one thread that is spending a lot of its time waiting for database I/O before moving on to process the next event and waiting for the database again, and the other threads are free to do other work.

It would be a similar pattern to how I wrote the WebSocket MessagingServer:

"event" (notification to be sent): https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L144

task queue processor: https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L95


It is probably also possible to achieve this using the EventBus alone without an explicit ExecutorService, with something like this pseudocode:

class ContainerDiscovery {

  onStart() {
    informer.listen(evt -> bus.publish(ContainerDiscovery.class.getName(), evt));
  }

  @ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
  @Transactional
  void onEvent(ContainerEvent evt) {
    ... // process the event, add/remove from database
  }
}
tthvo commented 2 months ago

Sounds good to me! I just need to look into it a little deeper, so I am marking this a draft for now. I suppose we also want to apply these changes to kubeAPI as well?

andrewazores commented 2 months ago

I think that makes sense.

tthvo commented 2 months ago

/build_test

github-actions[bot] commented 2 months ago

Workflow started at 5/2/2024, 12:57:34 AM. View Actions Run.

github-actions[bot] commented 2 months ago

No OpenAPI schema changes detected.

github-actions[bot] commented 2 months ago

No GraphQL schema changes detected.

github-actions[bot] commented 2 months ago

CI build and push: All tests pass ✅ (JDK17) https://github.com/cryostatio/cryostat3/actions/runs/8918817387

tthvo commented 2 months ago

@ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
@Transactional
void onEvent(ContainerEvent evt) {
  ... // process the event, add/remove from database
}

Unfortunately, for this case, we still need ordered = true as it ensures the pod node is persisted only once. Otherwise, multiple discovery FOUND event messages will persist the pod node multiple times, which causes the same issue as #420. I guess its performance is now similar to the synchronized keyword?

i.e. if you don't specify value = "pkg.Class" in the @ConsumeEvent, it defaults to the class name. Then you can just publish to the class name.

Phew, it took me a while to realize that the @ConsumeEvent must be specified on the subclass, so I had to do a workaround to allow the subclass to specify its own address. This also helps with separating podman and docker discovery events.

I think this should be working now? The changes are such that:

But it seems that when using bash smoketest.bash -OV, there are initially no LOST messages emitted from ContainerDiscovery, yet the target list appears empty at first. Not sure what other mechanisms are pruning targets at first...
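The subclass-address workaround described above might look roughly like this (a hedged sketch; the method names and wiring are illustrative, not the PR's actual code):

```java
import io.quarkus.vertx.ConsumeEvent;
import jakarta.transaction.Transactional;

// Illustrative sketch only: each concrete discovery subclass declares its own
// @ConsumeEvent address (since the annotation must be on the subclass) and
// delegates to shared logic in the abstract base class. This also keeps the
// podman and docker discovery event streams separate.
abstract class ContainerDiscovery {
    protected void handleContainerEvent(Object evt) {
        // shared add/remove-from-database logic
    }
}

class PodmanDiscovery extends ContainerDiscovery {
    @ConsumeEvent(value = "io.cryostat.discovery.PodmanDiscovery", blocking = true, ordered = true)
    @Transactional
    void onEvent(Object evt) {
        handleContainerEvent(evt);
    }
}
```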

tthvo commented 2 months ago

Another thing I notice with this eventBus approach is that the same message might be emitted multiple times. I think it's because eventBus.publish is called within a transaction, so if the current one is retried, the same message will be sent again. Though, this PR also handles that :D

tthvo commented 2 months ago

Ah actually, I am running into a bit of a problem when targets are being deleted. However, deletion on start-up works just fine. Any ideas? Maybe the List<DiscoveryNode> children needs to be eagerly initialized?

ERROR: Uncaught exception received by Vert.x
org.hibernate.LazyInitializationException: Unable to perform requested lazy initialization [io.cryostat.discovery.DiscoveryNode.children] - session is closed and settings disallow loading outside the Session
    at org.hibernate.bytecode.enhance.spi.interceptor.EnhancementHelper.throwLazyInitializationException(EnhancementHelper.java:275)
    at org.hibernate.bytecode.enhance.spi.interceptor.EnhancementHelper.performWork(EnhancementHelper.java:174)
    at org.hibernate.bytecode.enhance.spi.interceptor.EnhancementAsProxyLazinessInterceptor.handleRead(EnhancementAsProxyLazinessInterceptor.java:102)
    at org.hibernate.bytecode.enhance.spi.interceptor.AbstractInterceptor.readObject(AbstractInterceptor.java:152)
    at io.cryostat.discovery.DiscoveryNode.$$_hibernate_read_children(DiscoveryNode.java)
    at io.cryostat.discovery.DiscoveryNode.getChildren(DiscoveryNode.java)
    at io.cryostat.discovery.ContainerDiscovery.updateDiscoveryTree(ContainerDiscovery.java:489)
    at io.cryostat.discovery.PodmanDiscovery.handleContainerEvent(ContainerDiscovery.java:109)
    at io.cryostat.discovery.PodmanDiscovery_Subclass.handleContainerEvent$$superforward(Unknown Source)
    at io.cryostat.discovery.PodmanDiscovery_Subclass$$function$$1.apply(Unknown Source)
    at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:73)
    at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:62)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:136)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:107)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.doIntercept(TransactionalInterceptorRequired.java:38)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.intercept(TransactionalInterceptorBase.java:61)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.intercept(TransactionalInterceptorRequired.java:32)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired_Bean.intercept(Unknown Source)
    at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
    at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
    at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:27)
    at io.cryostat.discovery.PodmanDiscovery_Subclass.handleContainerEvent(Unknown Source)
    at io.cryostat.discovery.PodmanDiscovery_ClientProxy.handleContainerEvent(Unknown Source)
    at io.cryostat.discovery.PodmanDiscovery_VertxInvoker_handleContainerEvent_151fd884ccd88ca4e8217a3be771267e67b76ed9.invokeBean(Unknown Source)
    at io.quarkus.vertx.runtime.EventConsumerInvoker.invoke(EventConsumerInvoker.java:41)
    at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:116)
    at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:112)
    at io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:180)
    at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:277)
    at io.vertx.core.impl.ContextBase.lambda$internalExecuteBlocking$2(ContextBase.java:199)
    at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:77)
    at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)
    at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
    at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
    at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
andrewazores commented 2 months ago

I guess its performance now is similar to synchronized keyword?

In terms of only allowing one event to be processed at a time, yes I think so. But I think it is still leaving more threads from the worker pool free to perform other work, rather than waiting for the lock to free so they can do the work, so it should perform better in particular if there are multiple CPUs available to actually do things in parallel.

andrewazores commented 2 months ago

Another thing I notice with this eventBus approach is that the same message might be emitted multiple times. I think its because eventBus.publish is called within a transaction so if current one is retried, the same message will be sent again.

That's interesting. What causes the transaction to have to be retried... ? That might be more telling as a root cause of some underlying problem. Are transactions frequently failing because of some constraint violations?

andrewazores commented 2 months ago

But it seems that when using bash smoketest.bash -OV, there are initially no LOST messages emitted from ContainerDiscovery, yet the target list appears empty at first. Not sure what other mechanisms are pruning targets at first...

I think this is probably because of this smoketest configuration:

https://github.com/cryostatio/cryostat3/blob/8c32501947376a3d87fdf0965bf85f831b8becba/compose/db.yml#L5

The usual behaviour of the smoketest is to start from scratch completely, so this line actually resets the database even if there is data left behind in the storage volume.

https://quarkus.io/guides/hibernate-orm#dev-mode

Maybe a patch like this will help:

diff --git a/compose/db.yml b/compose/db.yml
index 8c5fc19..aea89af 100644
--- a/compose/db.yml
+++ b/compose/db.yml
@@ -2,7 +2,7 @@ version: "3"
 services:
   cryostat:
     environment:
-      QUARKUS_HIBERNATE_ORM_DATABASE_GENERATION: drop-and-create
+      QUARKUS_HIBERNATE_ORM_DATABASE_GENERATION: ${DATABASE_GENERATION:-drop-and-create}
       QUARKUS_DATASOURCE_USERNAME: cryostat3
       QUARKUS_DATASOURCE_PASSWORD: cryostat3
       QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://db:5432/cryostat3
diff --git a/smoketest.bash b/smoketest.bash
index 2c1e694..03c22b8 100755
--- a/smoketest.bash
+++ b/smoketest.bash
@@ -65,6 +65,7 @@ while getopts "hs:prGtOVXcbn" opt; do
             ;;
         V)
             KEEP_VOLUMES=true
+            DATABASE_GENERATION=update
             ;;
         X)
             FILES+=("${DIR}/compose/db-viewer.yml")
@@ -111,6 +112,7 @@ fi
 export CRYOSTAT_HTTP_HOST
 export CRYOSTAT_HTTP_PORT
 export GRAFANA_DASHBOARD_EXT_URL
+export DATABASE_GENERATION

 s3Manifest="${DIR}/compose/s3-${s3}.yml"
 STORAGE_PORT="$(yq '.services.*.expose[0]' "${s3Manifest}" | grep -v null)"
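The ${VAR:-default} expansion used in that compose patch behaves like this (plain POSIX shell illustration):

```shell
# POSIX parameter expansion as used in the compose patch:
# "${DATABASE_GENERATION:-drop-and-create}" falls back to the default
# when the variable is unset or empty, so the smoketest keeps its usual
# from-scratch behaviour unless -V explicitly sets DATABASE_GENERATION=update.
unset DATABASE_GENERATION
echo "${DATABASE_GENERATION:-drop-and-create}"   # prints drop-and-create

DATABASE_GENERATION=update
echo "${DATABASE_GENERATION:-drop-and-create}"   # prints update
```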
andrewazores commented 2 months ago

Ah actually, I am running into a bit of a problem when targets are being deleted. However, deletion on start-up works just fine. Any ideas? Maybe the List<DiscoveryNode> children needs to be eagerly initialized?

The children field is already eagerly initialized at the application layer, but this looks like the relation at the database layer and the lazy fetch is the problem. I don't think we want to eagerly fetch, because not every time we query for a particular DiscoveryNode do we also want to fetch all its child nodes - especially because this would cause a recursive eager query, so you would have to pull the node and its whole subtree out of the database every single time.

Maybe there is a missing @Transactional/QuarkusTransaction.joiningExisting()? Or maybe the thread that is actually running during the delete operation is not the same expected thread that was initially within the transaction context?

https://stackoverflow.com/questions/21574236/how-to-fix-org-hibernate-lazyinitializationexception-could-not-initialize-prox

https://www.linkedin.com/pulse/hibernate-lazy-initialization-exception-spring-sara-alhosin/
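One common shape of the fix (illustrative only; the class, method, and field names here are assumptions, not Cryostat's code) is to touch the lazy collection while the transaction, and therefore the Hibernate session, is still open:

```java
import jakarta.transaction.Transactional;

// Illustrative sketch: access the lazy children collection inside an open
// transaction so Hibernate can initialize it there, instead of after the
// session has closed (which throws LazyInitializationException).
class DiscoveryPruner {
    @Transactional
    void pruneChild(DiscoveryNode parent, DiscoveryNode child) {
        parent.getChildren().size();   // forces lazy initialization in-session
        parent.getChildren().remove(child);
        // ... persist changes before the transaction commits
    }
}
```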

tthvo commented 2 months ago

That's interesting. What causes the transaction to have to be retried... ? That might be more telling as a root cause of some underlying problem. Are transactions frequently failing because of some constraint violations?

I think there are some failures in operations with active recordings as I see in logs...

Ah actually, I am running into a bit of a problem when targets are being deleted. However, deletion on start-up works just fine. Any ideas? Maybe the List<DiscoveryNode> children needs to be eagerly initialized?

The children field is already eagerly initialized at the application layer, but this looks like the relation at the database layer and the lazy fetch is the problem. I don't think we want to eagerly fetch, because not every time we query for a particular DiscoveryNode do we also want to fetch all its child nodes - especially because this would cause a recursive eager query, so you would have to pull the node and its whole subtree out of the database every single time.

Maybe there is a missing @Transactional/QuarkusTransaction.joiningExisting()? Or maybe the thread that is actually running during the delete operation is not the same expected thread that was initially within the transaction context?

https://stackoverflow.com/questions/21574236/how-to-fix-org-hibernate-lazyinitializationexception-could-not-initialize-prox

https://www.linkedin.com/pulse/hibernate-lazy-initialization-exception-spring-sara-alhosin/

Right, that makes sense! I can try them out. It seems that this issue also occurs in #428, so it's likely the way I set it up...

andrewazores commented 2 months ago

I think there are some failures in operations with active recordings as I see in logs...

Are they transaction failures? I guess they don't have to be, they could be exceptions that get thrown during the context of a transaction and force a rollback...

Maybe that means that within these discovery classes we shouldn't use joiningExisting() and instead use requiringNew()? I wouldn't think that these discovery events and the other active recording operations are happening on the same threads or in the same transaction contexts, but I don't know enough to say they definitely aren't. But I also don't think that will have the same nice "nested operation flattening" semantics, so we would need to be careful about that.

tthvo commented 2 months ago

Maybe that means that within these discovery classes we shouldn't use joiningExisting() and instead use requiringNew()? I wouldn't think that these discovery events and the other active recording operations are happening on the same threads or in the same transaction contexts, but I don't know enough to say they definitely aren't. But I also don't think that will have the same nice "nested operation flattening" semantics, so we would need to be careful about that

I was wondering about that too. So should we use requiringNew()? I read in the docs that it will suspend any existing transaction and resume it when the new one is committed: https://quarkus.io/guides/transaction.

Are they transaction failures? I guess they don't have to be, they could be exceptions that get thrown during the context of a transaction and force a rollback...

The exceptions below are pretty common.

WARN: ARJUNA012125: TwoPhaseCoordinator.beforeCompletion - failed for SynchronizationImple< 0:ffff0a590103:993d:6633f303:23, org.hibernate.resource.transaction.backend.jta.internal.synchronization.RegisteredSynchronization@60d60c7 >
org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "ukr8nr64n7i34ipp019xrbbbyeh"
  Detail: Key (target_id, remoteid)=(1, 1) already exists.] [insert into ActiveRecording (continuous,duration,external,maxAge,maxSize,metadata,name,remoteId,startTime,state,target_id,toDisk,id) values (?,?,?,?,?,?,?,?,?,?,?,?,?)]
    at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:95)
    at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:56)
    at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:108)
    at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:278)
    at org.hibernate.engine.jdbc.mutation.internal.AbstractMutationExecutor.performNonBatchedMutation(AbstractMutationExecutor.java:107)
    at org.hibernate.engine.jdbc.mutation.internal.MutationExecutorSingleNonBatched.performNonBatchedOperations(MutationExecutorSingleNonBatched.java:40)
    at org.hibernate.engine.jdbc.mutation.internal.AbstractMutationExecutor.execute(AbstractMutationExecutor.java:52)
    at org.hibernate.persister.entity.mutation.InsertCoordinator.doStaticInserts(InsertCoordinator.java:171)
    at org.hibernate.persister.entity.mutation.InsertCoordinator.coordinateInsert(InsertCoordinator.java:112)
    at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2790)
    at org.hibernate.action.internal.EntityInsertAction.execute(EntityInsertAction.java:102)
    at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:635)
    at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:502)
    at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:364)
    at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:39)
    at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:127)
    at org.hibernate.internal.SessionImpl.doFlush(SessionImpl.java:1412)
    at org.hibernate.internal.SessionImpl.managedFlush(SessionImpl.java:485)
    at org.hibernate.internal.SessionImpl.flushBeforeTransactionCompletion(SessionImpl.java:2301)
    at org.hibernate.internal.SessionImpl.beforeTransactionCompletion(SessionImpl.java:1966)
    at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.beforeTransactionCompletion(JdbcCoordinatorImpl.java:439)
    at org.hibernate.resource.transaction.backend.jta.internal.JtaTransactionCoordinatorImpl.beforeCompletion(JtaTransactionCoordinatorImpl.java:336)
    at org.hibernate.resource.transaction.backend.jta.internal.synchronization.SynchronizationCallbackCoordinatorNonTrackingImpl.beforeCompletion(SynchronizationCallbackCoordinatorNonTrackingImpl.java:47)
    at org.hibernate.resource.transaction.backend.jta.internal.synchronization.RegisteredSynchronization.beforeCompletion(RegisteredSynchronization.java:37)
    at com.arjuna.ats.internal.jta.resources.arjunacore.SynchronizationImple.beforeCompletion(SynchronizationImple.java:76)
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.beforeCompletion(TwoPhaseCoordinator.java:360)
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:91)
    at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:162)
    at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1295)
    at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:128)
    at io.quarkus.narayana.jta.runtime.NotifyingTransactionManager.commit(NotifyingTransactionManager.java:70)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.endTransaction(TransactionalInterceptorBase.java:406)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:175)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.invokeInOurTx(TransactionalInterceptorBase.java:107)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.doIntercept(TransactionalInterceptorRequired.java:38)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.intercept(TransactionalInterceptorBase.java:61)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired.intercept(TransactionalInterceptorRequired.java:32)
    at io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorRequired_Bean.intercept(Unknown Source)
    at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
    at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
    at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:27)
    at io.cryostat.targets.Target_Listener_Subclass.onMessage(Unknown Source)
    at io.cryostat.targets.Target_Listener_ClientProxy.onMessage(Unknown Source)
    at io.cryostat.targets.Target_Listener_VertxInvoker_onMessage_b83c115f51840ecd28be411c7e7d10c81b79ab6d.invokeBean(Unknown Source)
    at io.quarkus.vertx.runtime.EventConsumerInvoker.invoke(EventConsumerInvoker.java:41)
    at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:116)
    at io.quarkus.vertx.runtime.VertxRecorder$3$1$1.handle(VertxRecorder.java:112)
    at io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:180)
    at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:277)
    at io.vertx.core.impl.ContextBase.lambda$internalExecuteBlocking$2(ContextBase.java:199)
    at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)
    at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1512)
    at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
    at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "ukr8nr64n7i34ipp019xrbbbyeh"
  Detail: Key (target_id, remoteid)=(1, 1) already exists.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2713)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2401)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
    at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:152)
    at io.agroal.pool.wrapper.PreparedStatementWrapper.executeUpdate(PreparedStatementWrapper.java:88)
    at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:275)
    ... 53 more
andrewazores commented 2 months ago

I was wondering about that too. So should we use requireNew? I read in the docs that it will suspend any existing transactions and continue them when the new one is committed

Worth trying, I think. I would just be careful about anywhere that will create nested transactions since this will not flatten them into a shared context anymore, so there will be additional overhead. On the plus side, it should also help ensure that individual transaction failures (individual nodes not being persistable) are isolated from others - ex. a target node (Podman container) failing to persist should not block its parent (Podman Realm) node from being persisted.
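A per-node requiringNew() loop might be sketched like so (hedged; persistNode, observedNodes, and the logger are placeholders, not actual Cryostat members):

```java
import io.quarkus.narayana.jta.QuarkusTransaction;

// Sketch of isolating each node in its own transaction: a constraint
// violation on one node rolls back only that node's transaction, so the
// parent realm node and sibling targets still get persisted.
class DiscoveryPersister {
    void persistAll(Iterable<DiscoveryNode> observedNodes) {
        for (DiscoveryNode node : observedNodes) {
            try {
                QuarkusTransaction.requiringNew().run(() -> persistNode(node));
            } catch (RuntimeException e) {
                logger.warn("Failed to persist node, continuing with the rest", e);
            }
        }
    }
}
```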

These below exceptions are pretty common.

Are these happening in the background or are you performing some particular task when you see them? And do they occur on main or just this PR?

tthvo commented 2 months ago

Worth trying, I think. I would just be careful about anywhere that will create nested transactions since this will not flatten them into a shared context anymore, so there will be additional overhead. On the plus side, it should also help ensure that individual transaction failures (individual nodes not being persistable) are isolated from others - ex. a target node (Podman container) failing to persist should not block its parent (Podman Realm) node from being persisted.

Sounds good! I am trying now...

Are these happening in the background or are you performing some particular task when you see them? And do they occur on main or just this PR?

These occur in the background though. And yeah, I just ran off the latest image on main and see the same thing.

andrewazores commented 2 months ago

These occur in the background, though. And yeah, I just ran off the latest image on main and see the same thing.

Ahh is this when you are doing smoketest.bash -OV still? If so then I think that makes sense - ActiveRecordings are supposed to have a unique composite key formed by the <targetId, recording.remoteId>. When Cryostat discovers a target it tries to connect to it to scan what active recordings it already has, which could have come from ex. -XX:StartFlightRecording.

In this case, the targetId is probably going to end up the same, and the remoteId is also going to end up the same since the target JVM just numbers those sequentially - the first recording started always has ID 1, and the next time we stop and restart the smoketest, the JVM will again begin with ID 1...
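The collision can be illustrated with a plain-Java stand-in for that composite unique constraint (names hypothetical; in the real setup Postgres enforces it as a unique index on `(target_id, remoteid)`):

```java
import java.util.HashSet;
import java.util.Set;

public class CompositeKeyDemo {
    // Hypothetical stand-in for the (target_id, remoteId) unique constraint.
    record RecordingKey(long targetId, long remoteId) {}

    public static void main(String[] args) {
        Set<RecordingKey> db = new HashSet<>();
        // First run: the target JVM numbers its recordings from 1.
        db.add(new RecordingKey(2, 1));
        // After a restart with a reused database volume, the same target JVM
        // again hands out remoteId 1, so the insert collides:
        boolean inserted = db.add(new RecordingKey(2, 1));
        System.out.println(inserted); // false -> duplicate key violation in Postgres
    }
}
```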

Hmm. I think there are probably more things like this that are broken for similar reasons.

tthvo commented 2 months ago

Ahh is this when you are doing smoketest.bash -OV still? If so then I think that makes sense - ActiveRecordings are supposed to have a unique composite key formed by the <targetId, recording.remoteId>. When Cryostat discovers a target it tries to connect to it to scan what active recordings it already has, which could have come from ex. -XX:StartFlightRecording.

Oh yes I have been running with bash smoketest.bash -OV. Now that explains why.

tthvo commented 2 months ago

Maybe that means that within these discovery classes we shouldn't use joiningExisting() and instead use requiringNew()? I wouldn't think that these discovery events and the other active recording operations are happening on the same threads or in the same transaction contexts, but I don't know enough to say they definitely aren't. But I also don't think that will have the same nice "nested operation flattening" semantics, so we would need to be careful about that

Ah btw, creating a new transaction for the handler does not yet solve the issue with org.hibernate.LazyInitializationException...

andrewazores commented 2 months ago

Oh yes I have been running with bash smoketest.bash -OV. Now that explains why.

That's good to know then. I guess the database handling is not yet mature enough to actually be able to reuse a database from a previous run, so it should still be set to drop-and-create. Unfortunately this does mean that things like stored credentials and automated rules will also get lost every time the Cryostat container is restarted, so it would be good to get this fixed before release.
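For reference, the drop-and-create behaviour mentioned corresponds to a Hibernate ORM setting along these lines (a sketch assuming Quarkus's `application.properties`, not taken from this repository's actual config):

```properties
# Drops and recreates the schema on every startup, so a reused Postgres
# volume is wiped rather than replayed. Surviving container restarts would
# need `update` or managed migrations instead.
quarkus.hibernate-orm.database.generation=drop-and-create
```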

tthvo commented 2 months ago

Had to retry a couple of times to ensure it's not just a fluke. I tried with bash smoketest.bash (removing all volumes beforehand) and am still seeing the issue. The target_id seems to be Cryostat itself (service:jmx:rmi:///jndi/rmi://localhost:0/jmxrmi)

ERROR: ERROR: duplicate key value violates unique constraint "ukr8nr64n7i34ipp019xrbbbyeh"
  Detail: Key (target_id, remoteid)=(2, 1) already exists.
andrewazores commented 2 months ago

Could you try either editing the cryostat.yml or setting environment variables to disable the JDP discovery? This might also happen because Cryostat discovers itself both by JDP and by ContainerDiscovery.

hmm, that being said, I have just tried with just bash smoketest.bash and am still seeing the same errors. So it might be more than just the existing volume?

andrewazores commented 2 months ago

^ would be #71

tthvo commented 2 months ago

Could you try either editing the cryostat.yml or setting environment variables to disable the JDP discovery? This might also happen because Cryostat discovers itself both by JDP and by ContainerDiscovery.

~Actually, I forgot to mention that I also tried with CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash and still saw the error. I have tried quite a few times now. Though, at times, the error does not occur (very intermittent). Related to #415?~

Never mind! You are right! It only happened with both JDP and ContainerDiscovery enabled. I just ran a couple more times and I am not seeing it anymore! Not sure what happened: it could just be a typo in the env var, or I got lost when switching around. Sorry haha :D

EDITED: Can also confirm that the k8s deployment sees no such error.

tthvo commented 2 months ago

Okay, that issue aside, I think I fixed the transaction problem? The detection and pruning work fine now :D

It was rather a silly mistake. Because I fetched the target in the previous transaction (handleObservedContainers) and passed it in the event message (ContainerDiscoveryEvent), by the time deletion happens in the event consumer, that target is already outdated. Fetching it again using the connectUrl solved the issue, as in the commit :D
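The stale-entity mistake can be sketched with a plain in-memory stand-in for the targets table (all names hypothetical; the real fix re-fetches the Target by its connectUrl inside the consumer's own transaction):

```java
import java.util.HashMap;
import java.util.Map;

public class RefetchDemo {
    // Hypothetical in-memory stand-in for the targets table, keyed by connectUrl.
    static final Map<String, String> targets = new HashMap<>();

    public static void main(String[] args) {
        String connectUrl = "service:jmx:rmi:///jndi/rmi://localhost:9091/jmxrmi"; // hypothetical
        targets.put(connectUrl, "target-v1");
        // Transaction 1 (the discovery handler) snapshots the entity...
        String stale = targets.get(connectUrl);
        // ...another transaction updates it before the event is consumed:
        targets.put(connectUrl, "target-v2");
        // Operating on the stale snapshot misses the current state;
        // re-fetching by connectUrl inside the consumer does not:
        String fresh = targets.get(connectUrl);
        System.out.println(stale + " vs " + fresh);
    }
}
```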

Though, discovery is noticeably slower than before. I guess that should be fine as we discussed above :D

tthvo commented 2 months ago

/build_test

github-actions[bot] commented 2 months ago

Workflow started at 5/3/2024, 12:26:09 PM. View Actions Run.

github-actions[bot] commented 2 months ago

No OpenAPI schema changes detected.

github-actions[bot] commented 2 months ago

No GraphQL schema changes detected.

github-actions[bot] commented 2 months ago

CI build and push: All tests pass ✅ (JDK17) https://github.com/cryostatio/cryostat3/actions/runs/8941969954

andrewazores commented 2 months ago

/build_test

github-actions[bot] commented 2 months ago

Workflow started at 5/3/2024, 2:37:27 PM. View Actions Run.

github-actions[bot] commented 2 months ago

No GraphQL schema changes detected.

github-actions[bot] commented 2 months ago

No OpenAPI schema changes detected.

github-actions[bot] commented 2 months ago

CI build and push: All tests pass ✅ (JDK17) https://github.com/cryostatio/cryostat3/actions/runs/8943413897