NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service
Other
2 stars 1 forks source link

As a developer, I want to understand why Artemis 2.35 is not working properly #174

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: Hank (Hank) Original Redmine Issue: 131786, https://vlab.noaa.gov/redmine/issues/131786 Original Date: 2024-06-25 Original Assignee: Hank


See #131672. If I use Artemis 2.34, the service appears to run properly: I can bring it up, post smoke tests, and see successes. If I use 2.35, subscribers fail to negotiate; example exception summary:

2024-06-25T13:43:45.936+0000 ERROR EvaluationStatusTracker While tracking evaluation 9uGAGVwNp0RbPcbkbk-ETiB8p-4, encountered an error in a messaging client that prevented the evaluation from succeeding. The error message is: client_id: "CODjKPSqSD14nQy37vqUdAe_sL0" time { seconds: 1719323025 nanos: 931731891 } completion_status: EVALUATION_COMPLETE_REPORTED_FAILURE status_events { status_level: ERROR event_message: "class wres.events.subscribe.SubscriberTimedOutException: Failed to negotiate all subscriptions within the timeout period of PT5M. Subscribers could not be negotiated for the following output formats: [CSV2]" }

This ticket can be resolved once we have identified the issue and fixed it,

Hank


Redmine related issue(s): 131911


epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T15:09:36Z


The release notes for 2.35 are here:

https://activemq.apache.org/components/artemis/download/release-notes-2.35.0

Looking over them,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-25T15:14:52Z


Best guess is this one:

@ARTEMIS-4811 - Upgrade Netty to 4.1.111.Final@

I think we have some (possibly legacy) overrides in our @build.gradle@ to address (legacy-version) netty cves. You could try removing these:

            // Because of CVE-2022-24823. TODO: remove when Artemis catches up
            exclude group: 'io.netty'

TOGETHER WITH these:

        // Because of CVE-2022-24823. TODO: remove when Artemis catches up
        implementation group: 'io.netty', name: 'netty-all', version: '4.1.109.Final'
</code>

I'd be (edit: slightly) surprised if it's that simple, but you never know.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T15:22:01Z


Thanks, James. I'll try to build the images in a few minutes.

Note that -dev03 migration to RHEL 8 is supposed to happen today, so I might be pulled aside by that work soon.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T15:55:49Z


There are other references to that CVE related to Redisson. Here is an example for the tasker:

1375         // Because of CVE-2022-24823. TODO: remove when Redisson catches up
1376         implementation group: 'io.netty', name: 'netty-all', version: '4.1.109.Final'

Do you propose that I remove all similar references, including for Redisson and Qpid, as well? Or just in the events broker where Artemis is referred to?

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-25T15:57:51Z


Oh, no, just Artemis. Each of these is to address the vuln in relation to a specific dependency that relies on the vulnerable netty. For Artemis, it looks like it is OBE, not the others.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T16:05:03Z


Making the change and building/unit testing now. That change is to the wres-eventsbroker @project@, which I think will require a new .zip file be built for the core, wres-vis, and wres-writing. That means I'll have to manually rebuild the images for the worker, graphics, and writing containers, which will be fun.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T16:17:34Z


Something looks fishy in the build. Its been running for 12m+ with "2 tests completed, 2 failed", and repetitions of messages similar to this:

    16:14:46.980 [Thread-15 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@52ca2652)] DEBUG org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl -- Message AMQPStandardMessage( [durable=false, messageID=396, address=BURL:topic://amq.topic/status?routingkey='status'&exclusive='false'&autodelete='true', size=284, scanningStatus=SCANNED, applicationProperties={}, messageAnnotations={x-opt-jms-dest=1, x-opt-jms-msg-type=3}, properties=Properties{messageId=ID:06aee21f-4ee8-474a-84e9-e64882e0ed7f:26:1:1-15, userId=null, to='BURL:topic://amq.topic/status?routingkey='status'&exclusive='false'&autodelete='true'', subject='null', replyTo='null', correlationId=W-tile5yGYYReUW3F9CQ--j9ahQ, contentType=application/octet-stream, contentEncoding=null, absoluteExpiryTime=null, creationTime=Tue Jun 25 16:14:46 UTC 2024, groupId='null', groupSequence=null, replyToGroupId='null'}, extraProperties = null] is not going anywhere as it didn't have a binding on address:BURL:topic://amq.topic/status?routingkey='status'&exclusive='false'&autodelete='true'
    16:14:46.980 [Thread-15 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@52ca2652)] DEBUG org.apache.activemq.artemis.core.server.impl.ServerSessionImpl -- Routing result for AMQPStandardMessage( [durable=false, messageID=396, address=BURL:topic://amq.topic/status?routingkey='status'&exclusive='false'&autodelete='true', size=284, scanningStatus=SCANNED, applicationProperties={}, messageAnnotations={x-opt-jms-dest=1, x-opt-jms-msg-type=3}, properties=Properties{messageId=ID:06aee21f-4ee8-474a-84e9-e64882e0ed7f:26:1:1-15, userId=null, to='BURL:topic://amq.topic/status?routingkey='status'&exclusive='false'&autodelete='true'', subject='null', replyTo='null', correlationId=W-tile5yGYYReUW3F9CQ--j9ahQ, contentType=application/octet-stream, contentEncoding=null, absoluteExpiryTime=null, creationTime=Tue Jun 25 16:14:46 UTC 2024, groupId='null', groupSequence=null, replyToGroupId='null'}, extraProperties = null] = NO_BINDINGS
    16:14:46.980 [Thread-15 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@52ca2652)] INFO org.apache.activemq.audit.message -- AMQ601500: User anonymous@127.0.0.1:33236 sent a message AMQPStandardMessage( [durable=false, messageID=396, address=BURL:topic://amq.topic/status?routingkey='status'&exclusive='false'&autodelete='true', size=284, scanningStatus=SCANNED, applicationProperties={}, messageAnnotations={x-opt-jms-dest=1, x-opt-jms-msg-type=3}, properties=Properties{messageId=ID:06aee21f-4ee8-474a-84e9-e64882e0ed7f:26:1:1-15, userId=null, to='BURL:topic://amq.topic/status?routingkey='status'&exclusive='false'&autodelete='true'', subject='null', replyTo='null', correlationId=W-tile5yGYYReUW3F9CQ--j9ahQ, contentType=application/octet-stream, contentEncoding=null, absoluteExpiryTime=null, creationTime=Tue Jun 25 16:14:46 UTC 2024, groupId='null', groupSequence=null, replyToGroupId='null'}, extraProperties = null], context: RoutingContextImpl(Address=null, routingType=null, PreviousAddress=null previousRoute:null, reusable=null, version=0)
    ..................................................
    , transaction: null

I have lunch incoming and then the 1pm meeting. Not to mention RHEL 8 migration. I'm going to put this on-hold and leave it assigned to myself, but may need someone else to resolve this.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T16:18:54Z


By the way, it appears to be stuck within the EvaluationMessengerTest:

> :wres-events:test > Executing test wres.events.EvaluationMessagerTest

Maybe there is an artemis reference there that I didn't account for. I'll have to check later.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T17:55:46Z


Here is an update on the unit tests:

ore.server.impl.QueueImpl -- Scanning for expires on 8e936481-0a6c-4a3a-b170-4fc0d2f6dab0 done
<=====--------> 43% EXECUTING [1h 51m 49s]
> :wres-events:test > 20 tests completed, 12 failed
> :wres-events:test > Executing test wres.events.publish.MessagePublisherTest

I canceled the tests.

Looking for feedback... 2.34 appears to work out of the box with a single change to the @wres-eventsbroker/Dockerfile@. Should I push that change or should I wait for a solution to the 2.35 issue documented here.?*

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T18:01:14Z


When I made the changes James proposed in #131786-3, I also made changes to the Artemis version noted in the @build.gradle@. For example:

        implementation('org.apache.activemq:artemis-server:2.35.0') {
</code>

So the @wres-eventsbroker@ uses 2.35. The @wres-events@ project then depends on that for JUnit testing:

        // To create an embedded broker for unit testing
        testImplementation project(':wres-eventsbroker')
</code>

It also includes usage of QPid which requires the older version of netty:

        // Because of CVE-2022-24823. TODO: remove when Qpid catches up
        implementation group: 'io.netty', name: 'netty-all', version: '4.1.109.Final'
</code>

So perhaps the 2.35 usage in @wres-eventsbroker@ is causing the same type of connection problem that I saw in staging when using 2.35 in the @Dockerfile@. Or perhaps there was a build conflict created when I modified @wres-eventsbroker@ without changing any other projects in @build.gradle@.

This is really outside my wheelhouse. Guess its good to branch out from time to time.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-25T18:15:11Z


It may be the same issue. Unit tests use an in-memory broker, which depends on the broker version in the @build.gradle@. The service depends on the version in the dockerfile because the broker is instantiated separately. The important thing with upgrading the @build.gradle@ to 2.35 is that you update all instances and also delete those pairs of netty overrides (one to remove an old version, one to add back a later version) in each context where artemis appears. Beyond that, you won't have done anything wrong and the problem is probably the same underlying problem, and seemingly not related to netty (as I said, it would've been a lucky guess).

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T18:44:09Z


Got distracted by an email from Anna. She's looking for scatter plots of observed and predicted, or at least the data. I pointed her to the pairs for the data, but hope to find an existing ticket for the plots. Once I find that ticket (if it exists), I'll return to this and scan the @build.gradle@ carefully for any Artemis references I missed.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T19:26:24Z


"artemis" is only referred to in the @wres-eventsbroker@ @project@ in the @build.gradle@. The version 2.33 is referred to for the @server@ and the @protocol@; I updated both to 2.35. My modified @wres-eventsbroker@ @project@ looks like this:

project(':wres-eventsbroker') {
    dependencies {
        implementation 'org.slf4j:slf4j-api:2.0.13'

        implementation 'org.apache.commons:commons-lang3:3.14.0'

        // Jakarta Messaging API
        implementation group: 'jakarta.jms', name: 'jakarta.jms-api', version: '3.1.0'

        // Apache ActiveMQ Artemis server
        implementation('org.apache.activemq:artemis-server:2.35.0') {
            // TODO: remove when Artemis catches up

            // Because we use slf4j, not jcl:
            exclude group: 'commons-logging', module: 'commons-logging'

            // Because we use slf4j, not jboss-logmanager:
            exclude group: 'org.jboss.logmanager', module: 'jboss-logmanager'

            // Because of CVE-2022-24823.
            //exclude group: 'io.netty'

            // Because of CVE-2023-2976, CVE-2020-8908.
            exclude group: 'com.google.guava', module: 'guava'

            // Because of CVE-2024-29131, CVE-2024-29133.
            exclude group: 'org.apache.commons', module: 'commons-configuration2'

            // Because of CVE-2024-34447, CVE-2024-29857, CVE-2024-30171, CVE-2024-30172.
            exclude group: 'org.bouncycastle', module: 'bcprov-jdk18on'
        }

        // Include later dependency versions for the excluded dependencies above
        implementation 'com.google.guava:guava:33.2.0-jre'

        // Include later dependency version
        implementation group: 'org.apache.commons', name: 'commons-configuration2', version: '2.10.1'

        // Include later dependency version
        implementation group: 'org.bouncycastle', name: 'bcprov-jdk18on', version: '1.78.1'

        // AMQP 1.0 protocol support for the ActiveMQ Artemis server
        implementation('org.apache.activemq:artemis-amqp-protocol:2.35.0') {
            // Because we use slf4j, not jcl:
            exclude group: 'commons-logging', module: 'commons-logging'

            // Because we use slf4j, not jboss-logmanager:
            exclude group: 'org.jboss.logmanager', module: 'jboss-logmanager'

            // Because of CVE-2022-24823. TODO: remove when Artemis catches up
            //exclude group: 'io.netty'
        }

        // Because of CVE-2022-24823. TODO: remove when Artemis catches up
        //implementation group: 'io.netty', name: 'netty-all', version: '4.1.109.Final'

        // JCIP annotations
        compileOnly group: 'net.jcip', name: 'jcip-annotations', version: '1.0'

        runtimeOnly('ch.qos.logback:logback-classic:1.5.6') {
            // Not used at runtime, bloat
            exclude group: 'edu.washington.cs.types.checker', module: 'checker-framework'
        }

        // Because artemis uses apache commons logging, bridge to slf4j
        runtimeOnly 'org.slf4j:jcl-over-slf4j:2.1.0-alpha1'

        // Qpid AMQP 1.0 Jakarta Messaging API client to test interaction with the embedded broker
        testImplementation('org.apache.qpid:qpid-jms-client:2.5.0') {
            // Because of CVE-2022-24823. TODO: remove when Qpid catches up
            exclude group: 'io.netty'
        }

        // JUnit 5 API and runtime
        testImplementation 'org.junit.jupiter:junit-jupiter-api:5.10.2'
        testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.10.2'
    }

    // Facilitate unit tests with JUnit
    test {
        useJUnitPlatform()
    }

}
</code>

I made no other changes to the @build.gradle@. With that change, the unit tests for @wres-events@ runs for a very long time with failures reported.

Per James's comment, I think this is the manifestation of the same problem I see in staging using 2.35 in the @Dockerfile@. So I'm hopefully if we can fix the unit tests, we can fix the service.

I'm going to test moving to 2.34 to see if that allows for the unit tests to complete (it allows for staging to work).

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-25T19:29:32Z


Right. In this context, running for a very long time will mean running through the connection retries for each unit test, most likely. The failure of the core app after no word from a subscriber in PT5M is probably not in play for the unit tests, although I would need to check. Anyway, I predict the unit tests will pass with 2.34 and it will be the same issue, but a slightly different manifestation w/r to the timeouts.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T19:31:28Z


The 2.34 change would be to only two lines in the @build.gradle@. Since the @netty@ change James spotted was in 2.35, I need to continue to exclude @io.netty@ and forcibly set the version used to 4.1.109.

I reverted my changes for 2.35, made the changes for 2.34 (again, only two lines), and ran the build. The unit tests are on-going. If they succeed, then I plan to push my change to @build.gradle@ and @wres-eventsbroker/Dockerfile@ so that we can deploy what we have using @dockerize.sh@ without issue. However, I don't know how long 2.34 will be available given 2.35 is already out there.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-25T19:33:13Z


I am pretty sure you can leave as-is (overrides removed) because 2.34 uses netty 4.1.109.Final out of the box.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-06-25T19:33:55Z


Remember seeing that in the release notes for 2.34.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T19:34:34Z


James wrote:

I am pretty sure you can leave as-is (overrides removed) because 2.34 uses netty 4.1.109.Final out of the box.

Okay. I'll try that when the current set of tests complete.

I hope to get the 2.34 changes pushed today, but time is running out.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T19:44:23Z


First set of unit tests pass.

I'm now removing the netty lines and testing again,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2024-06-25T19:53:14Z


Tests passed. I've pushed the 2.34 changes in commit:1f09e46d8f73ce0bd5dc08c9197b8b2d93bebe3d. That should allow us to deploy the events broker until they decide to remove the 2.34 version.

Changes for 2.35 will have to come later, perhaps when we update dependencies for the next deployment.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-07-02T13:44:30Z


Not addressed in 6.25