eiffel-community / eiffel-remrem-publish

eiffel-remrem-publish
https://eiffel-community.github.io/eiffel-remrem-publish
Apache License 2.0
8 stars 77 forks source link

REMReM mixes up requests when sending responses #217

Closed magnusbaeck closed 2 years ago

magnusbaeck commented 2 years ago

Description

We're using REMReM Publish (built outselves from commit 6b9e474) to send ArtC events from our builds, and it seems it at least occasionally mixes up the inbound requests and either returns the wrong id to the client or sends a 207 response with multiple events even though only a single event was passed as input. It's possible that this only happens when REMReM perceives a problem with RabbitMQ.

Let's look at two builds that send events at the same time. The purl versions are based on the Jenkins job name and number so it's easy to correlate artifacts with the build in which they were created. Builds run independently and on different hosts. For confidentiality reasons I've replaced potentially sensitive strings.

Build 1

Job name and number: Teams/aaa/master_bundle/QA/Products/bbb_Build/1539

At 22:13:49 the REMReM Publish request fails with a 207 error and the following payload:

{
    "events": [{
        "id": "bbdef2ea-7217-484c-a0b6-3988daab4084",
        "status_code": 200,
        "result": "SUCCESS",
        "message": "Event sent successfully"
    }, {
        "id": "78df719f-06a7-420f-9e25-5d1e11ce4e3d",
        "status_code": 200,
        "result": "SUCCESS",
        "message": "Event sent successfully"
    }, {
        "id": "258ff6dc-f6f1-4ef0-acbd-1ed0679d55ca",
        "status_code": 200,
        "result": "SUCCESS",
        "message": "Event sent successfully"
    }, {
        "id": "322b138c-fa9f-4c4b-93f3-3905ba2f73c6",
        "status_code": 500,
        "result": "Internal Server Error",
        "message": "RabbitMQ is down. Please try later"
    }]
}

So it claims it attempted to send four events, but that the last one failed. Looking at the calling code I have a very hard time imagining that the calling code would've passed four events. Note that the "RabbitMQ is down" error message could be misleading. As seen in #172 and in the code, potentially useful error messages are replaced with the generic "RabbitMQ is down" message.

Build 2

Job name and number: Teams/aaa/bbb/job/FT-ccc/job/Minimal_ref/job/Products/job/ddd_Build/8036

At 22:13:49 the REMReM Publish request succeeds and it reports that the id of the sent event was bbdef2ea-7217-484c-a0b6-3988daab4084. Only the first returned event is checked so I can't rule out that multiple events where included in the successful response.

Analysis

Note that the id in the response for build 2, bbdef2ea-7217-484c-a0b6-3988daab4084, is the same as the first event in the build 1 response.

Looking into what actually was published on the bus by looking in the ER things get even weirder. Here are the three events that were reported as successful in the build 1 response:

meta.id data.identity
258ff6dc-f6f1-4ef0-acbd-1ed0679d55ca pkg:aaa-fib-jenkins/bbb-4ch@Teams-ccc-ddd-FT-eee-Minimal_ref-Products-fff_Build-8036
78df719f-06a7-420f-9e25-5d1e11ce4e3d pkg:aaa-fib-jenkins/bbb@Teams-ccc-ddd-FT-eee-Minimal_ref-Products-fff_Build-1521
bbdef2ea-7217-484c-a0b6-3988daab4084 pkg:aaa-fib-jenkins/bbb@Teams-ccc-ddd-FT-eee-Minimal_ref-Products-fff_Build-8239

So the id that was returned in build 2 is actually from a third build. Looking in the log of that build it also received bbdef2ea-7217-484c-a0b6-3988daab4084 as the event id of the event it posted (at the same time as the other builds, i.e. at 22:13:49), so at least that build got the correct result.

This suggests that there's corruption in some REMReM data structure that isn't thread safe.

Motivation

The evidence above suggests a bug that clearly should be addressed.

Exemplification

See above.

Benefits

Obvious.

Possible Drawbacks

None.

m-linner-ericsson commented 2 years ago

https://github.com/eiffel-community/eiffel-remrem-publish/pull/218 might solve this

zburswa commented 2 years ago

218 might solve this

https://github.com/eiffel-community/eiffel-remrem-publish/pull/219 fixes the issue. Changes are in review.

magnusbaeck commented 2 years ago

Closing since this is ostensibly fixed in aforementioned PR.