Closed magnusbaeck closed 2 years ago
https://github.com/eiffel-community/eiffel-remrem-publish/pull/218 might solve this
218 might solve this
https://github.com/eiffel-community/eiffel-remrem-publish/pull/219 fixes the issue. Changes are in review.
Closing since this is ostensibly fixed in aforementioned PR.
Description
We're using REMReM Publish (built outselves from commit 6b9e474) to send ArtC events from our builds, and it seems it at least occasionally mixes up the inbound requests and either returns the wrong id to the client or sends a 207 response with multiple events even though only a single event was passed as input. It's possible that this only happens when REMReM perceives a problem with RabbitMQ.
Let's look at two builds that send events at the same time. The purl versions are based on the Jenkins job name and number so it's easy to correlate artifacts with the build in which they were created. Builds run independently and on different hosts. For confidentiality reasons I've replaced potentially sensitive strings.
Build 1
Job name and number: Teams/aaa/master_bundle/QA/Products/bbb_Build/1539
At 22:13:49 the REMReM Publish request fails with a 207 error and the following payload:
So it claims it attempted to send four events, but that the last one failed. Looking at the calling code I have a very hard time imagining that the calling code would've passed four events. Note that the "RabbitMQ is down" error message could be misleading. As seen in #172 and in the code, potentially useful error messages are replaced with the generic "RabbitMQ is down" message.
Build 2
Job name and number: Teams/aaa/bbb/job/FT-ccc/job/Minimal_ref/job/Products/job/ddd_Build/8036
At 22:13:49 the REMReM Publish request succeeds and it reports that the id of the sent event was bbdef2ea-7217-484c-a0b6-3988daab4084. Only the first returned event is checked so I can't rule out that multiple events where included in the successful response.
Analysis
Note that the id in the response for build 2, bbdef2ea-7217-484c-a0b6-3988daab4084, is the same as the first event in the build 1 response.
Looking into what actually was published on the bus by looking in the ER things get even weirder. Here are the three events that were reported as successful in the build 1 response:
meta.id
data.identity
So the id that was returned in build 2 is actually from a third build. Looking in the log of that build it also received bbdef2ea-7217-484c-a0b6-3988daab4084 as the event id of the event it posted (at the same time as the other builds, i.e. at 22:13:49), so at least that build got the correct result.
This suggests that there's corruption in some REMReM data structure that isn't thread safe.
Motivation
The evidence above suggests a bug that clearly should be addressed.
Exemplification
See above.
Benefits
Obvious.
Possible Drawbacks
None.