amqp / rhea

A reactive messaging library based on the AMQP protocol
Apache License 2.0
280 stars 79 forks source link

RangeError [ERR_OUT_OF_RANGE] after longer operation times #168

Closed sh-developers closed 5 years ago

sh-developers commented 5 years ago

Background

We implement request/response communication pattern via AMQP and use Qpid (C++ version) as the amqp broker. We ported our applications from amqp10 to rhea because when we moved from node 8 to node 10, we started getting buffer write errors (very similar to the one I'm about to show you).

We create our receiver queues via the same connection using the QMF2 framework (which is probably not relevant, because the error only occurs on sender connections, for which the queue is created by other services).

Description

The errors we have encountered with rhea only start showing up after the application has run for a while (in this case a day on a low traffic test server). These errors start with the number 256 and start counting up from there (281 in the example). We have only observed this error occurring when we are sending to an already existing queue.

Example

[RangeError [ERR_OUT_OF_RANGE]] "The value of "value" is out of range. It must be >= 0 and <= 255. Received 281" RangeError [ERR_OUT_OF_RANGE]: The value of "value" is out of range. It must be >= 0 and <= 255. Received 281
    at writeU_Int8 (internal/buffer.js:559:11)
    at Buffer.writeUInt8 (internal/buffer.js:569:10)
    at types.Writer.write_uint (/app/node_modules/rhea/lib/types.js:725:28)
    at types.Writer.write_variable_width (/app/node_modules/rhea/lib/types.js:751:10)
    at types.Writer.write_value (/app/node_modules/rhea/lib/types.js:789:14)
    at types.Writer.write (/app/node_modules/rhea/lib/types.js:779:14)
    at types.Writer.write_compound (/app/node_modules/rhea/lib/types.js:814:18)
    at types.Writer.write_value (/app/node_modules/rhea/lib/types.js:791:14)
    at types.Writer.write (/app/node_modules/rhea/lib/types.js:779:14)
    at types.Writer.write_compound (/app/node_modules/rhea/lib/types.js:814:18)

Further observations

Because the number always counts up, we think that the number being written might be a counter of some sorts? We use a correlationId consisting of a string (ULID-identifier), so it cant be that. All other information being non-counting, the only other number we send ourselves is the userId number, which is always the same on this environment, and is always observed to be in the tens not in the hunderds (and counting up).

If you need any further information, please tell us.

grs commented 5 years ago

It looks like it is a string or similar that is too large for the encoding used. The value complained about looks like it is the size, so e.g. a str8 or perhaps a vbin8 that is too large. However there are alternate encodings for string and binary types that I would expect to get used. Are you explicitly wrapping any values in your message with a particular type?

Are you able to create a simple reproducer? Or perhaps get some protocol traces from a run of the application to see if there is any trend in the data?

sh-developers commented 5 years ago

I noticed we just send the entire body as a simple Javascript object. As a fix I have tried to send the body as a JSON-encoded string to see if that fixes it.

grs commented 5 years ago

Any update on this?

sh-developers commented 5 years ago

Solution

Problems seem to be solved when serializing the body as JSON before placing a message on a queue. (and deserializing after)

Prevention

I would recommend adding some notice/warning/recommendation to serialize any body content before placing on the bus, or what the limitations of the AMQP serialization formats are. (Evidently we did not realize the limitations of the format when reading either the documentation of amqp10 or rhea.) For example that with larger messages, serialization of the body is better done with a dedicated message format like messagepack/JSON/...

If you would like, I could formulate something for the readme, but someone that actually implemented the encoding/formatting code would be better suited for this.

This ticket is solved for us now, thank you for your help and attention!

grs commented 5 years ago

Thanks for the update. Serialising javascript objects should not cause this error, so potentially there is still a bug. I can't see anything obvious in going through the code though.

MattOakley commented 5 years ago

We're getting this issue as well.

I think the number is the link_credit on the frame object called in session.js on the output method.

For e.g. we get "The value of "value" is out of range. It must be >= 0 and <= 255. Received -3" with a link_credit of -3 The frame:

delivery_count:951
drain:undefined
echo:undefined
handle:0
incoming_window:1866
link_credit:-3
next_incoming_id:951
next_outgoing_id:0
outgoing_window:4294967295
properties:undefined
grs commented 5 years ago

@MattOakley this is for a receiving link? (i.e. first thing created on the session was a receiver not a sender). It is certainly a bug for the link credit to go negative. Do you have a simple test or reproducer to isolate this? Is it possible you ever call receiver.add_credit()/flow() with a negative number?

grs commented 5 years ago

@MattOakley I pushed a check in the one place I see where the credit go in theory go negative (https://github.com/amqp/rhea/commit/230e71767e25991f02cb564bf656af1faa8a48d5); does that impact your use case?

MattOakley commented 5 years ago

That does impact our use case, I'll run a test with it tomorrow.

Although I can't see why that method would be hit when the credit is zero. Wouldn't this mess up the amount of inflight messages (or worse)?

In answer to your earlier question:

It is a receiving link and I don't have a simple script to reproduce it to hand, I could hopefully make one on monday if it helps.

We have a relatively long running on message handler (approx 1min) and the issue happens after a minute or so of consistent message flow. The credit window is 0 and when the message hander is finished we call accept() and add_credit(1), This is the only time we call this.

grs commented 5 years ago

I agree, I don't see why the credit is zero when a transfer is received unless the credit is reduced by the application (or there is some other aspect I have not yet understood).

If you are able to create a reproducer of some kind that would really help tracking it down.

MattOakley commented 5 years ago

We are sending to queues on sender links as well, not sure if that could be important?

MattOakley commented 5 years ago

Below is a reproducer script:

The change (https://github.com/amqp/rhea/commit/230e71767e25991f02cb564bf656af1faa8a48d5) fixes the connection error being thrown but, now writes a lot of Received transfer when credit was 0 messages. Also there's a typo which makes the logs print out NaN.

import * as container from "rhea";

const queueName = "rheaTest";
let senderConnection = container.connect({ port: 8091, host: "localhost" });
let receiverConnection = container.connect({ port: 8091, host: "localhost" });

senderConnection.on("error", async e => {
    console.log("Sender Error: " + e);
    debugger;
});

receiverConnection.on("error", async e => {
    console.log("Receiver Error: " + e);
    debugger;
});
receiverConnection.on("receiver_open", (context) => {
    context.receiver.flow(100);
});

let sender = senderConnection.open_sender({
    name: queueName,
    snd_settle_mode: 0,
    rcv_settle_mode: 0,
    target: {
        durable: 2,
        address: queueName
    }
});

container["once"]("sendable", (context: container.EventContext) => {
    let message = Array.from({ length: 250 }, (x, i) => i).join("");;
    for (let i = 0; i < 1000; i++) {
        let x = sender.send({ body: message, application_properties: [], durable: true, correlation_id: i });
    }

    let receiver = receiverConnection.open_receiver({
        name: queueName,
        autoaccept: false,
        rcv_settle_mode: 0,
        snd_settle_mode: 0,

        // Issue flow manually for async message handler.
        credit_window: 0,
        source: {
            durable: 2,
            address: queueName
        }
    });

    receiver.on("message", async (context: container.EventContext) => {
        console.log("received: " + context.message.correlation_id);

        let seconds = Math.floor(Math.random() * 6 + 25);
        await waitSeconds(seconds);

        console.log("resolved: " + context.message.correlation_id);
        context.delivery.accept();
        receiver.add_credit(1);
    });

    async function waitSeconds(seconds: number): Promise<{}> {
        return new Promise(resolve => {
            setTimeout(() => {
                resolve();
            }, seconds * 1000);
        });
    };
});

Running node 10 on windows Using RabbitMQ with amqp 1.0

The message Received transfer when credit was 0 was printed out 49 times for 1000 messages

grs commented 5 years ago

Thanks! I can reproduce with that and it looks like a bug in the rabbitmq 1.0 support (I'm using v3.6.16 of the 1.0 plugin in my test). It is not honouring credit at all. If a receiver link is attached and no credit is issued, the broker still sends messages. I've verified that with an entirely different client.

grs commented 5 years ago

Fyi I've raised the following issue: https://github.com/rabbitmq/rabbitmq-amqp1.0/issues/67

grs commented 5 years ago

Above issue is fixed from 3.7 apparently; I upgraded and indeed the simple issue is gone. However I can still reproduce the problem and it does still look to me like an error from the server where it sends transfers for which credit has not been issued, just more subtle.

grs commented 5 years ago

I've confirmed with wireshark trace that rabbitmq server will occasionally send more transfers than it is entitled from granted credit. Issue is already reported https://github.com/rabbitmq/rabbitmq-amqp1.0/issues/50

grs commented 5 years ago

@MattOakley I have fixed the typo in the error message. The current code ensures that the credit counter doesn't go negative when the server violates the protocol by sending more transfers than credits granted.

We could close the receiver with an error since the server is violating the protocol. If that would be preferred, let's raise a separate issue for that.

MattOakley commented 5 years ago

Ok, thanks for your help