Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.
MIT License
2.07k stars 1.19k forks source link

[Service Bus] TypeError: Cannot read property '_process' of undefined #13500

Closed ramya-rao-a closed 3 years ago

ramya-rao-a commented 3 years ago

Following up for @azure/service-bus v7 package issues from https://github.com/Azure/azure-sdk-for-js/issues/9792#issuecomment-746480634 by @otociulis and @NoelAbrahams Quoting @otociulis:

we're using "@azure/service-bus": "^7.0.0", (latest at the moment of writing) and we are running into same issue when running our microservice utilizing service bus in the Azure Cloud - usually this happens after longer periods of inactivity:

 /home/azure-user/render-service/node_modules/rhea/lib/connection.js:731
             this.local_channel_map[k]._process();
                                       ^
 TypeError: Cannot read property '_process' of undefined
     at Connection._process (/home/azure-user/render-service/node_modules/rhea/lib/connection.js:731:39)
     at processTicksAndRejections (internal/process/task_queues.js:79:11)

Typescript version: 4.0.5 Node.js version: 12.18.4 Rhea version (from package-lock.json): 1.0.24

ramya-rao-a commented 3 years ago

@otociulis,

Can you share some details on how your microservice is utilizing Service Bus? Which methods of the ServiceBusClient are you using and how are you re-using the client across operations? You mentioned inactivity, was this for sending or receiving messages? Do you have any logs that can show what was happening with the client just before this error showed up?

@NoelAbrahams, same questions for you.

@chradek will pick this issue up on Monday (Feb 1) and may have more follow up questions.

NoelAbrahams commented 3 years ago

@ramya-rao-a

This is all the stack we have at the moment.

Cannot read property '_process' of undefined
at Connection._process (E:\sitesroot\foo-bar\node_modules\rhea\lib\connection.js:731:39)

We use the service bus to implement push notifications. Basically, a new message is pushed onto a service bus topic and we have subscribers listening to the topic. When they receive a message, they dispatch it to the user's device. We use the recieveMode of 'receiveAndDelete'.

That's about it. But this error keeps getting logged periodically.

chradek commented 3 years ago

@NoelAbrahams Can you share your package-lock.json or package.json? Can you tell if there are multiple versions of rhea installed in your project?

When you say this periodically occurs, is it after some sort of event like a temporary network issue, or is it every ~N minutes?

If possible, can you also enable logging with the following end variable setting and share the results?: export DEBUG=azure*,rhea*,-rhea:raw,-rhea:message,-azure:core-amqp:datatransformer

otociulis commented 3 years ago

Hi @ramya-rao-a ,

thanks for picking this up. We are using SB as a communication channel to push updates between Nest.js microservices - one of the microservice is running as AppService on Azure and other one is a separate Azure VM running on Ubuntu Linux in Standard NC6_Promo (6 vcpus, 56 GiB memory) instance. The VM is long-running (weeks of operation) and it consumes and replies back on SB for messages sent from AppService(s) which can come in days apart.

I'm using ServiceBusClient with DefaultAzureCredentials identity like this:

const identity = new DefaultAzureCredential();
this.serviceBusClient = new ServiceBusClient(namespace, identity);

this.receiver = this.serviceBusClient.createReceiver(
    source.topic,
    source.subscription,
);

if (destination) {
    this.sender = this.serviceBusClient.createSender(destination.topic);
}

this.receiver.subscribe(this, {
    autoCompleteMessages: false,
});

inside a class that implements MessageHandlers interface; I implemented processMessage that parses incoming message and either completes it or abandons:

try {
    // Do processing stuff and generate replies

    await Promise.all(
        outgoingMessages.map((m) => this.sender.sendMessages(m)),
    );

    await this.receiver.completeMessage(message);
} catch (ex) {
    await this.receiver.abandonMessage(message);
    throw ex;
}

I also have processError implemented that just writes the error message into logging system.

From what I see on my end is that everything works fine until it doesn't because of some kind of internal timeout - once I restart the service everything starts working again.

I have only this rhea version in package lock:

"rhea": {
  "version": "1.0.24",
  "resolved": "https://registry.npmjs.org/rhea/-/rhea-1.0.24.tgz",
  "integrity": "sha512-PEl62U2EhxCO5wMUZ2/bCBcXAVKN9AdMSNQOrp3+R5b77TEaOSiy16MQ0sIOmzj/iqsgIAgPs1mt3FYfu1vIXA==",
  "requires": {
    "debug": "0.8.0 - 3.5.0"
  }
},
chradek commented 3 years ago

@otociulis Is it the VM service that runs for weeks that's seeing this issue, but not the AppService-based microservice? Do you know about how long it takes for you to see this error after you start your service? How long after the last message you received?

We've tried reproducing this issue in the past with no luck, but I'm wondering if a long period of inactivity is a key part of this.

Sahana-Ramdurg commented 3 years ago

We have upgraded to azure service bus sdk version to v7.0.3 two weeks back. We also have messageHandler and errorHandler in place where it logs error messages in case of any issues. Very often we see that the subscriber stops receiving messages and there are no error logs at the subscriber end. After restart, subscriber starts processing the messages. Its very sporadic and difficult to reproduce and debug as well.

We had faced this issue in previous versions and had raised an issue with Microsoft earlier. Please refer to GIT issue - https://github.com/Azure/azure-sdk-for-js/issues/9926#issue-652268637). Upon Microsoft suggestion we upgraded to v7.0.3.

After upgrade also we are still facing the same issue which has high impact on our end consumers.

@ramya-rao-a Can you please help us with the above issue.

ramya-rao-a commented 3 years ago

@Sahana-Ramdurg This issue is specific to the TypeError: Cannot read property '_process' of undefined scenario which is not your case. Please log a support ticket in the Azure portal with information around the timestamp for when you see the subscriber stop receiving messages, so that they can also check the server side logs to see if we can learn something from there. Please ask for the Azure SDK team to be looped in at the earliest as per this comment. Also share a code snippet of how you are using the ServiceBusClient and whether you are using sessions or not

Sahana-Ramdurg commented 3 years ago

@ramya-rao-a Thanks a lot for your reply. We have created support ticket - 2103120060003217

chradek commented 3 years ago

Unfortunately I haven't been able to reproduce this issue. We're going to close this issue due to inactivity, but if someone encounters it again please open a new issue and if possible provide logs by setting the DEBUG environment variable to DEBUG=azure*,rhea*,-rhea:raw,-rhea:message,-azure:core-amqp:datatransformer

chradek commented 3 years ago

Reopening as we've finally been able to reproduce this issue in one of our testing environments and I was able to track down what is happening.

So, the error that occurs is TypeError: Cannot read property 'process' of undefined. This is being thrown from rhea's Connection.prototype._process method when iterating over the local session channels:

for (var k in this.local_channel_map) {
  this.local_channel_map[k]._process(); // <-- This is where the error is thrown
}

This looks like it should be completely safe at first glance. If we have 3 sessions on the connection, we'll iterate over them from the local_channel_map and call _process() on each one. If for some reason a session is deleted from this.local_channel_map, it isn't a big deal because the for..in is smart enough to skip deleted members.

However, there is also Connection.prototype.remove_all_sessions which we end up calling in our disconnected logic. It completely replaces this.local_channel_map:

Connection.prototype.remove_all_sessions = function() {
  this.remote_channel_map = {};
  this.local_channel_map = {}; // <-- Goodbye old object!
};

I did notice this when I investigated this error in the past and suspected the issue had to do with remove_all_sessions somehow, but couldn't figure out a code path where it actually would. This method would need to be invoked somehow while iterating over the local_channel_map inside the Connection.prototype._process method. This would lead to the error being thrown because the this.local_channel_map in the for statement would be a different object than this.local_channel_map in the for body. That didn't seem possible because rhea doesn't ever call remove_all_sessions, only we do in our SDKs in response to the disconnected event. I figured surely the disconnected event would need to wait for the for..in loop to complete before it could be emitted since that loop is synchronous and thus blocking.

Oh, but there's always a twist.

Just a quick reminder, this._local_channel_map contains Session objects. When _process is called on a session, that in turn calls _process on each of the session's links. If a link need to write something, such as a flow frame, rhea ends up calling transport.write on the connection, which takes any pending frames that need to be written and sends them over the socket.

This is where things get interesting. If the socket was already closed for some reason, we get a writeAfterEnd error, which the rhea connection immediately handles and emits the disconnected event. Note that this event is emited synchronously, so the call stack still shows this is all triggered from that session._process that was invoked. Finally, our SDK's disconnected event handler removes all the sessions not by deleting them individually (which would be safe), but by replacing the local_channel_map object via remove_all_sessions.

In our test that is failing, we're calling connection.idle, which will cause the socket to end. I suspect that the users who see this would also have encountered the idle timeout at just the right moment for this sequence of events to occur, but really anything that causes the socket to be put in a closed state at the right time could lead to this.

So, how do we fix it?

The most straightforward way would be by updating rhea. There are 2 options:

  1. Update remove_all_sessions so that it deletes all the keys in the channel maps explicitly rather than overwriting the maps with new empty objects.
  2. Update _process to hold a reference to this._local_channel_map and use the same reference in both the for..in statement and the body of the for loop.

My vote would actually be to do both for total safety. The 1 issue we're going to run into is rhea has had a major version bump, so we'll either need to hotfix rhea 1.x, or absorb the breaking change present in rhea 2.x.