zbentley opened this issue 2 years ago
Broker logs during the unload of a wedged topic are attached here: Logs.txt
It seems that one of the consumers is not acknowledging the message. Could you share the partitioned internal stats of the topic?
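For anyone reproducing this, those stats can be pulled over the broker's admin HTTP endpoint. A minimal sketch; the broker address is hypothetical and the `partitioned-internalStats` path segment is an assumption based on the v2 admin REST API, so adjust for your deployment:

```python
import json
import urllib.request

def partitioned_internal_stats_url(admin_base, topic):
    """Build the admin v2 URL for a persistent topic's partitioned internal
    stats. `topic` is the short form "tenant/namespace/name"; the
    "partitioned-internalStats" path segment is an assumption."""
    return f"{admin_base}/admin/v2/persistent/{topic}/partitioned-internalStats"

def fetch_partitioned_internal_stats(admin_base, topic):
    # Requires a reachable broker, so this call is not executed here.
    with urllib.request.urlopen(partitioned_internal_stats_url(admin_base, topic)) as resp:
        return json.load(resp)

# Example (hypothetical broker address):
url = partitioned_internal_stats_url(
    "http://localhost:8080",
    "sre1/chariot_namespace_perform_badging/chariot_topic_perform_badging",
)
```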
Here are the internal stats:
{
  "metadata" : {
    "partitions" : 4
  },
  "partitions" : {
    "persistent://sre1/chariot_namespace_perform_badging/chariot_topic_perform_badging-partition-3" : {
      "entriesAddedCounter" : 1,
      "numberOfEntries" : 1,
      "totalSize" : 665,
      "currentLedgerEntries" : 1,
      "currentLedgerSize" : 674,
      "lastLedgerCreatedTimestamp" : "2022-05-16T01:56:00.543Z",
      "waitingCursorsCount" : 1,
      "pendingAddEntriesCount" : 0,
      "lastConfirmedEntry" : "45597:0",
      "state" : "ClosedLedger",
      "ledgers" : [ ],
      "cursors" : {
        "chariot_subscription-perform_badging-perform_badging_1" : {
          "markDeletePosition" : "45597:0",
          "readPosition" : "45597:1",
          "waitingReadOp" : true,
          "pendingReadOps" : 0,
          "messagesConsumedCounter" : 1,
          "cursorLedger" : 45917,
          "cursorLedgerLastEntry" : 1,
          "individuallyDeletedMessages" : "[]",
          "lastLedgerSwitchTimestamp" : "2022-05-16T18:55:05.397Z",
          "state" : "Open",
          "numberOfEntriesSinceFirstNotAckedMessage" : 1,
          "totalNonContiguousDeletedMessagesRange" : 0,
          "subscriptionHavePendingRead" : true,
          "subscriptionHavePendingReplayRead" : false,
          "properties" : { }
        }
      },
      "schemaLedgers" : [ ],
      "compactedLedger" : {
        "ledgerId" : -1,
        "entries" : -1,
        "size" : -1,
        "offloaded" : false,
        "underReplicated" : false
      }
    },
    "persistent://sre1/chariot_namespace_perform_badging/chariot_topic_perform_badging-partition-2" : {
      "entriesAddedCounter" : 0,
      "numberOfEntries" : 0,
      "totalSize" : 0,
      "currentLedgerEntries" : 0,
      "currentLedgerSize" : 0,
      "lastLedgerCreatedTimestamp" : "2022-05-15T21:21:00.54Z",
      "waitingCursorsCount" : 1,
      "pendingAddEntriesCount" : 0,
      "lastConfirmedEntry" : "45499:-1",
      "state" : "LedgerOpened",
      "ledgers" : [ ],
      "cursors" : {
        "chariot_subscription-perform_badging-perform_badging_1" : {
          "markDeletePosition" : "45499:-1",
          "readPosition" : "45499:0",
          "waitingReadOp" : true,
          "pendingReadOps" : 0,
          "messagesConsumedCounter" : 0,
          "cursorLedger" : -1,
          "cursorLedgerLastEntry" : -1,
          "individuallyDeletedMessages" : "[]",
          "lastLedgerSwitchTimestamp" : "2022-05-15T21:21:00.543Z",
          "state" : "NoLedger",
          "numberOfEntriesSinceFirstNotAckedMessage" : 1,
          "totalNonContiguousDeletedMessagesRange" : 0,
          "subscriptionHavePendingRead" : true,
          "subscriptionHavePendingReplayRead" : false,
          "properties" : { }
        }
      },
      "schemaLedgers" : [ ],
      "compactedLedger" : {
        "ledgerId" : -1,
        "entries" : -1,
        "size" : -1,
        "offloaded" : false,
        "underReplicated" : false
      }
    },
    "persistent://sre1/chariot_namespace_perform_badging/chariot_topic_perform_badging-partition-1" : {
      "entriesAddedCounter" : 0,
      "numberOfEntries" : 1,
      "totalSize" : 677,
      "currentLedgerEntries" : 0,
      "currentLedgerSize" : 0,
      "lastLedgerCreatedTimestamp" : "2022-05-17T11:57:00.554Z",
      "waitingCursorsCount" : 1,
      "pendingAddEntriesCount" : 0,
      "lastConfirmedEntry" : "45742:0",
      "state" : "LedgerOpened",
      "ledgers" : [ ],
      "cursors" : {
        "chariot_subscription-perform_badging-perform_badging_1" : {
          "markDeletePosition" : "45742:0",
          "readPosition" : "45742:1",
          "waitingReadOp" : true,
          "pendingReadOps" : 0,
          "messagesConsumedCounter" : 0,
          "cursorLedger" : -1,
          "cursorLedgerLastEntry" : -1,
          "individuallyDeletedMessages" : "[]",
          "lastLedgerSwitchTimestamp" : "2022-05-17T11:57:00.556Z",
          "state" : "NoLedger",
          "numberOfEntriesSinceFirstNotAckedMessage" : 1,
          "totalNonContiguousDeletedMessagesRange" : 0,
          "subscriptionHavePendingRead" : true,
          "subscriptionHavePendingReplayRead" : false,
          "properties" : { }
        }
      },
      "schemaLedgers" : [ ],
      "compactedLedger" : {
        "ledgerId" : -1,
        "entries" : -1,
        "size" : -1,
        "offloaded" : false,
        "underReplicated" : false
      }
    },
    "persistent://sre1/chariot_namespace_perform_badging/chariot_topic_perform_badging-partition-0" : {
      "entriesAddedCounter" : 0,
      "numberOfEntries" : 0,
      "totalSize" : 0,
      "currentLedgerEntries" : 0,
      "currentLedgerSize" : 0,
      "lastLedgerCreatedTimestamp" : "2022-05-17T11:57:00.566Z",
      "waitingCursorsCount" : 1,
      "pendingAddEntriesCount" : 0,
      "lastConfirmedEntry" : "46306:-1",
      "state" : "LedgerOpened",
      "ledgers" : [ ],
      "cursors" : {
        "chariot_subscription-perform_badging-perform_badging_1" : {
          "markDeletePosition" : "46306:-1",
          "readPosition" : "46306:0",
          "waitingReadOp" : true,
          "pendingReadOps" : 0,
          "messagesConsumedCounter" : 0,
          "cursorLedger" : -1,
          "cursorLedgerLastEntry" : -1,
          "individuallyDeletedMessages" : "[]",
          "lastLedgerSwitchTimestamp" : "2022-05-17T11:57:00.568Z",
          "state" : "NoLedger",
          "numberOfEntriesSinceFirstNotAckedMessage" : 1,
          "totalNonContiguousDeletedMessagesRange" : 0,
          "subscriptionHavePendingRead" : true,
          "subscriptionHavePendingReplayRead" : false,
          "properties" : { }
        }
      },
      "schemaLedgers" : [ ],
      "compactedLedger" : {
        "ledgerId" : -1,
        "entries" : -1,
        "size" : -1,
        "offloaded" : false,
        "underReplicated" : false
      }
    }
  }
}
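Worth noting: in the stats above, every cursor's markDeletePosition equals its partition's lastConfirmedEntry, so the broker-side stats agree that nothing is unacknowledged. A small sketch of that check (field names taken from the output above; positions are "ledgerId:entryId" strings):

```python
def pos(p):
    """Parse a "ledgerId:entryId" position into a comparable tuple."""
    ledger, entry = p.split(":")
    return (int(ledger), int(entry))

def cursors_with_backlog(stats):
    """Yield (partition, cursor) pairs whose markDeletePosition trails the
    partition's lastConfirmedEntry, i.e. confirmed entries the subscription
    has not yet acknowledged."""
    for partition, p in stats["partitions"].items():
        last = pos(p["lastConfirmedEntry"])
        for cursor, c in p.get("cursors", {}).items():
            if pos(c["markDeletePosition"]) < last:
                yield partition, cursor

# Trimmed copy of partition-3 from the dump above; markDeletePosition equals
# lastConfirmedEntry, so no unacked backlog is reported:
sample = {
    "partitions": {
        "chariot_topic_perform_badging-partition-3": {
            "lastConfirmedEntry": "45597:0",
            "cursors": {
                "chariot_subscription-perform_badging-perform_badging_1": {
                    "markDeletePosition": "45597:0"
                }
            }
        }
    }
}
print(list(cursors_with_backlog(sample)))  # → []
```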
@Technoboy if a consumer is not acking a message, why does Prometheus report no messages in backlog and a zero or negative backlog size? Similarly, it reports no subscription backlogs. That also wouldn't explain why unloading the topic resolves the problem.
Grafana screenshots below:
This issue has had no activity for 30 days; marking it with the Stale label.
Describe the bug
Using partitioned topics and KeyShared subscriptions, when a time-based quota is exceeded, the quota is not properly "cleared": producer creation and publication still fail with ProducerBlockedQuotaExceeded even when there is no backlog on the topic.
Unloading the topic temporarily resolves the issue, but it reoccurs.
Behavior
Earlier today, we had a production user with a topic that got backlogged due to consumer shutdown, and their producers all got ProducerBlockedQuotaExceededExceptions (well, actually they got UnknownErrors because of https://github.com/apache/pulsar/issues/15078, but the logger showed the ProducerBlockedQuotaExceededException).
However, once consumers started and drained the backlog (pulsar_subscription_back_log reported 0 in Prometheus for the only subscription on the topic), producers kept hitting ProducerBlockedQuotaExceededException. New producers/new processes had the issue as well.
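To confirm what the dashboard shows, the broker's Prometheus endpoint can be scraped directly and the pulsar_subscription_back_log samples inspected. A minimal parser sketch for exposition-format text; the sample line and its label set below are illustrative, not taken from the real scrape:

```python
def metric_samples(metrics_text, metric="pulsar_subscription_back_log"):
    """Return {label-block: value} for every sample of `metric` found in
    Prometheus exposition-format text."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip HELP/TYPE comments and other metrics; match "metric{...} value".
        if not line.startswith(metric + "{"):
            continue
        labels, _, value = line.rpartition(" ")
        samples[labels[len(metric):]] = float(value)
    return samples

# Illustrative exposition line (label names are an assumption):
text = 'pulsar_subscription_back_log{topic="t1",subscription="s1"} 0.0'
print(metric_samples(text))  # → {'{topic="t1",subscription="s1"}': 0.0}
```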
Unloading the topic temporarily resolved the issue, but it reoccurred repeatedly. Deleting/re-creating the topic also resolved the issue, but it also reoccurred.
This issue DOES reoccur even if consumers are present on the topic. There appears to be a risk of it occurring every time the topic's backlog drops to 0.
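For reference, the unload workaround described above can be scripted against the admin REST API as well as the CLI. A sketch under the assumption of a v2 admin endpoint on the broker's HTTP port; the broker address is hypothetical:

```python
import urllib.request

def unload_url(admin_base, topic):
    """Admin v2 unload endpoint for a persistent topic given as
    "tenant/namespace/name"; individual partitions can be addressed
    with their "-partition-N" suffix."""
    return f"{admin_base}/admin/v2/persistent/{topic}/unload"

def unload_topic(admin_base, topic):
    # PUT with an empty body; needs a reachable broker, so not executed here.
    req = urllib.request.Request(unload_url(admin_base, topic), method="PUT")
    return urllib.request.urlopen(req)

url = unload_url(
    "http://localhost:8080",
    "sre1/chariot_namespace_perform_badging/chariot_topic_perform_badging",
)
```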
To reproduce
Broker heap dump
Available on request; it's too big for a GH attachment.
Context
Linux, Client 2.8.1, broker 2.8.1, deployed either standalone or in StreamNative Platform
Topics have 4 partitions
All producers use key-based batching, all consumers use KeyShared subscription mode.
Topic has a single KeyShared subscription.
Policies on the namespace (no topic-level policies in use):
Output of partitioned-stats for the topic: