apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[improve][broker] Supplement schema ledger if schema ledger is lost #20414

Open Denovo1998 opened 1 year ago

Denovo1998 commented 1 year ago

Motivation

#17221 describes an environment in which multiple bookie replicas are corrupted or a ledger has been deleted. When the schema ledger is lost, new producers and consumers cannot even be created, let alone work properly.

Following the approach of PR #18010, enabling autoSkipNonRecoverableData and skipping the lost schema can leave the schema information incomplete. In addition, the existing code deletes the schema metadata when the schema is corrupted: https://github.com/apache/pulsar/blob/a953027aad38c9f54e952133949280ec2f4c04e8/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/schema/SchemaRegistryServiceImpl.java#L564-L570 A schema is deleted when the error is not recoverable, but PRs #18010 and #19882 made NoSuchLedgerExistsOnMetadataServerException count as a recoverable exception as well.

So we need a solution that does not just skip the schema whose ledger is missing, but actually restores the broken schema ledger.

Solution

Add a new method called tryCompleteTheLostSchemaLedger. When the schema ledger has been lost and a new consumer subscribes or a new producer is created, an error such as "Failed to open gotten" is raised; at that point, call tryCompleteTheLostSchemaLedger.

CompletableFuture<Long> tryCompleteTheLostSchemaLedger(String schemaId, SchemaVersion version, SchemaData schema);

This method attempts to create a new ledger, save the SchemaData in it, and then update the metadata with the new ledger ID. Today, producers and consumers that are already connected keep working even if the schema ledger is deleted. To obtain the SchemaData, we need to store the SchemaData and SchemaVersion information in the topic (org.apache.pulsar.broker.service.AbstractTopic) and pass them in when calling tryCompleteTheLostSchemaLedger.
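The repair flow described above can be sketched roughly as follows. This is only an illustrative in-memory model, not the code in the PR: the class SchemaLedgerRepairSketch, its two maps, and the helper methods putSchema, loseLedger, and getSchema are all hypothetical stand-ins for BookKeeper ledgers and the schema metadata, and the schema version is simplified to a long and the schema data to a byte array.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-memory model; none of these types are Pulsar or BookKeeper classes. */
final class SchemaLedgerRepairSketch {
    // ledgerId -> schema payload (stand-in for the ledgers stored on bookies)
    private final Map<Long, byte[]> ledgers = new ConcurrentHashMap<>();
    // "schemaId@version" -> ledgerId (stand-in for the schema metadata)
    private final Map<String, Long> index = new ConcurrentHashMap<>();
    private final AtomicLong nextLedgerId = new AtomicLong(0);

    private static String key(String schemaId, long version) {
        return schemaId + "@" + version;
    }

    /** Normal registration: write the payload to a fresh ledger and index it. */
    long putSchema(String schemaId, long version, byte[] schema) {
        long ledgerId = nextLedgerId.getAndIncrement();
        ledgers.put(ledgerId, schema.clone());
        index.put(key(schemaId, version), ledgerId);
        return ledgerId;
    }

    /** Simulates the failure: the ledger disappears while the metadata still points at it. */
    void loseLedger(long ledgerId) {
        ledgers.remove(ledgerId);
    }

    /** Fails when the metadata points at a ledger that no longer exists. */
    CompletableFuture<byte[]> getSchema(String schemaId, long version) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        Long ledgerId = index.get(key(schemaId, version));
        byte[] data = ledgerId == null ? null : ledgers.get(ledgerId);
        if (data == null) {
            result.completeExceptionally(
                    new IllegalStateException("schema ledger missing for " + key(schemaId, version)));
        } else {
            result.complete(data.clone());
        }
        return result;
    }

    /**
     * Repair step: create a new ledger, store the schema kept by the topic,
     * then repoint the metadata entry at the new ledger id.
     */
    CompletableFuture<Long> tryCompleteTheLostSchemaLedger(String schemaId, long version, byte[] schema) {
        return CompletableFuture.supplyAsync(() -> {
            long newLedgerId = nextLedgerId.getAndIncrement();
            ledgers.put(newLedgerId, schema.clone());
            index.put(key(schemaId, version), newLedgerId);
            return newLedgerId;
        });
    }
}
```

In this model the caller has to supply the schema payload itself, which is exactly why the proposal needs the topic (or some other component) to keep a copy of SchemaData and SchemaVersion.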

Alternatives

  1. In the broker, org.apache.pulsar.broker.service.Producer and org.apache.pulsar.broker.service.Consumer do not save SchemaData and SchemaVersion, and tryCompleteTheLostSchemaLedger is called only through the admin API. Perhaps we should implement this directly in the upload schema function (https://pulsar.apache.org/docs/3.2.x/admin-api-schemas/#upload-a-schema); we would then need to pass an additional flag indicating whether to register a new schema or to restore the missing one. For compatibility, the default behavior should of course be to register a new schema.
  2. The corresponding schema information is also saved on the client side. Perhaps the broker could send a request to each connecting consumer or producer to obtain the schema information? (Today, connected producers and consumers can work even if the schema ledger is deleted.) This way, schema information does not need to be cached on the broker side.
  3. Store the SchemaData and SchemaVersion information in the org.apache.pulsar.broker.service.Producer and org.apache.pulsar.broker.service.Consumer objects that are connected or subscribed to the topic on the broker side. (This is not a complete alternative; it only covers where to keep the SchemaData and SchemaVersion needed to restore a lost schema.)
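To make the flag idea in alternative 1 more concrete, here is a minimal sketch of an upload call that either registers a new version or restores a lost one. Everything in it is an assumption for illustration: the class UploadSchemaFlagSketch, the Mode enum, and the versionToRepair parameter do not exist in the Pulsar admin API, and a lost ledger is modeled simply as a null slot in a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of alternative 1; this is not the Pulsar admin API. */
final class UploadSchemaFlagSketch {
    /** REGISTER is the default so that existing callers keep the current behavior. */
    enum Mode { REGISTER, REPAIR }

    // List index = schema version; a null element models a version whose ledger was lost.
    private final List<byte[]> versions = new ArrayList<>();

    /** Default upload: always registers a new schema version. */
    long upload(byte[] schema) {
        return upload(schema, Mode.REGISTER, -1);
    }

    /** Upload with the extra flag; versionToRepair is only read in REPAIR mode. */
    long upload(byte[] schema, Mode mode, long versionToRepair) {
        if (mode == Mode.REGISTER) {
            versions.add(schema.clone());
            return versions.size() - 1;
        }
        int v = (int) versionToRepair;
        if (v < 0 || v >= versions.size()) {
            throw new IllegalArgumentException("unknown version " + versionToRepair);
        }
        if (versions.get(v) != null) {
            throw new IllegalStateException("version " + v + " is intact; nothing to repair");
        }
        versions.set(v, schema.clone()); // refill the lost slot, version number unchanged
        return v;
    }

    /** Simulates losing the ledger that backs the given version. */
    void loseVersion(long version) {
        versions.set((int) version, null);
    }

    boolean isLost(long version) {
        return versions.get((int) version) == null;
    }
}
```

The point of the sketch is the compatibility argument: callers that do not pass the flag still get a new version, while a repair keeps the original version number so that existing references to it remain valid.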

Anything else?

Please take a look at the alternatives and share your ideas for discussion. I will update the implementation in the PR accordingly.

Denovo1998 commented 1 year ago

@poorbarcode @codelipenghui @rdhabalia PTAL!

I came up with a new idea to solve the problem of schema ledger loss.

I have tested tryCompleteTheLostSchemaLedger() in org.apache.pulsar.broker.service.schema.SchemaServiceTest#testSchemaLedgerLost and it works: new producers and consumers can be created and work.

But first we need to discuss how to obtain the SchemaVersion and SchemaData in the SchemaRegistry. Are the approaches in the Solution and Alternatives sections feasible, or do you have any other suggestions?

The code is in #20415 (some of the work is not finished yet).

github-actions[bot] commented 1 year ago

The issue had no activity for 30 days, mark with Stale label.

Denovo1998 commented 1 year ago

I am waiting for a discussion on whether this plan is feasible. I will send an email to discuss it later.

github-actions[bot] commented 1 year ago

The issue had no activity for 30 days, mark with Stale label.

Denovo1998 commented 10 months ago

The implementation approach in the Alternatives section has been updated. It still needs to be discussed.