bug: Backend error log when multi-thread insert into table on GCS - Rate Limit Exceeded `last_snapshot_location_hint` #16187

Open · rad-pat opened 3 months ago

rad-pat commented 3 months ago


Version

v1.2.597-nightly

What's Wrong?

The following error is logged in the Databend Query pod when inserting into a table from multiple threads. Ultimately, all the data appears to be inserted, but presumably the error should not be generated.

d76ce823-6643-41dc-a245-f57dbaba513f 2024-08-05T09:48:23.524152Z ERROR opendal::layers::logging: logging.rs:1155 service=gcs operation=Writer::close path=1/175312/last_snapshot_location_hint written=0B -> data close failed: Unexpected (persistent) at Writer::close => GcsErrorResponse { error: GcsError { code: 429, message: "The object rhynl-pctb-lakehouse-bugfixes2/1/175312/last_snapshot_location_hint exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", errors: [GcsErrorDetail { domain: "usageLimits", location: "", location_type: "", message: "The object rhynl-pctb-lakehouse-bugfixes2/1/175312/last_snapshot_location_hint exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", reason: "rateLimitExceeded" }] } }
Context:
   uri: https://storage.googleapis.com/upload/storage/v1/b/rhynl-pctb-lakehouse-bugfixes2/o?uploadType=media&name=1/175312/last_snapshot_location_hint
   response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json; charset=UTF-8", "date": "Mon, 05 Aug 2024 09:48:23 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "x-guploader-uploadid": "AHxI1nOZ2p1QQ22INPuW-7zI4yu6WX3wyE7vfDPlGrPf3p7_dgntzJOilCzv4o87vG-3MmaME64CnbkATg", "content-length": "681", "server": "UploadServer"} }
   service: gcs
   path: 1/175312/last_snapshot_location_hint
   written: 53
d76ce823-6643-41dc-a245-f57dbaba513f 2024-08-05T09:48:23.524203Z  WARN databend_common_storages_fuse::operations::commit: commit.rs:277 write last snapshot hint failure. Unexpected (persistent) at Writer::close, context: { uri: https://storage.googleapis.com/upload/storage/v1/b/rhynl-pctb-lakehouse-bugfixes2/o?uploadType=media&name=1/175312/last_snapshot_location_hint, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json; charset=UTF-8", "date": "Mon, 05 Aug 2024 09:48:23 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "x-guploader-uploadid": "AHxI1nOZ2p1QQ22INPuW-7zI4yu6WX3wyE7vfDPlGrPf3p7_dgntzJOilCzv4o87vG-3MmaME64CnbkATg", "content-length": "681", "server": "UploadServer"} }, service: gcs, path: 1/175312/last_snapshot_location_hint, written: 53 } => GcsErrorResponse { error: GcsError { code: 429, message: "The object rhynl-pctb-lakehouse-bugfixes2/1/175312/last_snapshot_location_hint exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", errors: [GcsErrorDetail { domain: "usageLimits", location: "", location_type: "", message: "The object rhynl-pctb-lakehouse-bugfixes2/1/175312/last_snapshot_location_hint exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", reason: "rateLimitExceeded" }] } }

How to Reproduce?

The script below is Python, but a multi-threaded insert into a table should be easy to replicate with other clients:

import sqlalchemy as sa
from concurrent.futures import ThreadPoolExecutor, as_completed

def insert_into_table(con):
    # Each call runs a single-row INSERT in its own transaction, so each
    # thread triggers a separate commit on the same table.
    with con.begin() as c:
        c.execute(sa.text('INSERT INTO t (a, b) VALUES (1, 2)'))

def main():
    con = sa.create_engine('databend://root:pwd@databend-query/default?sslmode=disable')
    with con.begin() as c:
        c.execute(sa.text('DROP TABLE IF EXISTS t'))
        c.execute(sa.text('CREATE TABLE t (a int not null, b int not null)'))

    # 256 concurrent single-row inserts: every commit rewrites the same
    # last_snapshot_location_hint object in GCS.
    with ThreadPoolExecutor(max_workers=256) as executor:
        futures = []
        for _ in range(256):
            futures.append(executor.submit(insert_into_table, con))

        for future in as_completed(futures):
            future.result()

if __name__ == '__main__':
    main()
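
Until a fix lands, one client-side mitigation is to reduce the number of commits hitting the same hint object, for example by batching rows into a single INSERT. A minimal sketch under that assumption, reusing the table and engine from the script above (insert_batch is a hypothetical helper, not part of the report):

import sqlalchemy as sa

def insert_batch(engine, rows):
    # One multi-row INSERT means one table commit (and one hint-file
    # mutation) instead of one per row.
    values = ', '.join(f'({a}, {b})' for a, b in rows)
    with engine.begin() as c:
        c.execute(sa.text(f'INSERT INTO t (a, b) VALUES {values}'))

# Equivalent load to the 256 single-row inserts above:
# insert_batch(con, [(1, 2)] * 256)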


inviscid commented 3 months ago

And just for GCS context, since it may differ from S3 or Azure Blob: a single object can't be mutated more than once per second. In this case it looks like the same file is being updated many times in quick succession due to the load from the multiple threads.
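
For illustration only (this is not Databend's code): a writer that hits this per-object limit is usually wrapped in retries with exponential backoff and jitter, along the lines of the sketch below. write_hint and RateLimitError are hypothetical stand-ins for the storage call and its 429 error.

import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a GCS 429 (rateLimitExceeded) response."""

def write_with_backoff(write_hint, payload, max_retries=5):
    # Retry a rate-limited object write with exponential backoff plus
    # jitter, spreading repeated mutations of the same object out past
    # GCS's roughly one-mutation-per-second-per-object limit.
    for attempt in range(max_retries):
        try:
            return write_hint(payload)
        except RateLimitError:
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError('hint write still rate-limited after retries')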

inviscid commented 3 months ago

@dantengsky please let us know if you need anything from us to help resolve this. This has unfortunately turned into a blocker for our go-live ramp.

dantengsky commented 3 months ago

> @dantengsky please let us know if you need anything from us to help resolve this. This has unfortunately turned into a blocker for our go-live ramp.

Thank you for letting us know about this issue.

> Ultimately, all the data appears to be inserted, but presumably the error should not be generated.

As you mentioned, the failure to write last_snapshot_location_hint does not prevent the transaction from being committed successfully.

The last_snapshot_location_hint is written on a best-effort basis. Currently, only the attach table functionality relies on this hint, which may read stale data (or fail to read) if the hint file has not been successfully written.

Normal table scans are not affected, as they do not depend on this hint file.
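
A minimal sketch of that best-effort commit pattern, in Python for illustration (the real implementation is Rust, in commit.rs; commit_snapshot and write_hint here are hypothetical stand-ins):

import logging

def commit_with_hint(commit_snapshot, write_hint, snapshot_location):
    # The snapshot commit is the authoritative step and happens first.
    commit_snapshot(snapshot_location)
    try:
        # Best-effort: overwrite last_snapshot_location_hint with the
        # new snapshot path. Only ATTACH TABLE reads this file.
        write_hint(snapshot_location)
    except Exception as exc:
        # A failure here (e.g. a GCS 429) must not fail the commit;
        # it is only logged, matching the warning seen in the report.
        logging.warning('write last snapshot hint failure. %s', exc)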

A new setting will be added to allow disabling the writing of the last_snapshot_location_hint if needed.
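
Once such a setting ships, a session could opt out of hint writes before running a heavily concurrent load. The setting name below is made up for illustration; the real name will come with the change:

import sqlalchemy as sa

# Hypothetical setting name, for illustration only.
HINT_SETTING = 'enable_last_snapshot_location_hint'

engine = sa.create_engine('databend://root:pwd@databend-query/default?sslmode=disable')
with engine.begin() as c:
    c.execute(sa.text(f'SET {HINT_SETTING} = 0'))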