I found some relevant information that might help address the timeouts you're experiencing with COPY INTO statements in Databend v1.2.629-nightly.
There are known issues with concurrent execution of COPY INTO statements in the version v1.2.629-nightly of Databend. For example, multiple COPY INTO operations from a stage to a table result in errors, and the target table remains empty [1][2][3].
max_files Setting:
The timeouts could be due to the max_files setting in the COPY INTO statement. If the max_files value is set too high, it can exceed the COPY_MAX_FILES_PER_COMMIT limit, causing performance issues and potential timeouts. Ensure that the max_files parameter is set to a value less than COPY_MAX_FILES_PER_COMMIT (15,000 files) to avoid these timeouts [4][5].
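For illustration, a minimal COPY INTO sketch that keeps max_files safely under that limit; the table name, stage name, and value here are hypothetical placeholders, not taken from the original report:
-- Hypothetical stage-to-table load kept below COPY_MAX_FILES_PER_COMMIT (15,000 files)
COPY INTO my_table
FROM @my_stage
PATTERN = '.*[.]csv'
FILE_FORMAT = (TYPE = CSV)
MAX_FILES = 10000;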
Resource Management:
Adjusting parameters like max_threads and max_memory_usage can help manage resource usage effectively. For example:
echo "set max_threads=4; set max_memory_usage=256000000; copy /*+ set_var(max_threads=4) set_var(max_memory_usage=256000000) */ into @s1/ from (select * from numbers(60000000)) max_file_size=64000000 detailed_output=true;" | $BENDSQL_CLIENT_CONNECT | wc -l | sed 's/ //g'
This example limits the number of threads and memory usage to optimize performance [3].
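The same limits can also be applied from a plain SQL session rather than the shell one-liner above; the values below are illustrative and should be tuned to the available hardware:
-- Illustrative session-level limits (adjust to your node size)
SET max_threads = 4;
SET max_memory_usage = 256000000;  -- in bytes

-- The hints can also be inlined in the COPY itself, as in the test above:
COPY /*+ set_var(max_threads=4) set_var(max_memory_usage=256000000) */
INTO @s1/ FROM (SELECT * FROM numbers(60000000))
MAX_FILE_SIZE = 64000000 DETAILED_OUTPUT = true;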
Distributed Copy:
Enabling distributed copy operations can help distribute the load and potentially reduce timeouts. You can set the enable_distributed_copy_into parameter:
set enable_distributed_copy_into = 1;
This setting can help manage distributed copy operations effectively [1].
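A brief sketch of how that setting might be combined with a stage-to-table load; the table and stage names are placeholders:
-- Enable distributed COPY for this session
SET enable_distributed_copy_into = 1;

-- Then run the load as usual; the work is distributed across query nodes
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = CSV);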
The specific error messages or codes related to timeouts in the Databend query pod logs are:
Timeout(1122)
TransactionTimeout(4003)
These error codes indicate timeout-related issues in the system [6].
By fine-tuning these parameters and settings, you can reduce the likelihood of timeouts and improve the efficiency of your COPY INTO operations. If you need further assistance, providing the data and COPY INTO statements to reproduce the issue could be helpful.
cc @Xuanwo PTAL. The error file contains many I/O timeouts:
b0389480-6f16-4d73-8b99-c434aa573c3e 2024-09-05T05:51:11.262697Z ERROR opendal::layers::logging: logging.rs:190 service=gcs name=jglll-pctb-temp-asp-solution path=data_0.csv: read failed Unexpected (persistent) at read => io timeout reached
b0389480-6f16-4d73-8b99-c434aa573c3e 2024-09-05T05:51:11.263285Z ERROR databend_query::pipelines::builders::builder_on_finished: builder_on_finished.rs:78 copy failed, reason: StorageOther. Code: 4000, Text = operation interrupted (Unexpected (persistent) at read, context: { timeout: 60 } => io timeout reached).
149e1116-6ed3-47e6-8ce4-db8314268903 2024-09-05T05:51:42.061686Z WARN opendal::layers::retry: retry.rs:260 will retry after 1.513893843s because: Unexpected (temporary) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media, called: http_util::Client::send, service: gcs, path: 113320/115525/_b/2928b3c1075d4da99bfff5312f97fa76_v2.parquet, range: 97946-98105 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media)
149e1116-6ed3-47e6-8ce4-db8314268903 2024-09-05T05:51:42.061771Z WARN opendal::layers::retry: retry.rs:260 will retry after 1.307216406s because: Unexpected (temporary) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media, called: http_util::Client::send, service: gcs, path: 113320/115525/_b/2928b3c1075d4da99bfff5312f97fa76_v2.parquet, range: 98369-105868 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media)
149e1116-6ed3-47e6-8ce4-db8314268903 2024-09-05T05:51:42.061828Z WARN opendal::layers::retry: retry.rs:260 will retry after 1.608490825s because: Unexpected (temporary) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media, called: http_util::Client::send, service: gcs, path: 113320/115525/_b/2928b3c1075d4da99bfff5312f97fa76_v2.parquet, range: 4-97848 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media)
Hi @rad-pat, it seems the server's network connection is very slow, causing it to fail both to connect and to resolve DNS.
1f32988b-3a43-4c33-82d9-7175f014fb5d 2024-09-05T05:53:46.457720Z WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"1f32988b-3a43-4c33-82d9-7175f014fb5d", processor id: NodeIndex(0), name: "BytesReader", elapsed: 155.043454247s, active sync workers: 0
1f32988b-3a43-4c33-82d9-7175f014fb5d 2024-09-05T05:53:49.313028Z ERROR opendal::layers::logging: logging.rs:190 service=gcs name=jglll-pctb-temp-asp-solution path=data_0.csv: read failed Unexpected (persistent) at read => send http request
Context:
url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media
called: http_util::Client::send
service: gcs
path: data_0.csv
range: 0-8388607
Source:
error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media): client error (Connect): operation timed out
1f32988b-3a43-4c33-82d9-7175f014fb5d 2024-09-05T05:53:49.313657Z ERROR databend_query::pipelines::builders::builder_on_finished: builder_on_finished.rs:78 copy failed, reason: StorageOther. Code: 4000, Text = operation interrupted (Unexpected (persistent) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media, called: http_util::Client::send, service: gcs, path: data_0.csv, range: 0-8388607 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media)).
2024-09-05T05:53:49.316210Z WARN hickory_proto::xfer::dns_exchange: dns_exchange.rs:215 failed to associate send_message response to the sender
Please check if the server's network capacity has been reached. Databend is extremely fast and can use up to 100Gbps when reading large files concurrently.
The error file contains many I/O timeouts:
Hi @wubx, thanks for helping with the debugging. Please note that an I/O timeout is a result, not a cause. When Databend raises a timeout, it simply indicates that something is slow; it doesn't mean Databend did anything wrong or unexpected.
We were using 8 vCPU machines and bumped them up to 48 vCPU machines. Google allocates about 2 Gbps per vCPU, so this should get us to the node maximum. After increasing the node size, the imports seem to be working properly now.
It is interesting that the nodes don't seem to have any kind of network prioritization for critical traffic like DNS. That's not a Databend problem; it just seems like Google would have set some sane priority defaults for critical network functions.
@Xuanwo , @wubx Thanks for helping us debug this issue and get it resolved. I'll close this now as it is not a Databend issue.
Search before asking
Version
v1.2.629-nightly
What's Wrong?
We have a process to load multiple CSV files concurrently. This currently executes a separate COPY INTO statement for each file once application-side pre-processing of the data has been done. This has been working fine, but we have noticed that, perhaps with some larger files or the latest version of Databend, we now get timeouts reported on the Databend query pod and the imports fail. There are a couple of different timeouts that seem to happen in different locations. Please see the attached logs: copy_into_error_logs.txt
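For context, each file is loaded with its own statement, roughly of the following shape; the table name and stage path here are illustrative placeholders, not the actual statements used:
-- One statement per pre-processed CSV, issued concurrently by the application
COPY INTO target_table
FROM @load_stage/data_0.csv
FILE_FORMAT = (TYPE = CSV);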
How to Reproduce?
I can probably supply the data and COPY INTO statements to reproduce this if required, but I was hoping that the error log might provide some insight. Ultimately there would be potentially 5 COPY INTO statements running, though perhaps not all concurrently at first, since smaller files finish pre-processing sooner. I can turn on more logging and re-run if needed; please advise.
Are you willing to submit PR?