databendlabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

bug: Timeouts when processing COPY INTO statements #16396

Closed: rad-pat closed this issue 2 weeks ago

rad-pat commented 3 weeks ago

Search before asking

Version

v1.2.629-nightly

What's Wrong?

We have a process that loads multiple CSV files concurrently. It currently executes a separate COPY INTO statement for each file once application-side pre-processing of the data has completed. This has been working fine, but we have noticed that, perhaps with some larger files or with the latest version of Databend, we now get timeouts reported on the Databend query pod and the imports fail. There are a couple of different timeouts that seem to happen in different locations. Please see the attached logs: copy_into_error_logs.txt
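For context, each load is a single per-file statement roughly like the following (a minimal sketch only; the table, stage, and file names are placeholders rather than our real schema):

    -- One statement per pre-processed CSV file; several of these run concurrently.
    COPY INTO my_table
    FROM @my_stage/load_temp/
    FILES = ('data_0.csv')
    FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1)
    PURGE = TRUE;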

How to Reproduce?

I can probably supply the data and COPY INTO statements to reproduce this if required, but I was hoping the error log might provide some insight. Ultimately there could be up to 5 COPY INTO statements running, though perhaps not all concurrently at first, since the smaller files finish pre-processing sooner. I can turn on more logging and re-run if required; please advise.

Are you willing to submit PR?

dosubot[bot] commented 3 weeks ago

I found some relevant information that might help address the timeouts you're experiencing with COPY INTO statements in Databend v1.2.629-nightly.

Similar Issues

There are known issues with concurrent execution of COPY INTO statements in Databend v1.2.629-nightly. For example, multiple concurrent COPY INTO operations from a stage into a table can fail with errors and leave the target table empty [1][2][3].

Potential Causes and Solutions

  1. max_files Setting: The timeouts could be due to the max_files setting in the COPY INTO statement. If max_files is set too high, it can exceed the COPY_MAX_FILES_PER_COMMIT limit, causing performance issues and potential timeouts. Ensure that max_files is set to a value below COPY_MAX_FILES_PER_COMMIT (15,000 files) to avoid these timeouts [4][5] (see the sketch after this list).

  2. Resource Management: Adjusting parameters like max_threads and max_memory_usage can help manage resource usage effectively. For example:

    echo "set max_threads=4; set max_memory_usage=256000000; \
      copy /*+ set_var(max_threads=4) set_var(max_memory_usage=256000000) */ \
      into @s1/ from (select * from numbers(60000000)) \
      max_file_size=64000000 detailed_output=true;" \
      | $BENDSQL_CLIENT_CONNECT | wc -l | sed 's/ //g'

    This example limits the number of threads and memory usage to optimize performance [3].

  3. Distributed Copy: Enabling distributed copy operations can help distribute the load and potentially reduce timeouts. You can set the enable_distributed_copy_into parameter:

    set enable_distributed_copy_into = 1;

    This setting can help manage distributed copy operations effectively [1].
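To illustrate point 1 concretely, the per-batch file limit can be set directly on the statement (a sketch only; the table and stage names are placeholders):

    -- Keep each COPY INTO batch well below COPY_MAX_FILES_PER_COMMIT (15,000 files).
    COPY INTO my_table
    FROM @my_stage
    PATTERN = '.*[.]csv'
    FILE_FORMAT = (TYPE = CSV)
    MAX_FILES = 5000;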

Error Codes

The specific error messages or codes related to timeouts in the Databend query pod logs are:

These error codes indicate timeout-related issues in the system [6].

By fine-tuning these parameters and settings, you can reduce the likelihood of timeouts and improve the efficiency of your COPY INTO operations. If you need further assistance, providing the data and COPY INTO statements to reproduce the issue could be helpful.

To continue talking to Dosu, mention @dosu.

wubx commented 2 weeks ago

cc @Xuanwo PTAL. The error log file contains many io timeout entries:

b0389480-6f16-4d73-8b99-c434aa573c3e 2024-09-05T05:51:11.262697Z ERROR opendal::layers::logging: logging.rs:190 service=gcs name=jglll-pctb-temp-asp-solution path=data_0.csv: read failed Unexpected (persistent) at read => io timeout reached

b0389480-6f16-4d73-8b99-c434aa573c3e 2024-09-05T05:51:11.263285Z ERROR databend_query::pipelines::builders::builder_on_finished: builder_on_finished.rs:78 copy failed, reason: StorageOther. Code: 4000, Text = operation interrupted (Unexpected (persistent) at read, context: { timeout: 60 } => io timeout reached).

149e1116-6ed3-47e6-8ce4-db8314268903 2024-09-05T05:51:42.061686Z  WARN opendal::layers::retry: retry.rs:260 will retry after 1.513893843s because: Unexpected (temporary) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media, called: http_util::Client::send, service: gcs, path: 113320/115525/_b/2928b3c1075d4da99bfff5312f97fa76_v2.parquet, range: 97946-98105 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media)
149e1116-6ed3-47e6-8ce4-db8314268903 2024-09-05T05:51:42.061771Z  WARN opendal::layers::retry: retry.rs:260 will retry after 1.307216406s because: Unexpected (temporary) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media, called: http_util::Client::send, service: gcs, path: 113320/115525/_b/2928b3c1075d4da99bfff5312f97fa76_v2.parquet, range: 98369-105868 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media)
149e1116-6ed3-47e6-8ce4-db8314268903 2024-09-05T05:51:42.061828Z  WARN opendal::layers::retry: retry.rs:260 will retry after 1.608490825s because: Unexpected (temporary) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media, called: http_util::Client::send, service: gcs, path: 113320/115525/_b/2928b3c1075d4da99bfff5312f97fa76_v2.parquet, range: 4-97848 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-starrocks-asp-solution/o/113320%2F115525%2F_b%2F2928b3c1075d4da99bfff5312f97fa76_v2.parquet?alt=media)
Xuanwo commented 2 weeks ago

Hi @rad-pat, it seems the server's network connection is very slow, causing connection attempts and DNS resolution to fail.

1f32988b-3a43-4c33-82d9-7175f014fb5d 2024-09-05T05:53:46.457720Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"1f32988b-3a43-4c33-82d9-7175f014fb5d", processor id: NodeIndex(0), name: "BytesReader", elapsed: 155.043454247s, active sync workers: 0
1f32988b-3a43-4c33-82d9-7175f014fb5d 2024-09-05T05:53:49.313028Z ERROR opendal::layers::logging: logging.rs:190 service=gcs name=jglll-pctb-temp-asp-solution path=data_0.csv: read failed Unexpected (persistent) at read => send http request

Context:
   url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media
   called: http_util::Client::send
   service: gcs
   path: data_0.csv
   range: 0-8388607

Source:
   error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media): client error (Connect): operation timed out

1f32988b-3a43-4c33-82d9-7175f014fb5d 2024-09-05T05:53:49.313657Z ERROR databend_query::pipelines::builders::builder_on_finished: builder_on_finished.rs:78 copy failed, reason: StorageOther. Code: 4000, Text = operation interrupted (Unexpected (persistent) at read, context: { url: https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media, called: http_util::Client::send, service: gcs, path: data_0.csv, range: 0-8388607 } => send http request, source: error sending request for url (https://storage.googleapis.com/storage/v1/b/jglll-pctb-temp-asp-solution/o/load_temp%2Fcb85af04-20d9-46de-8cc9-3bdfffedc99d%2Faa23c5b3-20dc-4b1d-a0d9-451913db3a7d%2Fanalyzetable_07d3192c-ef35-4ed8-bf95-b399e4edd97c%2Fbecf4387-03b6-4076-86e3-bab36bcf900a%2Fdata_0.csv?alt=media)).
2024-09-05T05:53:49.316210Z  WARN hickory_proto::xfer::dns_exchange: dns_exchange.rs:215 failed to associate send_message response to the sender

Please check whether the server's network capacity has been reached. Databend is extremely fast and can use up to 100 Gbps when reading large files concurrently.
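If it helps, here is a rough way to check from the query node while a COPY INTO is running (a sketch using standard Linux tools; the bucket/object placeholders are assumptions, adjust for your environment):

    # Sample per-interface throughput once per second; sustained rates near the
    # node limit (roughly vCPUs x 2 Gbps on GCE) suggest network saturation.
    sar -n DEV 1

    # Check that DNS resolution of the object-store endpoint stays fast under load.
    dig storage.googleapis.com | grep 'Query time'

    # Time a small ranged read to separate slow connects from slow transfers.
    curl -o /dev/null -s -w 'connect=%{time_connect}s total=%{time_total}s\n' \
      -r 0-1048575 "https://storage.googleapis.com/<bucket>/<object>"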

Xuanwo commented 2 weeks ago

The error log file contains many io timeout entries:

Hi @wubx, thanks for helping with the debugging. Please note that an I/O timeout is a result, not a cause. When Databend raises a timeout, it simply indicates that something is slow; it doesn't mean Databend did anything wrong or unexpected.

inviscid commented 2 weeks ago

We were using 8 vCPU machines and bumped them up to 48 vCPU machines. Google allocates about 2 Gbps per vCPU, so this should get us to the node max. After increasing the node size, the imports seem to be working properly now.
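For rough numbers: 8 vCPUs at about 2 Gbps each gives roughly 16 Gbps, while 48 vCPUs gives roughly 96 Gbps, which is close to the ~100 Gbps Databend can drive when reading large files concurrently, so the smaller nodes were likely saturating their network link.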

It is interesting that the nodes don't seem to have any kind of network prioritization for critical items like DNS. It's not a Databend problem; it just seems like Google would have set some sane priority defaults for critical network functions.

rad-pat commented 2 weeks ago

@Xuanwo, @wubx Thanks for helping us debug this issue and get it resolved. I'll close this now as it is not a Databend issue.