duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License

Queries Hang On Struct Fields #8731

Closed ameijer closed 11 months ago

ameijer commented 1 year ago

What happens?

Hi DuckDB Team,

We have an interesting issue where a query hangs on a seemingly innocuous SELECT call. The database in question has only one column, a map of structs. We cannot determine the cause of this, but can reproduce it with the data we will provide in the reproduction section.

cc @michaelmdresser

To Reproduce

  1. Download and unzip the bugged duckdb file from https://drive.google.com/file/d/1MhqW9jsOtJdvFcAaixzFknn7UkfTLhOP/view?usp=sharing
  2. Open the database in the duckdb CLI
  3. Run SELECT * FROM container_1h
  4. Observe that the query never returns (see the sketch after this list)
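
Condensed into one sketch (the database filename below is hypothetical; substitute whatever the unzipped file is called):

-- $ duckdb bugged.duckdb        (filename hypothetical: open the unzipped database in the CLI)
SELECT * FROM container_1h;      -- hangs and never returns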

OS:

macOS

DuckDB Version:

v0.8.1 6536a77232

DuckDB Client:

duckdb CLI

Full Name:

Alex Meijer

Affiliation:

Kubecost

Have you tried this on the latest main branch?

I have tested with a release build (and could not test with a main build)

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

Tishj commented 1 year ago

Can you provide a SQL query to recreate the db file? Otherwise I think it's hard to rule out user error, a corrupted file, etc.

ameijer commented 1 year ago

@Tishj the EXPORT command hangs as well when we try to export the database

ameijer commented 1 year ago

as does any attempt to export to Parquet or any other format

ameijer commented 1 year ago

We have been able to repeatedly build the database into this state, so I don't think it is any sort of file corruption issue

Tishj commented 1 year ago

If you can deterministically rebuild the database into this state, could you share the steps you have taken to do so? The db file itself is not very helpful for determining the cause of this

michaelmdresser commented 1 year ago

@Tishj While we do our best to put together a reliable reproduction (it's very difficult! We encounter this behavior unpredictably), is there a direction you could point us in if we wanted to inspect database files and determine whether there is corruption? Using the duckdb CLI directly hangs, as noted in the issue. Are there references on the DuckDB encoding format that might give us a lead?

michaelmdresser commented 1 year ago

Some more information about this behavior. I'm working with a slightly different DB file than Alex is, but they have very similar data and exhibit the same behavior.

I run this query: duckdb kc.ddb.write2 -c "WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-08-28 21:00:00 +0000' AND b.WindowEnd = '2023-08-28 22:00:00 +0000') SELECT * FROM scoped"

That terminal is then stuck (hanging), like so:

[screenshot: terminal hanging on the query]

htop reports the process using about one full CPU core (the CPU column reads 118), with near-zero variation in the usage:

[screenshot: htop output showing the duckdb process pinned at ~1 core]

I attached lldb to the process and asked for a backtrace with bt; this is what it shows:

→ lldb -p 77556
(lldb) process attach --pid 77556
Process 77556 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00000001aaf63e28 libsystem_kernel.dylib`__write_nocancel + 8
libsystem_kernel.dylib`__write_nocancel:
->  0x1aaf63e28 <+8>:  b.lo   0x1aaf63e48               ; <+40>
    0x1aaf63e2c <+12>: pacibsp
    0x1aaf63e30 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1aaf63e34 <+20>: mov    x29, sp
Target 0: (duckdb) stopped.
Executable module set to "/opt/homebrew/Cellar/duckdb/0.8.1/bin/duckdb".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aaf63e28 libsystem_kernel.dylib`__write_nocancel + 8
    frame #1: 0x00000001aae667ac libsystem_c.dylib`__swrite + 24
    frame #2: 0x00000001aae486dc libsystem_c.dylib`_swrite + 108
    frame #3: 0x00000001aae4686c libsystem_c.dylib`__sflush + 232
    frame #4: 0x00000001aae4670c libsystem_c.dylib`fflush + 36
    frame #5: 0x0000000104df2430 duckdb`duckdb::ProgressBar::Update(bool) + 172
    frame #6: 0x00000001050e4024 duckdb`duckdb::ClientContext::ExecuteTaskInternal(duckdb::ClientContextLock&, duckdb::PendingQueryResult&) + 104
    frame #7: 0x00000001050e68a0 duckdb`duckdb::PendingQueryResult::ExecuteInternal(duckdb::ClientContextLock&) + 56
    frame #8: 0x00000001050e8e8c duckdb`duckdb::PendingQueryResult::Execute() + 56
    frame #9: 0x00000001050efbb4 duckdb`duckdb::PreparedStatement::Execute(duckdb::vector<duckdb::Value, true>&, bool) + 60
    frame #10: 0x00000001047a6f94 duckdb`duckdb_shell_sqlite3_print_duckbox + 444
    frame #11: 0x000000010479eb98 duckdb`exec_prepared_stmt + 1384
    frame #12: 0x0000000104794110 duckdb`shell_exec + 488
    frame #13: 0x000000010478d160 duckdb`main + 3380
    frame #14: 0x00000001aac47f28 dyld`start + 2236

Other times when I run that command, the progress bar doesn't appear. It looks like this:

[screenshot: hung terminal with no progress bar shown]

And when I hook up lldb:

→ lldb -p 77762
(lldb) process attach --pid 77762
Process 77762 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000010520e45c duckdb`duckdb::PatasScanState<double>::Skip(duckdb::ColumnSegment&, unsigned long long) + 116
duckdb`duckdb::PatasScanState<double>::Skip:
->  0x10520e45c <+116>: subs   x8, x8, #0x1
    0x10520e460 <+120>: b.ne   0x10520e444               ; <+92>
    0x10520e464 <+124>: str    x10, [x19, #0x20]
    0x10520e468 <+128>: str    x11, [x19, #0x30]
Target 0: (duckdb) stopped.
Executable module set to "/opt/homebrew/Cellar/duckdb/0.8.1/bin/duckdb".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000010520e45c duckdb`duckdb::PatasScanState<double>::Skip(duckdb::ColumnSegment&, unsigned long long) + 116
    frame #1: 0x000000010521dbac duckdb`duckdb::ColumnData::ScanVector(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 232
    frame #2: 0x0000000105235d14 duckdb`duckdb::StandardColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 48
    frame #3: 0x0000000105237120 duckdb`duckdb::StructColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 208
    frame #4: 0x0000000105237120 duckdb`duckdb::StructColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 208
    frame #5: 0x0000000105226d68 duckdb`duckdb::ListColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 388
    frame #6: 0x0000000105222d94 duckdb`duckdb::ColumnData::FilterScan(duckdb::TransactionData, unsigned long long, duckdb::ColumnScanState&, duckdb::Vector&, duckdb::SelectionVector&, unsigned long long) + 64
    frame #7: 0x000000010522cef4 duckdb`void duckdb::RowGroup::TemplatedScan<(duckdb::TableScanType)0>(duckdb::TransactionData, duckdb::CollectionScanState&, duckdb::DataChunk&) + 1004
    frame #8: 0x0000000105231ba8 duckdb`duckdb::CollectionScanState::Scan(duckdb::DuckTransaction&, duckdb::DataChunk&) + 92
    frame #9: 0x0000000105249bbc duckdb`duckdb::DataTable::Scan(duckdb::DuckTransaction&, duckdb::DataChunk&, duckdb::TableScanState&) + 36
    frame #10: 0x0000000104e21b14 duckdb`duckdb::TableScanFunc(duckdb::ClientContext&, duckdb::TableFunctionInput&, duckdb::DataChunk&) + 248
    frame #11: 0x0000000105103404 duckdb`duckdb::PhysicalTableScan::GetData(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSourceInput&) const + 60
    frame #12: 0x00000001051d4bd8 duckdb`duckdb::PipelineExecutor::FetchFromSource(duckdb::DataChunk&) + 144
    frame #13: 0x00000001051d49dc duckdb`duckdb::PipelineExecutor::Execute(unsigned long long) + 216
    frame #14: 0x00000001051d86d8 duckdb`duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) + 256
    frame #15: 0x00000001051cd248 duckdb`duckdb::ExecutorTask::Execute(duckdb::TaskExecutionMode) + 36
    frame #16: 0x00000001051d087c duckdb`duckdb::Executor::ExecuteTask() + 140
    frame #17: 0x000000010517bff4 duckdb`duckdb::ClientContext::ExecuteTaskInternal(duckdb::ClientContextLock&, duckdb::PendingQueryResult&) + 56
    frame #18: 0x000000010517e8a0 duckdb`duckdb::PendingQueryResult::ExecuteInternal(duckdb::ClientContextLock&) + 56
    frame #19: 0x0000000105180e8c duckdb`duckdb::PendingQueryResult::Execute() + 56
    frame #20: 0x0000000105187bb4 duckdb`duckdb::PreparedStatement::Execute(duckdb::vector<duckdb::Value, true>&, bool) + 60
    frame #21: 0x000000010483ef94 duckdb`duckdb_shell_sqlite3_print_duckbox + 444
    frame #22: 0x0000000104836b98 duckdb`exec_prepared_stmt + 1384
    frame #23: 0x000000010482c110 duckdb`shell_exec + 488
    frame #24: 0x0000000104825160 duckdb`main + 3380
    frame #25: 0x00000001aac47f28 dyld`start + 2236

This situation is the most reproducible: no progress bar, CPU pegged at exactly one full core, and lldb reporting the process stopped on this instruction:

    frame #0: 0x00000001035a245c duckdb`duckdb::PatasScanState<double>::Skip(duckdb::ColumnSegment&, unsigned long long) + 116
duckdb`duckdb::PatasScanState<double>::Skip:
->  0x1035a245c <+116>: subs   x8, x8, #0x1

I will be looking into duckdb::PatasScanState<double>::Skip but if any DuckDB experts have a better gut feeling, I would appreciate it.

michaelmdresser commented 1 year ago

The type of the column this is hanging on is MAP(VARCHAR, STRUCT(ByteHours DOUBLE, "Cost" DOUBLE)) NOT NULL. We know this because dropping that column resolves the hang.
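
For reference, a hypothetical minimal table with that column shape; only the PVs column name and the struct fields come from the real schema, everything else (including the window column types) is an assumption:

CREATE TABLE container_1h_example (
    WindowStart TIMESTAMP,  -- type assumed; only the column names are known from the WHERE clauses above
    WindowEnd   TIMESTAMP,  -- type assumed
    PVs MAP(VARCHAR, STRUCT(ByteHours DOUBLE, "Cost" DOUBLE)) NOT NULL
);
-- Dropping the map-of-structs column is what resolves the hang on the real table:
ALTER TABLE container_1h_example DROP COLUMN PVs;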

Tishj commented 1 year ago

I managed to open the database in 0.8.1 and export it:

>>> import duckdb 
>>> con = duckdb.connect('tmp/hangingquery.duckdb')
>>> con.execute("export database 'tmp/broken_db' (format parquet)")

Then, on the latest version of DuckDB, I imported it:
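
(The exact import command wasn't shown; presumably something along these lines, using IMPORT DATABASE as the counterpart of the export above. A sketch, not the actual session:)

-- On the newer DuckDB version, against a fresh database:
IMPORT DATABASE 'tmp/broken_db';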

Running the select * from container_1h; query completed successfully.

michaelmdresser commented 1 year ago

@Tishj thank you for continuing to look into this with us. Annoyingly, I'm no longer personally able to reproduce the hang with the original database file that was uploaded to start this issue, but I do have another DB file which is encountering the hang. I'm happy to share this file with you, but I'd rather not do it publicly just in case it contains sensitive information that doesn't belong on the public internet. Let me know if that would be helpful and what the best method to send the file to you is.

The container_1h table which I am querying is built by calling the duckdb CLI with SQL that looks like this:

LOAD 'parquet';
SET preserve_insertion_order=false;
BEGIN;
INSERT INTO container_1h SELECT * FROM read_parquet('file.parquet');
COMMIT;

The Parquet file itself is generated with this Go library, if that has any relevance: https://github.com/segmentio/parquet-go
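
As a sanity check on this ingest path, a sketch of confirming the resulting column types without scanning the data (DESCRIBE only inspects the schema, so it shouldn't trip the hang):

-- What read_parquet infers from the file vs. what the table ends up with:
DESCRIBE SELECT * FROM read_parquet('file.parquet');
DESCRIBE container_1h;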


Metadata:

→ duckdb -version
v0.8.1 6536a77232
→ sw_vers
ProductName:        macOS
ProductVersion:     13.5.1
BuildVersion:       22G90

The DB file that is encountering the hang behaves oddly. Here are some examples:

This succeeds: duckdb kubecost.duckdb.write -c "SELECT * FROM container_1h"


This hangs: duckdb kubecost.duckdb.write -c "WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"


This succeeds: duckdb kubecost.duckdb.write -c "EXPORT DATABASE 'exported_write' (FORMAT PARQUET)"


This also succeeds. It returns a result set, which I have omitted for brevity. It's the same query, but run against the exported Parquet: duckdb -c "IMPORT DATABASE 'exported_write'; WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"


I can try ATTACHing the database file and running the same query. This hangs. duckdb -c "ATTACH 'kubecost.duckdb.write' AS k; WITH scoped AS (SELECT b.* FROM k.container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"


And, finally, the destructive action: dropping the suspected bad column makes the query succeed:

duckdb kubecost.duckdb.write -c "ALTER TABLE container_1h DROP COLUMN PVs; WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"

michaelmdresser commented 1 year ago

I'd really like to find out why the query is hanging and fix it, but the main thing for us is that the query needs to return eventually, whether with an error or a result.

michaelmdresser commented 1 year ago

We've worked around this "hang"/freeze problem by replacing the complex type columns (like MAP(VARCHAR, STRUCT(ByteHours DOUBLE, "Cost" DOUBLE))) with more standard multi-table SQL representations. For my purposes this issue no longer needs to be addressed, but I'm pretty sure the underlying problem isn't fixed. For posterity: we haven't tried to reproduce the hang on DuckDB v0.9.0.
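
A rough sketch of the kind of normalization meant here; apart from the struct field names, every identifier is illustrative rather than our actual schema:

-- One row per map entry in a side table, keyed back to the parent container_1h row,
-- instead of a MAP(VARCHAR, STRUCT(...)) column on container_1h itself.
CREATE TABLE container_1h_pvs (
    ContainerRowID BIGINT,   -- hypothetical key identifying the parent container_1h row
    PVName         VARCHAR,  -- former map key
    ByteHours      DOUBLE,   -- former struct field
    "Cost"         DOUBLE    -- former struct field
);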

michaelmdresser commented 1 year ago

I've just watched Mark's 2023 guest lecture for CMU's course: https://www.youtube.com/watch?v=bZOvAKGkzpQ. At the 31:45 timestamp he mentions that people sometimes report what they think are hangs but which are actually an unseen switch in the DB from a parallel execution model to a single-threaded one. Given my earlier observation of "CPU pegged at 1 full core exactly", I'm wondering if we are unknowingly running into this single-threaded execution transition.

@Tishj does there exist a diagnostic to determine if a query will run (or is currently-running) in a single-threaded execution mode?

Tishj commented 11 months ago

I don't think there exists such a diagnostic, or at least we don't expose it.

If the sink, the source, or any intermediate operator doesn't support parallelism, we don't run the pipeline in parallel.

The check for whether parallelism can be used starts at the source; if the source doesn't implement it, we fall back to single-threaded execution. The source can also limit parallelism to a lower number than the one set by PRAGMA threads.

Once the source has set the parallelism it's comfortable with, we check whether all the regular operators in the pipeline support parallelism; when any of them doesn't, we fall back to single-threaded execution.

(this is pretty much all done in Pipeline::ScheduleParallel)
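
Not a real diagnostic, but one crude way to probe this from SQL is to compare behavior under different thread settings; these are standard DuckDB settings, and whether they change anything for this particular query is only an assumption:

SELECT current_setting('threads');              -- how many threads DuckDB may currently use
SET threads = 1;                                -- force single-threaded execution for comparison
EXPLAIN ANALYZE SELECT * FROM container_1h;     -- per-operator timings, but only if the query returns at all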

michaelmdresser commented 11 months ago

Thank you for the follow-up and the extra detail! I saw https://github.com/duckdb/duckdb/issues/9718 pop up in the issue feed and it sounds somewhat similar, particularly the "pegged cores" behavior I observed when dealing with this issue. Linking it in case it helps with the investigation.

szarnyasg commented 11 months ago

@Tishj does there exist a diagnostic to determine if a query will run (or is currently-running) in a single-threaded execution mode?

Indeed, as @Tishj said, there is no diagnostic tool to determine whether a query will run in single-threaded mode, nor is one planned for DuckDB in the near future.

Regarding the rest of the issue: this is hard to reproduce without a data set (the original Google Drive link is now a 404). Feel free to reopen if the issue reappears.