ameijer closed this issue 11 months ago
Can you provide a SQL query to recreate the DB file? Otherwise I think it's hard to rule out user error, a corrupted file, etc.
@Tishj the EXPORT command hangs as well when we try to export the database, as does any attempt to export to Parquet or any other format.
We have been able to repeatedly build the database into this state, so I don't think it is any sort of file corruption issue.
If you can deterministically rebuild the database into this state, could you share the steps you have taken to do so? The DB file itself is not very helpful for determining the cause of this.
@Tishj While we do our best to put together a reliable reproduction (it's very difficult! we encounter this behavior unpredictably), is there a direction you could point us in if we wanted to inspect database files and determine whether there is corruption? Using the duckdb CLI directly hangs, as noted in the issue. Are there references on the DuckDB storage format that might give us a lead?
Some more information about this behavior. I'm working with a slightly different DB file than Alex is, but they have very similar data and exhibit the same behavior.
I run this query:
duckdb kc.ddb.write2 -c "WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-08-28 21:00:00 +0000' AND b.WindowEnd = '2023-08-28 22:00:00 +0000') SELECT * FROM scoped"
That terminal is then stuck (hanging).
htop reports the process is using about 1 CPU core (the CPU% column sits around 118), with near-zero variation in the usage.
I attached lldb to the process and asked for a backtrace with bt; this is what it shows:
→ lldb -p 77556
(lldb) process attach --pid 77556
Process 77556 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00000001aaf63e28 libsystem_kernel.dylib`__write_nocancel + 8
libsystem_kernel.dylib`:
-> 0x1aaf63e28 <+8>: b.lo 0x1aaf63e48 ; <+40>
0x1aaf63e2c <+12>: pacibsp
0x1aaf63e30 <+16>: stp x29, x30, [sp, #-0x10]!
0x1aaf63e34 <+20>: mov x29, sp
Target 0: (duckdb) stopped.
Executable module set to "/opt/homebrew/Cellar/duckdb/0.8.1/bin/duckdb".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00000001aaf63e28 libsystem_kernel.dylib`__write_nocancel + 8
frame #1: 0x00000001aae667ac libsystem_c.dylib`__swrite + 24
frame #2: 0x00000001aae486dc libsystem_c.dylib`_swrite + 108
frame #3: 0x00000001aae4686c libsystem_c.dylib`__sflush + 232
frame #4: 0x00000001aae4670c libsystem_c.dylib`fflush + 36
frame #5: 0x0000000104df2430 duckdb`duckdb::ProgressBar::Update(bool) + 172
frame #6: 0x00000001050e4024 duckdb`duckdb::ClientContext::ExecuteTaskInternal(duckdb::ClientContextLock&, duckdb::PendingQueryResult&) + 104
frame #7: 0x00000001050e68a0 duckdb`duckdb::PendingQueryResult::ExecuteInternal(duckdb::ClientContextLock&) + 56
frame #8: 0x00000001050e8e8c duckdb`duckdb::PendingQueryResult::Execute() + 56
frame #9: 0x00000001050efbb4 duckdb`duckdb::PreparedStatement::Execute(duckdb::vector<duckdb::Value, true>&, bool) + 60
frame #10: 0x00000001047a6f94 duckdb`duckdb_shell_sqlite3_print_duckbox + 444
frame #11: 0x000000010479eb98 duckdb`exec_prepared_stmt + 1384
frame #12: 0x0000000104794110 duckdb`shell_exec + 488
frame #13: 0x000000010478d160 duckdb`main + 3380
frame #14: 0x00000001aac47f28 dyld`start + 2236
Other times I run that command, the progress bar doesn't appear at all. And when I hook up lldb:
→ lldb -p 77762
(lldb) process attach --pid 77762
Process 77762 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x000000010520e45c duckdb`duckdb::PatasScanState<double>::Skip(duckdb::ColumnSegment&, unsigned long long) + 116
duckdb`duckdb::PatasScanState<double>::Skip:
-> 0x10520e45c <+116>: subs x8, x8, #0x1
0x10520e460 <+120>: b.ne 0x10520e444 ; <+92>
0x10520e464 <+124>: str x10, [x19, #0x20]
0x10520e468 <+128>: str x11, [x19, #0x30]
Target 0: (duckdb) stopped.
Executable module set to "/opt/homebrew/Cellar/duckdb/0.8.1/bin/duckdb".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000010520e45c duckdb`duckdb::PatasScanState<double>::Skip(duckdb::ColumnSegment&, unsigned long long) + 116
frame #1: 0x000000010521dbac duckdb`duckdb::ColumnData::ScanVector(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 232
frame #2: 0x0000000105235d14 duckdb`duckdb::StandardColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 48
frame #3: 0x0000000105237120 duckdb`duckdb::StructColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 208
frame #4: 0x0000000105237120 duckdb`duckdb::StructColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 208
frame #5: 0x0000000105226d68 duckdb`duckdb::ListColumnData::ScanCount(duckdb::ColumnScanState&, duckdb::Vector&, unsigned long long) + 388
frame #6: 0x0000000105222d94 duckdb`duckdb::ColumnData::FilterScan(duckdb::TransactionData, unsigned long long, duckdb::ColumnScanState&, duckdb::Vector&, duckdb::SelectionVector&, unsigned long long) + 64
frame #7: 0x000000010522cef4 duckdb`void duckdb::RowGroup::TemplatedScan<(duckdb::TableScanType)0>(duckdb::TransactionData, duckdb::CollectionScanState&, duckdb::DataChunk&) + 1004
frame #8: 0x0000000105231ba8 duckdb`duckdb::CollectionScanState::Scan(duckdb::DuckTransaction&, duckdb::DataChunk&) + 92
frame #9: 0x0000000105249bbc duckdb`duckdb::DataTable::Scan(duckdb::DuckTransaction&, duckdb::DataChunk&, duckdb::TableScanState&) + 36
frame #10: 0x0000000104e21b14 duckdb`duckdb::TableScanFunc(duckdb::ClientContext&, duckdb::TableFunctionInput&, duckdb::DataChunk&) + 248
frame #11: 0x0000000105103404 duckdb`duckdb::PhysicalTableScan::GetData(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSourceInput&) const + 60
frame #12: 0x00000001051d4bd8 duckdb`duckdb::PipelineExecutor::FetchFromSource(duckdb::DataChunk&) + 144
frame #13: 0x00000001051d49dc duckdb`duckdb::PipelineExecutor::Execute(unsigned long long) + 216
frame #14: 0x00000001051d86d8 duckdb`duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) + 256
frame #15: 0x00000001051cd248 duckdb`duckdb::ExecutorTask::Execute(duckdb::TaskExecutionMode) + 36
frame #16: 0x00000001051d087c duckdb`duckdb::Executor::ExecuteTask() + 140
frame #17: 0x000000010517bff4 duckdb`duckdb::ClientContext::ExecuteTaskInternal(duckdb::ClientContextLock&, duckdb::PendingQueryResult&) + 56
frame #18: 0x000000010517e8a0 duckdb`duckdb::PendingQueryResult::ExecuteInternal(duckdb::ClientContextLock&) + 56
frame #19: 0x0000000105180e8c duckdb`duckdb::PendingQueryResult::Execute() + 56
frame #20: 0x0000000105187bb4 duckdb`duckdb::PreparedStatement::Execute(duckdb::vector<duckdb::Value, true>&, bool) + 60
frame #21: 0x000000010483ef94 duckdb`duckdb_shell_sqlite3_print_duckbox + 444
frame #22: 0x0000000104836b98 duckdb`exec_prepared_stmt + 1384
frame #23: 0x000000010482c110 duckdb`shell_exec + 488
frame #24: 0x0000000104825160 duckdb`main + 3380
frame #25: 0x00000001aac47f28 dyld`start + 2236
This situation is the most reproducible: no progress bar, CPU pegged at exactly 1 full core, and lldb reporting that we are on this instruction:
frame #0: 0x00000001035a245c duckdb`duckdb::PatasScanState<double>::Skip(duckdb::ColumnSegment&, unsigned long long) + 116
duckdb`duckdb::PatasScanState<double>::Skip:
-> 0x1035a245c <+116>: subs x8, x8, #0x1
I will be looking into duckdb::PatasScanState<double>::Skip, but if any DuckDB experts have a better gut feeling, I would appreciate it.
The type of the column this is hanging on is MAP(VARCHAR, STRUCT(ByteHours DOUBLE, "Cost" DOUBLE)) NOT NULL. We know this because dropping that column resolves the hang.
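If it helps anyone narrow down a similar hang, one check is to run the same filtered scan against only the suspect column; a hypothetical variant of the query above:
-- hypothetical isolation query: same predicate as above, selecting only the suspect PVs column
SELECT b.PVs FROM container_1h b WHERE b.WindowStart = '2023-08-28 21:00:00 +0000' AND b.WindowEnd = '2023-08-28 22:00:00 +0000';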
I managed to open the database in 0.8.1 and export it:
>>> import duckdb
>>> con = duckdb.connect('tmp/hangingquery.duckdb')
>>> con.execute("export database 'tmp/broken_db' (format parquet)")
Then on the latest version of DuckDB I imported it.
Running the select * from container_1h; query completed successfully.
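For completeness, the import side was roughly the standard round trip (path assumed to match the export above):
-- roughly the import step on the newer build; the path is an assumption based on the export above
IMPORT DATABASE 'tmp/broken_db';
SELECT * FROM container_1h;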
@Tishj thank you for continuing to look into this with us. Annoyingly, I'm no longer personally able to reproduce the hang with the original database file that was uploaded to start this issue, but I do have another DB file which is encountering the hang. I'm happy to share this file with you, but I'd rather not do it publicly just in case it contains sensitive information that doesn't belong on the public internet. Let me know if that would be helpful and what the best method to send the file to you is.
The container_1h table which I am querying is built by calling the duckdb CLI with SQL that looks like this:
LOAD 'parquet';
SET preserve_insertion_order=false;
BEGIN;
INSERT INTO container_1h SELECT * FROM read_parquet('file.parquet');
COMMIT;
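For reference, the relevant part of the container_1h schema looks roughly like this (only the columns mentioned in this issue; the rest are omitted and the timestamp types are illustrative):
CREATE TABLE IF NOT EXISTS container_1h (
    WindowStart TIMESTAMP, -- illustrative type; the queries above compare it against strings like '2023-08-28 21:00:00 +0000'
    WindowEnd TIMESTAMP, -- illustrative type
    PVs MAP(VARCHAR, STRUCT(ByteHours DOUBLE, "Cost" DOUBLE)) NOT NULL -- the column the scan hangs on
    -- ...remaining columns omitted
);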
The Parquet file itself is generated with this Go library, if that has any relevance: https://github.com/segmentio/parquet-go
Metadata:
→ duckdb -version
v0.8.1 6536a77232
→ sw_vers
ProductName: macOS
ProductVersion: 13.5.1
BuildVersion: 22G90
This DB file which is encountering the hang behaves oddly. Here are some examples:
This succeeds:
duckdb kubecost.duckdb.write -c "SELECT * FROM container_1h"
This hangs:
duckdb kubecost.duckdb.write -c "WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"
This succeeds:
duckdb kubecost.duckdb.write -c "EXPORT DATABASE 'exported_write' (FORMAT PARQUET)"
This also succeeds. It returns a result set, which I have removed for brevity. It's the same query, but run against the exported Parquet files.
duckdb -c "IMPORT DATABASE 'exported_write'; WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"
I can try ATTACHing the database file and running the same query. This hangs.
duckdb -c "ATTACH 'kubecost.duckdb.write' AS k; WITH scoped AS (SELECT b.* FROM k.container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"
And, finally, the destructive option, which drops the supposedly bad column but makes the query succeed:
duckdb kubecost.duckdb.write -c "ALTER TABLE container_1h DROP COLUMN PVs; WITH scoped AS (SELECT b.* FROM container_1h b WHERE b.WindowStart = '2023-09-19 21:00:00 +0000' AND b.WindowEnd = '2023-09-19 22:00:00 +0000') SELECT * FROM scoped"
I'd really like to find out why the query is hanging and fix it, but for us the main thing is that we need the query to return eventually, either with an error or success.
We've worked around this "hang"/freeze problem by replacing the complex-type columns (like MAP(VARCHAR, STRUCT(ByteHours DOUBLE, "Cost" DOUBLE))) with more standard multi-table SQL representations. For my purposes this issue no longer needs to be addressed, but I'm pretty sure the underlying problem isn't fixed. For posterity, we haven't tried to reproduce the hang in v0.9.0 of DuckDB.
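Roughly, the workaround looks like this (illustrative names only; the key back to the parent row is hypothetical):
-- illustrative sketch of the workaround: the MAP(VARCHAR, STRUCT(...)) column is replaced
-- by a child table with one row per map entry, keyed back to the parent row
CREATE TABLE container_1h_pvs (
    ContainerRowID BIGINT, -- hypothetical key referencing the parent container_1h row
    PVName VARCHAR, -- former map key
    ByteHours DOUBLE, -- former struct field
    "Cost" DOUBLE -- former struct field
);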
I've just watched Mark's 2023 guest lecture for CMU's course: https://www.youtube.com/watch?v=bZOvAKGkzpQ. At the 31:45 timestamp he mentions that people sometimes report what they think are hangs but which are actually an unseen switch in the DB from a parallel execution model to a single-threaded one. Given my earlier observation of "CPU pegged at 1 full core exactly", I'm wondering if we are somehow running into this single-threaded execution transition unknowingly.
@Tishj does there exist a diagnostic to determine if a query will run (or is currently-running) in a single-threaded execution mode?
I don't think such a diagnostic exists, or at least we don't expose it.
If the sink, the source, or any intermediate operator doesn't support parallelism, we don't run the pipeline in parallel.
The check for whether parallelism can be used starts at the source; if the source doesn't implement it, we fall back to single-threaded execution. The source can also limit parallelism to a lower number than the one set by pragma threads.
Once the source has set the parallelism it's comfortable with, we check all the regular operators in the pipeline for whether they support parallelism; when any of them don't, we fall back to single-threaded execution.
(This is pretty much all done in Pipeline::ScheduleParallel.)
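For anyone following along, you can at least inspect and adjust the upper bound that this decision is capped by; note that this only shows the configured limit, not whether a particular pipeline actually ran in parallel:
SELECT current_setting('threads'); -- the configured worker-thread limit
SET threads TO 4; -- change the limit for subsequent queries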
Thank you for the follow-up and extra detail! I saw https://github.com/duckdb/duckdb/issues/9718 pop up in the issue feed and it sounds somewhat similar, particularly the "pegged cores" behavior I observed when dealing with this issue. Linking it in case it helps with the investigation.
@Tishj does there exist a diagnostic to determine if a query will run (or is currently-running) in a single-threaded execution mode?
Indeed, as @Tishj said, there is no diagnostic tool to determine whether a query will run in single-threaded mode, nor is one planned for DuckDB in the near future.
Regarding the rest of the issue: this is hard to reproduce without a data set (the original GDrive link is a 404 now). Feel free to reopen if the issue reappears.
What happens?
Hi DuckDB Team,
We have an interesting issue where a query hangs on a seemingly innocuous SELECT call. The database in question has only one column, a map of structs. We cannot determine the cause of this, but can reproduce it with the data we will provide in the reproduction section.
cc @michaelmdresser
To Reproduce
SELECT * FROM container_1h
OS:
macOS
DuckDB Version:
v0.8.1 6536a77232
DuckDB Client:
duckdb CLI
Full Name:
Alex Meijer
Affiliation:
Kubecost
Have you tried this on the latest main branch?
I have tested with a release build (and could not test with a main build)
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?