chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0

Cannot read parquet files from S3 using "*.parquet" #140

Closed neiblegy closed 9 months ago

neiblegy commented 9 months ago

I have 41 Parquet files stored in S3 and need to run this query:

chdb.query(f"select ais_image_path from s3('http://ENDPOINT_URL/BUCKET/KEY_PREFIX/*.parquet', 'USER', 'PWD', Parquet) where ais_image_path = '{path}'", 'Dataframe')

It fails with this error:

Code: 636. DB::Exception: Cannot extract table structure from Parquet format file, because there are no files with provided path in S3 or all files are empty. You must specify table structure manually: Cannot extract table structure from Parquet format file. You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE)

I'm sure all the Parquet files are at the path I gave, and the same files are read correctly when they are local files.

After changing "Dataframe" to "Debug", I got this trace:

2023.12.04 18:20:42.599295 [ 181933 ] {} <Debug> Application: Working directory created: /tmp/clickhouse-local-181933-1701685242-3725016050319400010
Setting up /tmp/clickhouse-local-181933-1701685242-3725016050319400010/tmp/ to store temporary data in it
Added users_xml access storage 'users_xml', path:
00000000-0000-0000-0000-00000002c6ad Authenticating user 'default' from 127.0.0.1:0
00000000-0000-0000-0000-00000002c6ad Authenticated with global context as user 94309d50-4f52-5250-31bd-74fecac179db
00000000-0000-0000-0000-00000002c6ad Creating session context with user_id: 94309d50-4f52-5250-31bd-74fecac179db
Settings: readonly = 0, allow_ddl = true, allow_introspection_functions = true
List of all grants: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
List of all grants including implicit: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
select ais_image_path from s3('http://xxxxxx/sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet', 'xxxxxx', 'xxxxxx', Parquet) where ais_image_path = 'images/sg-11134201-23010-chbgxn0pt0lv12'
00000000-0000-0000-0000-00000002c6ad Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
Settings: readonly = 0, allow_ddl = true, allow_introspection_functions = true
List of all grants: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
List of all grants including implicit: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
(from 0.0.0.0:0, user: ) SELECT ais_image_path FROM s3('http://xxxxxx/sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet', 'xxxxxx', '[HIDDEN]', Parquet) WHERE ais_image_path = 'images/sg-11134201-23010-chbgxn0pt0lv12' (stage: Complete)
Access granted: CREATE TEMPORARY TABLE, S3 ON *.*
2023.12.04 18:20:42.641070 [ 181933 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Trace> S3Client: Provider type: Unknown
2023.12.04 18:20:42.641096 [ 181933 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Trace> S3Client: API mode of the S3 client: AWS
2023.12.04 18:20:42.649473 [ 182237 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Trace> HTTPSessionAdapter: Created HTTP(S) session with ceph-c105-sg-drt-aip.s3.sto.shopee.io:80 (10.188.6.18:80)
Code: 636. DB::Exception: Cannot extract table structure from Parquet format file, because there are no files with provided path in S3 or all files are empty. You must specify table structure manually: Cannot extract table structure from Parquet format file. You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE) (version 23.10.1.1) (from 0.0.0.0:0) (in query: SELECT ais_image_path FROM s3('http://xxxxxx/sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet', 'xxxxxx', '[HIDDEN]', Parquet) WHERE ais_image_path = 'images/sg-11134201-23010-chbgxn0pt0lv12'), Stack trace (when copying this message, always include the lines below):

0. Poco::Exception::Exception(String const&, int) @ 0x0000000019a11479 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000010f7e779 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
2. DB::Exception::Exception<String const&>(int, FormatStringHelperImpl<std::type_identity<String const&>::type>, String const&) @ 0x000000000c2384e3 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
3. DB::(anonymous namespace)::ReadBufferIterator::next() @ 0x0000000016f3f603 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
4. DB::readSchemaFromFormat(String const&, std::optional<DB::FormatSettings> const&, DB::IReadBufferIterator&, bool, std::shared_ptr<DB::Context const>&, std::unique_ptr<DB::ReadBuffer, std::default_delete<DB::ReadBuffer>>&) @ 0x000000001757a6ec in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
5. DB::readSchemaFromFormat(String const&, std::optional<DB::FormatSettings> const&, DB::IReadBufferIterator&, bool, std::shared_ptr<DB::Context const>&) @ 0x000000001757be7f in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
6. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x0000000016f36db1 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
7. DB::StorageS3::StorageS3(DB::StorageS3::Configuration const&, std::shared_ptr<DB::Context const>, DB::StorageID const&, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, String const&, std::optional<DB::FormatSettings>, bool, std::shared_ptr<DB::IAST>) @ 0x0000000016f363a4 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
8. std::shared_ptr<DB::StorageS3> std::allocate_shared[abi:v15000]<DB::StorageS3, std::allocator<DB::StorageS3>, DB::StorageS3::Configuration&, std::shared_ptr<DB::Context const>&, DB::StorageID, DB::ColumnsDescription&, DB::ConstraintsDescription, String, std::nullopt_t const&, void>(std::allocator<DB::StorageS3> const&, DB::StorageS3::Configuration&, std::shared_ptr<DB::Context const>&, DB::StorageID&&, DB::ColumnsDescription&, DB::ConstraintsDescription&&, String&&, std::nullopt_t const&) @ 0x0000000015200ed0 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
9. DB::TableFunctionS3::executeImpl(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const>, String const&, DB::ColumnsDescription, bool) const @ 0x00000000151fbb4b in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
10. DB::ITableFunction::execute(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const>, String const&, DB::ColumnsDescription, bool, bool) const @ 0x000000001547747f in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
11. DB::Context::executeTableFunction(std::shared_ptr<DB::IAST> const&, DB::ASTSelectQuery const*) @ 0x0000000015cfe896 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
12. DB::JoinedTables::getLeftTableStorage() @ 0x00000000165dee61 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
13. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context> const&, std::optional<DB::Pipe>, std::shared_ptr<DB::IStorage> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::PreparedSets>) @ 0x0000000016524a52 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
14. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x0000000016523a97 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
15. DB::InterpreterSelectWithUnionQuery::buildCurrentChildInterpreter(std::shared_ptr<DB::IAST> const&, std::vector<String, std::allocator<String>> const&) @ 0x00000000165c06b2 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
16. DB::InterpreterSelectWithUnionQuery::InterpreterSelectWithUnionQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x00000000165bea5f in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
17. std::__unique_if<DB::InterpreterSelectWithUnionQuery>::__unique_single std::make_unique[abi:v15000]<DB::InterpreterSelectWithUnionQuery, std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>&, DB::SelectQueryOptions const&>(std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>&, DB::SelectQueryOptions const&) @ 0x00000000168d38f7 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
18. DB::InterpreterFactory::get(std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&) @ 0x00000000168d29d5 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
19. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x00000000168b1995 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
20. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x00000000168aeb41 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
21. DB::LocalConnection::sendQuery(DB::ConnectionTimeouts const&, String const&, std::unordered_map<String, String, std::hash<String>, std::equal_to<String>, std::allocator<std::pair<String const, String>>> const&, String const&, unsigned long, DB::Settings const*, DB::ClientInfo const*, bool, std::function<void (DB::Progress const&)>) @ 0x0000000017548050 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
22. DB::ClientBase::processOrdinaryQuery(String const&, std::shared_ptr<DB::IAST>) @ 0x00000000174ef2a4 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
23. DB::ClientBase::processParsedSingleQuery(String const&, String const&, std::shared_ptr<DB::IAST>, std::optional<bool>, bool) @ 0x00000000174edf9a in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
24. DB::ClientBase::executeMultiQuery(String const&) @ 0x00000000174f6f54 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
25. DB::ClientBase::processQueryText(String const&) @ 0x00000000174f7bb7 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
26. DB::ClientBase::runNonInteractive() @ 0x00000000174fa9bb in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
27. DB::LocalServer::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000011004c77 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
28. Poco::Util::Application::run() @ 0x00000000198fd306 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
29. pyEntryClickHouseLocal(int, char**) @ 0x000000001100f13d in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
30. query_stable @ 0x000000001100f44a in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
31. queryToBuffer(String const&, String const&, String const&, String const&) @ 0x000000001bb82cc6 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so

2023.12.04 18:20:42.935293 [ 181933 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Debug> MemoryTracker: Peak memory usage (for query): 74.91 MiB.
00000000-0000-0000-0000-00000002c6ad Logout, user_id: 94309d50-4f52-5250-31bd-74fecac179db
Shutting down UDFs loader
Shutting down named sessions
Shutting down database catalog
Shutting down database INFORMATION_SCHEMA
Shutting down database _local
Shutting down database information_schema
Shutting down system databases
Shutting down DDLWorker
Shutting down caches
2023.12.04 18:20:42.937242 [ 181933 ] {} <Debug> Application: Removing temporary directory: /tmp/clickhouse-local-181933-1701685242-3725016050319400010
Code: 636. DB::Exception: Cannot extract table structure from Parquet format file, because there are no files with provided path in S3 or all files are empty. You must specify table structure manually: Cannot extract table structure from Parquet format file. You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE)
2023.12.04 18:20:42.937474 [ 181933 ] {} <Debug> Application: Uninitializing subsystem: Logging Subsystem

Every "xxxxxx" string in the trace above is a redaction; the actual values were correct.

Environment: x86_64, Python 3.9, chdb 1.0.0, S3 backend: Ceph S3

neiblegy commented 9 months ago

My endpoint is "http://ceph-c105-sg-drt-aip.s3.sto.xxxx.io". It seems there is hard-coded handling of "s3" in the host name, which then treats the wrong part of the URL as the BUCKET.

lmangani commented 9 months ago

ClickHouse offers many settings for the s3 function and the S3 engine that influence the driver and might apply to your case.

auxten commented 9 months ago

This is caused by an S3 implementation that does not fully follow the S3 specification. ClickHouse takes the string before "s3" in the domain as the bucket name. The regex is R"((.+)\.(s3|cos|obs|oss)([.\-][a-z0-9\-.:]+))". In this issue, however, the bucket name is in the path after the domain name.
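To make the mismatch concrete, here is a minimal Python sketch of the behavior described above. It is not ClickHouse's code: it assumes the quoted pattern must match the whole host, and it assumes a path-style fallback (first path segment is the bucket) when the pattern does not match. The host `storage.example.internal` is a hypothetical name used only for contrast.

```python
import re
from urllib.parse import urlsplit

# Pattern quoted in the comment above, transcribed to Python regex syntax.
VIRTUAL_HOSTED_STYLE = re.compile(r"(.+)\.(s3|cos|obs|oss)([.\-][a-z0-9\-.:]+)")

def split_bucket(url):
    """Return (bucket, key). Assumption: the pattern must match the whole host."""
    parts = urlsplit(url)
    match = VIRTUAL_HOSTED_STYLE.fullmatch(parts.netloc)
    if match:
        # Virtual-hosted style: the host prefix is taken as the bucket,
        # and the entire path becomes the key.
        return match.group(1), parts.path.lstrip("/")
    # Assumed path-style fallback: the first path segment is the bucket.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return bucket, key

# Endpoint from this issue: ".s3." in the host triggers virtual-hosted parsing.
reported = split_bucket(
    "http://ceph-c105-sg-drt-aip.s3.sto.xxxx.io/"
    "sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet"
)
# Hypothetical host without ".s3.": the bucket comes from the path instead.
neutral = split_bucket(
    "http://storage.example.internal/"
    "sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet"
)
print(reported)
print(neutral)
```

Under these assumptions the reported endpoint yields the bucket "ceph-c105-sg-drt-aip", so the listing looks in the wrong place and finds no files, which is consistent with the CANNOT_EXTRACT_TABLE_STRUCTURE error.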

Typically, I would not fix this. What's weird is that the official awscli can handle such misconfigured S3 storage when the endpoint and the S3 URL are specified separately, like:

aws s3 ls s3://bucket/datasets/ryan_test/ --endpoint http://some-irrelevant-name.s3.xxx.io

I will check it later.
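The difference with the awscli invocation can be sketched conceptually in Python. This is not awscli's implementation; it only illustrates that when the endpoint and the s3:// URL are passed separately, the bucket comes from the s3:// URL and the endpoint host is never inspected for one. The function name `resolve_separately` and the path-style request layout are assumptions made for illustration.

```python
from urllib.parse import urlsplit

def resolve_separately(endpoint, s3_url):
    """Build a path-style request URL from an endpoint and an s3:// URL."""
    parts = urlsplit(s3_url)
    if parts.scheme != "s3":
        raise ValueError("expected an s3:// URL")
    # Bucket and prefix come only from the s3:// URL.
    bucket, prefix = parts.netloc, parts.path.lstrip("/")
    # The endpoint is used as-is, even if its host contains ".s3.".
    return f"{endpoint.rstrip('/')}/{bucket}/{prefix}"

request_url = resolve_separately(
    "http://some-irrelevant-name.s3.xxx.io",
    "s3://bucket/datasets/ryan_test/",
)
print(request_url)
```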

auxten commented 9 months ago

Won't fix, it's a mis-configured S3 issue.