datafuselabs / databend

๐——๐—ฎ๐˜๐—ฎ, ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ & ๐—”๐—œ. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

feat(hive) add support to query simple hive table #5895

Closed FANNG1 closed 1 year ago

FANNG1 commented 1 year ago

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

add support to query simple hive table

Changelog

  1. Get the table location from the Hive metastore.
  2. List the files under the location path and create a HivePartInfo for each of them.
  3. On the executor side, create a HiveParquetBlockReader to read the hive files described by the HivePartInfo (a sketch of this flow follows below).
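
A minimal, self-contained sketch of that three-step flow is below. The type and function names (HivePartInfo, HiveParquetBlockReader, table_location_from_metastore, build_parts) are simplified stand-ins for the actual databend implementations, and the metastore call is stubbed out, so treat this as an illustration rather than the real code path.

```rust
use std::path::{Path, PathBuf};

/// One unit of work handed to an executor: a single data file of the hive table.
#[derive(Debug, Clone)]
struct HivePartInfo {
    filename: String,
}

/// Step 1 (stubbed): the table location as fetched from the Hive metastore.
fn table_location_from_metastore(_db: &str, _table: &str) -> PathBuf {
    PathBuf::from("/user/hive/warehouse/pokes/")
}

/// Step 2: list the files under the location and wrap each one as a partition.
fn build_parts(location: &Path) -> std::io::Result<Vec<HivePartInfo>> {
    let mut parts = Vec::new();
    for entry in std::fs::read_dir(location)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            parts.push(HivePartInfo {
                filename: entry.path().to_string_lossy().into_owned(),
            });
        }
    }
    Ok(parts)
}

/// Step 3: on the executor side, a reader consumes each partition.
struct HiveParquetBlockReader;

impl HiveParquetBlockReader {
    fn read(&self, part: &HivePartInfo) {
        // The real reader decodes Parquet row groups into data blocks;
        // here we only show where the partition info is consumed.
        println!("reading hive parquet file: {}", part.filename);
    }
}

fn main() -> std::io::Result<()> {
    let location = table_location_from_metastore("default", "pokes");
    let parts = build_parts(&location)?;
    let reader = HiveParquetBlockReader;
    for part in &parts {
        reader.read(part);
    }
    Ok(())
}
```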

Related Issues

4826

vercel[bot] commented 1 year ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
databend ✅ Ready (Inspect) Visit Preview Jun 20, 2022 at 1:34AM (UTC)

mergify[bot] commented 1 year ago

Thanks for the contribution! I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

BohuTANG commented 1 year ago

@mergify update

mergify[bot] commented 1 year ago

update

✅ Branch has been successfully updated

dantengsky commented 1 year ago

Hi @sandflee, really appreciate your contribution!!

  1. Noticed that two extra fields partition_keys and location have been introduced to the struct TableMeta

https://github.com/datafuselabs/databend/blob/ec5d0ddd04c7c4eaf2b0bcf8c71dbeb91135a4b2/common/meta/app/src/schema/table.rs#L190-L191

It would be better if we put them into TableMeta::engine_options, since they are hive table-specific options; the RepoIssuesTable may be a good reference (a sketch of this approach follows at the end of this comment):

https://github.com/datafuselabs/databend/blob/edb9d1346b63247e6c6a84144295955a3b99eee7/query/src/storages/github/repo_issues_table.rs#L42-L44

https://github.com/datafuselabs/databend/blob/edb9d1346b63247e6c6a84144295955a3b99eee7/query/src/storages/github/repo_issues_table.rs#L65-L68

  2. About the test case of the GH action test_stateful_hive_standalone

    Hope I get it right: in your local test setup, the databend-query process can access the FS used by Hive/Hadoop directly (the "... user/hive/warehouse/pokes/ ..."), but for CI, the warehouse data path is "inside" the docker container, and thus the integration test failed. We can ignore this integration test failure in this PR, since the hive table should later use the HDFS storage DAL to access the data anyway; we need not consolidate all the functionality in one PR :)
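
For point 1, here is a minimal sketch of the suggested approach, assuming engine_options behaves like a plain string-to-string map as it does for other table engines; the struct shape and option key names below are illustrative, not the actual databend definitions:

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the real TableMeta: hive-specific fields live in
/// engine_options instead of being added as new struct fields.
#[derive(Debug, Default)]
struct TableMeta {
    engine: String,
    engine_options: BTreeMap<String, String>,
}

fn hive_table_meta(location: &str, partition_keys: &[&str]) -> TableMeta {
    let mut engine_options = BTreeMap::new();
    engine_options.insert("location".to_string(), location.to_string());
    engine_options.insert("partition_keys".to_string(), partition_keys.join(","));
    TableMeta {
        engine: "hive".to_string(),
        engine_options,
    }
}

fn main() {
    let meta = hive_table_meta("/user/hive/warehouse/pokes/", &["ds"]);
    // The table implementation reads the options back when building the scan.
    println!("{} table options: {:?}", meta.engine, meta.engine_options);
}
```

The upside of this shape is that TableMeta itself stays engine-agnostic; the cost is that typed data (such as the partition key list) has to be encoded and parsed as strings.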

FANNG1 commented 1 year ago

@dantengsky, thanks for your advice!
1. HiveTableOptions is a good idea; added to the latest code (a sketch follows below). 2. I could query the data successfully in a non-docker env, and it also failed in the docker env, but the error looks like "connect to datanode failed", not the same as the test error, so I'm not sure whether it's the same problem.
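
A minimal sketch of what such a dedicated HiveTableOptions could look like; the fields mirror the partition_keys and location discussed above, but the actual struct added to the code may differ in fields and derives:

```rust
/// Hypothetical hive-specific table options, kept separate from the generic
/// table metadata so that non-hive tables are unaffected.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct HiveTableOptions {
    /// Partition columns of the hive table, if any.
    partition_keys: Option<Vec<String>>,
    /// Warehouse location of the table data, if known.
    location: Option<String>,
}

fn main() {
    let opts = HiveTableOptions {
        partition_keys: Some(vec!["ds".to_string()]),
        location: Some("/user/hive/warehouse/pokes/".to_string()),
    };
    println!("hive table options: {:?}", opts);
}
```

Compared with stuffing strings into engine_options, a typed struct keeps parsing in one place and makes the options self-documenting.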

BohuTANG commented 1 year ago

Hi @sandflee

It's great to see you working on hive support. Thank you! It would be great if we added some SQL tests to the hive test suite file: https://github.com/datafuselabs/databend/blob/main/tests/suites/2_stateful_hive/00_basics/00_0000_hms_basics.sql

FANNG1 commented 1 year ago

Yes, adding SQL tests is very useful and necessary. I tried some methods to make it work with the docker env but failed on my local machine. The main reasons are: 1. hdfs is deployed in the docker env, on docker's internal network, and the datanode IP is a docker-internal IP; 2. databend is deployed on the host machine and couldn't communicate with the datanode.

dantengsky commented 1 year ago

Yes, adding SQL tests is very useful and necessary. I tried some methods to make it work with the docker env but failed on my local machine. The main reasons are: 1. hdfs is deployed in the docker env, on docker's internal network, and the datanode IP is a docker-internal IP; 2. databend is deployed on the host machine and couldn't communicate with the datanode.

@sandflee Roger

please feel free to modify the docker script at

https://github.com/datafuselabs/databend/tree/main/docker/it-hive

if necessary, e.g. export the necessary ports in the script: https://github.com/datafuselabs/databend/blob/main/docker/it-hive/hive-docker-compose.yml.

But it looks like the ports of the datanode and namenode are accessible (I might be completely wrong, since I do not know the exact setup of your local dev environment):

home-dev:~/workspace/fuse-query$ docker-compose -f "./docker/it-hive/hive-docker-compose.yml" up -d
Recreating it-hive_namenode_1                  ... done
Recreating it-hive_datanode_1                  ... done
Recreating it-hive_hive-metastore-postgresql_1 ... done
Recreating it-hive_hive-server_1               ... done
Recreating it-hive_hive-metastore_1            ... done

home-dev:~/workspace/fuse-query$ sudo netstat -ntlp | grep -e 50070 -e 50075
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      189119/docker-proxy
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      189437/docker-proxy
tcp6       0      0 :::50075                :::*                    LISTEN      189125/docker-proxy
tcp6       0      0 :::50070                :::*                    LISTEN      189444/docker-proxy

home-dev:~/workspace/fuse-query$ telnet localhost 50075
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
^]
telnet> Connection closed.
home-dev:~/workspace/fuse-query$ telnet localhost 50070
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
^]
telnet> Connection closed.

And as you might already know, docker-proxy might start listening on the ports in advance, before the Java processes are ready; in this scenario, connecting to the service ports will yield transient connection failures.

FANNG1 commented 1 year ago

Thanks @dantengsky, the ports of my local hdfs conflict with the hdfs in docker :( I'll try to run databend inside the hive container, since it needs the java and hadoop env, like this: docker exec -it -v $databend_dir:/databend /bin/start_databend.sh

dantengsky commented 1 year ago

cc @Xuanwo

We have an issue here in the CI setup of hdfs storage, would you please give us some hints?

Caused by:
  process didn't exit successfully: `/workspace/target/debug/build/hdfs-sys-637586f141ce0f71/build-script-build` (exit status: 101)
  --- stderr
  thread 'main' panicked at 'JAVA_HOME must be set: NotPresent', /opt/rust/cargo/registry/src/github.com-1ecc6299db9ec823/hdfs-sys-0.2.0/build.rs:10:43
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

https://github.com/datafuselabs/databend/runs/6892054183?check_suite_focus=true#step:3:1109
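
For context, here is a guess at the shape of the failing build step, based only on the panic message in the log above (the expect message matches "JAVA_HOME must be set: NotPresent"); the real hdfs-sys build script certainly does more than this, and the linker directives below are only the usual way a build script locates libjvm on a Linux JDK:

```rust
// build.rs (sketch)
use std::env;

fn main() {
    // This is the kind of call that produces the panic seen in the CI log:
    // reading JAVA_HOME fails with VarError::NotPresent when the variable
    // is not exported in the build environment.
    let java_home = env::var("JAVA_HOME").expect("JAVA_HOME must be set");

    // With a JDK available, the build script can point the linker at libjvm,
    // which typically lives under $JAVA_HOME/lib/server on Linux JDK 11+.
    println!("cargo:rustc-link-search=native={}/lib/server", java_home);
    println!("cargo:rustc-link-lib=dylib=jvm");
}
```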

Xuanwo commented 1 year ago

We need to set up java runtime to make them work like https://github.com/datafuselabs/opendal/blob/main/.github/workflows/service_test_hdfs.yml#L20-L28

But I don't know if this action works under our build tool.

Ping @everpcpc for a look.

FANNG1 commented 1 year ago

I added the java env, but it seems JAVA_HOME is not set. https://github.com/datafuselabs/databend/pull/5895/commits/3b3fe5f9a96ca06a7bbddb0a760fdddb7bd2c837

everpcpc commented 1 year ago

We need to set up java runtime to make them work like https://github.com/datafuselabs/opendal/blob/main/.github/workflows/service_test_hdfs.yml#L20-L28

But I don't know if this action works under our build tool.

Ping @everpcpc for a look.

https://github.com/datafuselabs/databend/blob/main/scripts/setup/dev_setup.sh#L491 JDK is installed here.

dantengsky commented 1 year ago

We need to set up java runtime to make them work like https://github.com/datafuselabs/opendal/blob/main/.github/workflows/service_test_hdfs.yml#L20-L28

But I don't know if this action works under our build tool.

It seems to be working:

https://github.com/datafuselabs/databend/runs/6892054183?check_suite_focus=true#step:3:118

Run actions/setup-java@v3
Trying to resolve the latest version from remote
Resolved latest version as 11.0.15+10
Trying to download...
Downloading Java 11.0.15+10 (Temurin-Hotspot) from https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.15%2B10/OpenJDK11U-jdk_x64_linux_hotspot_11.0.15_10.tar.gz ...
Extracting Java archive...
/usr/bin/tar xz --warning=no-unknown-keyword -C /runner/_work/_temp/db0bb02a-da8b-42da-ad93-155bd5fb5572 -f /runner/_work/_temp/b5fa93dd-7cb8-4c4a-ac8a-50c236cd6be5
Java 11.0.15+10 was downloaded
Setting Java 11.0.15+10 as the default
Java configuration:
  Distribution: temurin
  Version: 11.0.15+10
  Path: /opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/11.0.15-10/x64
Creating settings.xml with server-id: github
Writing to /home/runner/.m2/settings.xml

Will the above action also set JAVA_HOME?

BohuTANG commented 1 year ago

@mergify update

mergify[bot] commented 1 year ago

update

✅ Branch has been successfully updated

dantengsky commented 1 year ago

@mergify update

mergify[bot] commented 1 year ago

update

✅ Branch has been successfully updated

dantengsky commented 1 year ago

@mergify update

mergify[bot] commented 1 year ago

update

✅ Branch has been successfully updated

dantengsky commented 1 year ago

Hi @sandflee,

The docker image for building has been updated; the build_linux_hive action can now be executed successfully.

But we still have to tweak test_stateful_hive_standalone a little bit to meet the runtime dependencies of the hdfs storage layer and make it work.

The basic idea is, in the stateful hive test environment, we need to set up a JDK, and export the necessary env vars, so that databend-query with feature storage-hdfs enabled may find the required runtime dependencies.

And I think we only need to install a JDK and export LD_LIBRARY_PATH properly at this stage (later, if files are accessed via the hdfs protocol, some extra jars and env vars should be installed/exported).
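
To make the idea concrete, here is a tiny, hypothetical pre-flight check a test script could run before starting databend-query, to confirm that LD_LIBRARY_PATH actually reaches a libjvm.so; the $JAVA_HOME/lib/server layout assumed in the hint below is the usual one for Linux JDK 11 builds, not something databend mandates:

```rust
use std::env;
use std::path::Path;

/// Returns true if any directory on LD_LIBRARY_PATH contains libjvm.so.
fn libjvm_reachable() -> bool {
    env::var("LD_LIBRARY_PATH")
        .unwrap_or_default()
        .split(':')
        .any(|dir| Path::new(dir).join("libjvm.so").exists())
}

fn main() {
    if libjvm_reachable() {
        println!("libjvm.so found on LD_LIBRARY_PATH; databend-query should be able to start");
    } else {
        eprintln!("libjvm.so not found; try: export LD_LIBRARY_PATH=$JAVA_HOME/lib/server:$LD_LIBRARY_PATH");
    }
}
```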

Thus, my suggestions are:

A reference from opendal

https://github.com/datafuselabs/opendal/blob/d8aa1265313aad3280ec09ae9babfe9b74c64d04/.github/workflows/service_test_hdfs.yml#L34-L43

I've not tested the above idea and am not sure if it works, but if it seems reasonable to you, would you please give it a try?


@Xuanwo in this PR, services-hdfs is just "linked" with databend-query: is it enough that we just install the JDK and setup the LD_LIBRARY_PATH properly, to bring up databend-query?

Xuanwo commented 1 year ago

is it enough that we just install the JDK and setup the LD_LIBRARY_PATH properly, to bring up databend-query?

Yep. databend-query with services-hdfs just needs libjvm.so to be compiled. The current setup looks good to me.

Xuanwo commented 1 year ago

Sorry for the delay; I have approved this workflow run.

FANNG1 commented 1 year ago

Thanks @dantengsky @Xuanwo, databend can start up now, but it fails to query hive data, because the storage is "fs" not "hdfs". We may get test_stateful_hive_standalone to run through by changing the storage.

dantengsky commented 1 year ago

@sandflee

Seems the build and run-time dependency issues are resolved.

The failure of test_stateful_hive_standalone is caused by attempting to access hdfs through fs::backend, which, at least in the CI setup, is not possible.

2022-06-17T07:18:23.397842Z  INFO databend_query::servers::mysql::mysql_interactive_worker: Normal query: select * from hive.default.pokes
2022-06-17T07:18:23.505068Z ERROR opendal::services::fs::backend: object /home/runner/work/databend/databend/./.databend/stateless_test_data/user/hive/warehouse/pokes/ list: Custom { kind: NotFound, error: ObjectError { op: "list", path: "/home/runner/work/databend/databend/./.databend/stateless_test_data/user/hive/warehouse/pokes/", source: No such file or directory (os error 2) } }    
2022-06-17T07:18:23.512263Z ERROR databend_query::servers::mysql::writers::query_result_writer: OnQuery Error: Code: 3001, displayText = entity not found (object error: (op: list, path: /home/runner/work/databend/databend/./.databend/stateless_test_data/user/hive/warehouse/pokes/, source: No such file or directory (os error 2))).

FANNG1 commented 1 year ago

Maybe we could split the functionality of accessing data via hdfs-storage into another PR?

OK, removed the storage-hdfs dependency in the latest code.

FANNG1 commented 1 year ago

Using a tricky workaround to run the hive CI: replacing hdfs storage with local storage (sketched below).
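
A rough sketch of what that swap could mean in the test setup; the config shape, paths, and namenode address below are purely illustrative, not databend's actual configuration types:

```rust
/// Hypothetical storage selection for the hive integration test.
enum StorageConfig {
    /// Local filesystem backend: the hive warehouse directory is copied or
    /// mounted onto the runner's disk, so no datanode connectivity is needed.
    Fs { root: String },
    /// Real HDFS backend, used outside CI when a cluster is reachable.
    Hdfs { name_node: String },
}

fn hive_ci_storage(in_ci: bool) -> StorageConfig {
    if in_ci {
        StorageConfig::Fs {
            root: "/tmp/hive/warehouse".to_string(),
        }
    } else {
        StorageConfig::Hdfs {
            name_node: "hdfs://namenode:8020".to_string(),
        }
    }
}

fn main() {
    match hive_ci_storage(true) {
        StorageConfig::Fs { root } => println!("using local fs storage at {root}"),
        StorageConfig::Hdfs { name_node } => println!("using hdfs at {name_node}"),
    }
}
```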