datafuselabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

feat(function): Implement map_cat function #15348

Closed shamb0 closed 1 week ago

shamb0 commented 3 weeks ago

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Implement map_cat function

Tests

Type of change


This change is Reviewable

shamb0 commented 3 weeks ago

Hi @b41sh,

I wanted to let you know that I've completed the first draft of the map_cat implementation. When you have a moment, I would greatly appreciate it if you could review the code and provide your feedback. Please feel free to highlight any areas that need improvement or suggest alternative approaches.

Thank you for taking the time to assist me with this task.

b41sh commented 2 weeks ago

> Hi @b41sh,
>
> I wanted to let you know that I've completed the first draft of the map_cat implementation. When you have a moment, I would greatly appreciate it if you could review the code and provide your feedback. Please feel free to highlight any areas that need improvement or suggest alternative approaches.
>
> Thank you for taking the time to assist me with this task.

Hi @shamb0, thank you very much for your contribution, but I'm sorry, there are some problems with this implementation. map_cat is a function of the Map type, not the Variant type. It's true that an object of type JSON can also be considered a map, but this is different. I'm sorry that I didn't make it clear and caused you problems.

You can start by getting an introduction to the Map type and referring to the implementations of the map_keys and map_values functions in PR #15291.
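
For intuition, here is a minimal std-only Rust sketch of what concatenating typed Map values means, as opposed to merging Variant/JSON objects. BTreeMap stands in for a Map(String, Int64) value, and the duplicate-key override behaviour is an assumption for illustration only, not the confirmed map_cat semantics:

    use std::collections::BTreeMap;

    // Illustrative only: a typed map concat over owned values, not Databend's column code.
    fn map_cat_sketch(
        lhs: &BTreeMap<String, i64>,
        rhs: &BTreeMap<String, i64>,
    ) -> BTreeMap<String, i64> {
        let mut out = lhs.clone();
        // Assumption: entries from the right-hand map win on duplicate keys.
        for (k, v) in rhs {
            out.insert(k.clone(), *v);
        }
        out
    }

    fn main() {
        let a = BTreeMap::from([("a".into(), 1), ("b".into(), 2)]);
        let b = BTreeMap::from([("b".into(), 3), ("c".into(), 4)]);
        println!("{:?}", map_cat_sketch(&a, &b)); // {"a": 1, "b": 3, "c": 4}
    }

The point of the Map type is that the key and value types are fixed at the type level, whereas a Variant object is dynamically typed.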

shamb0 commented 2 weeks ago

> Hi @b41sh, I wanted to let you know that I've completed the first draft of the map_cat implementation. When you have a moment, I would greatly appreciate it if you could review the code and provide your feedback. Please feel free to highlight any areas that need improvement or suggest alternative approaches. Thank you for taking the time to assist me with this task.
>
> Hi @shamb0, thank you very much for your contribution, but I'm sorry, there are some problems with this implementation. map_cat is a function of the Map type, not the Variant type. It's true that an object of type JSON can also be considered a map, but this is different. I'm sorry that I didn't make it clear and caused you problems.
>
> You can start by getting an introduction to the Map type and referring to the implementations of the map_keys and map_values functions in PR #15291.

Hi @b41sh,

Thank you for the suggestion. I appreciate the guidance, and I will proceed with migrating to the map_type approach as recommended.

b41sh commented 2 weeks ago

Hi @shamb0, I found an article that describes how to write scalar functions, and documentation that describes the Map type, which you can refer to for the implementation.

https://docs.databend.com/guides/community/contributor/how-to-write-scalar-functions
https://docs.databend.com/sql/sql-reference/data-types/data-type-map

The parameters of the map_cat function are KvColumns; we can use KvColumnBuilder to construct the result column after the concat. We have implemented a lot of Array functions whose structure is similar to the Map functions, so you can refer to those implementations.
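
To make the KvColumn / KvColumnBuilder idea concrete, below is a deliberately simplified, std-only model of a map column (flattened key/value arrays plus per-row offsets) and a row-wise concat into a builder. The names put_item and commit_row only loosely mirror the real builders; treat this as a sketch, not the actual databend-common-expression types:

    // Simplified, std-only model of a map column: every row's key/value pairs are
    // stored flattened, and offsets[i]..offsets[i + 1] marks the pairs of row i.
    struct MapColModel {
        keys: Vec<String>,
        vals: Vec<i64>,
        offsets: Vec<usize>,
    }

    impl MapColModel {
        fn new() -> Self {
            Self { keys: vec![], vals: vec![], offsets: vec![0] }
        }
        // Loose analogue of put_item: add one (key, value) pair to the row being built.
        fn put_item(&mut self, key: &str, val: i64) {
            self.keys.push(key.to_string());
            self.vals.push(val);
        }
        // Loose analogue of committing a row: record where the current row ends.
        fn commit_row(&mut self) {
            self.offsets.push(self.keys.len());
        }
        fn num_rows(&self) -> usize {
            self.offsets.len() - 1
        }
    }

    // Row-wise map_cat: for each row, copy the pairs of both inputs, then close the row.
    fn concat_into(lhs: &MapColModel, rhs: &MapColModel, out: &mut MapColModel) {
        for row in 0..lhs.num_rows() {
            for i in lhs.offsets[row]..lhs.offsets[row + 1] {
                out.put_item(&lhs.keys[i], lhs.vals[i]);
            }
            for i in rhs.offsets[row]..rhs.offsets[row + 1] {
                out.put_item(&rhs.keys[i], rhs.vals[i]);
            }
            out.commit_row();
        }
    }

    fn main() {
        let mut a = MapColModel::new();
        a.put_item("k1", 1);
        a.commit_row();
        let mut b = MapColModel::new();
        b.put_item("k2", 2);
        b.commit_row();
        let mut out = MapColModel::new();
        concat_into(&a, &b, &mut out);
        assert_eq!(out.offsets, vec![0, 2]); // one output row containing both pairs
    }

The flattened-values-plus-offsets layout is what makes per-row appends cheap: concatenating two maps for one row is just copying both ranges and recording a single new offset.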

shamb0 commented 2 weeks ago

Hi @b41sh,

[image]

Thank you for your ongoing support!

b41sh commented 1 week ago

Hi @shamb0, sorry for the late reply. It looks like a lot of progress has been made; after fixing a few more minor issues it will be ready.

shamb0 commented 1 week ago

Hi @b41sh,

No worries. Thank you for the thorough review and feedback. I appreciate the level of detail you've provided. I'll incorporate your feedback and get the update back to you ASAP.

shamb0 commented 1 week ago

Hi @b41sh,

I wanted to give you a quick update on the review reworks I'm working on. They're progressing slowly.

I did want to bring up a challenge I encountered with the ArrayColumnBuilder::append_column() function and get your thoughts on the best approach. In the src/query/functions/src/scalars/map.rs file, I've implemented two versions of the map_cat function:

  1. Implementation-01: map_cat without using ArrayType
  2. Implementation-02: map_cat_impl2 using ArrayType

Both implementations are currently working, but Implementation-02 required a quick patch on the end_offset calculation within ArrayColumnBuilder::append_column(). Please consider this patch as a bit of a temporary fix, and I'm actively working on finding a more elegant solution.

Here's where I could use your insight: since output_map is of type ArrayColumnBuilder<KvColumnBuilder<>>, I'm wondering if we could insert the datasets from the source lhs and rhs maps directly into output_map without using ArrayType, similar to what I did in Implementation-01. From what I've observed, this approach seems to have a lower cost during the concatenation process compared to Implementation-02.

I'd really appreciate your thoughts on the optimal implementation for this context.

Please let me know if you need any further clarification or have additional suggestions.

Thanks for your time and support, @b41sh. I really appreciate it.
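
One note on the end_offset calculation mentioned above: in an offsets-based builder the offsets are cumulative, so appending another column's rows means rebasing that column's offsets onto the builder's current last offset. A small self-contained sketch of that invariant (a simplified stand-in, not the real ArrayColumnBuilder::append_column):

    // Offsets are cumulative: the builder's last offset is the total number of items
    // written so far, so offsets appended from another column must be shifted by it.
    fn append_offsets(builder_offsets: &mut Vec<u64>, appended: &[u64]) {
        let base = *builder_offsets.last().unwrap(); // builders start from a single 0
        // Skip the appended column's leading 0 and rebase every row-end offset.
        builder_offsets.extend(appended.iter().skip(1).map(|end| base + end));
    }

    fn main() {
        let mut offsets = vec![0u64, 2, 5];       // two rows already built (lengths 2 and 3)
        append_offsets(&mut offsets, &[0, 3, 4]); // append two rows of lengths 3 and 1
        assert_eq!(offsets, vec![0, 2, 5, 8, 9]);
        println!("{:?}", offsets);
    }

If the appended offsets were copied without adding the base, every row after the append would point at the wrong slice of the underlying key/value data, which is the kind of discrepancy a quick end_offset patch would be papering over.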

b41sh commented 1 week ago

> Hi @b41sh,
>
> I wanted to give you a quick update on the review reworks I'm working on. They're progressing slowly.
>
> I did want to bring up a challenge I encountered with the ArrayColumnBuilder::append_column() function and get your thoughts on the best approach. In the src/query/functions/src/scalars/map.rs file, I've implemented two versions of the map_cat function:
>
>   1. Implementation-01: map_cat without using ArrayType
>   2. Implementation-02: map_cat_impl2 using ArrayType
>
> Both implementations are currently working, but Implementation-02 required a quick patch on the end_offset calculation within ArrayColumnBuilder::append_column(). Please consider this patch as a bit of a temporary fix, and I'm actively working on finding a more elegant solution.
>
> Here's where I could use your insight: since output_map is of type ArrayColumnBuilder<KvColumnBuilder<>>, I'm wondering if we could insert the datasets from the source lhs and rhs maps directly into output_map without using ArrayType, similar to what I did in Implementation-01. From what I've observed, this approach seems to have a lower cost during the concatenation process compared to Implementation-02.
>
> I'd really appreciate your thoughts on the optimal implementation for this context.
>
> Please let me know if you need any further clarification or have additional suggestions.
>
> Thanks for your time and support, @b41sh. I really appreciate it.

We can use the put_item function in ArrayColumnBuilder, and the value is a KvPair struct. For example:

            let mut builder = ArrayType::create_builder(capacity, ctx.generics);
            for (k1, v1) in input_map1.iter() {
                builder.put_item((k1, v1));
            }
shamb0 commented 1 week ago

Hi @b41sh,

  • I've incorporated updates to meet the review feedback requirements. Could you kindly review them and advise if there are any outstanding areas that require attention?
  • Regarding the sqllogictest and unit tests, I've added basic sanity tests to address the essentials. If you believe additional test coverage is necessary, I'm happy to work on that. Just let me know your thoughts.

b41sh commented 1 week ago

> Hi @b41sh,
>
>   • I've incorporated updates to meet the review feedback requirements. Could you kindly review them and advise if there are any outstanding areas that require attention?
>   • Regarding the sqllogictest and unit tests, I've added basic sanity tests to address the essentials. If you believe additional test coverage is necessary, I'm happy to work on that. Just let me know your thoughts.

Hi @shamb0, thank you very much for the changes; the implementation of the function looks fine to me. But it looks like some of the code is temporary debugging code; could you please remove it to keep the code tidy and complete? Also, the CI tests are failing; you'll need to make some changes to the code to ensure it passes all of those tests.

shamb0 commented 1 week ago

> Hi @b41sh,
>
>   • I've incorporated updates to meet the review feedback requirements. Could you kindly review them and advise if there are any outstanding areas that require attention?
>   • Regarding the sqllogictest and unit tests, I've added basic sanity tests to address the essentials. If you believe additional test coverage is necessary, I'm happy to work on that. Just let me know your thoughts.
>
> Hi @shamb0, thank you very much for the changes; the implementation of the function looks fine to me. But it looks like some of the code is temporary debugging code; could you please remove it to keep the code tidy and complete? Also, the CI tests are failing; you'll need to make some changes to the code to ensure it passes all of those tests.

Hi @b41sh,

I wanted to check if we could incorporate some changes in the tests/sqllogictests/ directory. The main updates are aimed at making the run_file CLI argument fully operational. This would enable us to complete the sqllogictests more efficiently at the test suite file level.

Here's an example command to illustrate how the run_file argument works:

clear && \
    cargo run \
    -p databend-sqllogictests \
    --bin databend-sqllogictests \
    -- \
    --handlers "http" \
    --run_dir functions \
    --run_file 02_0074_function_map.test \
    --debug \
    --parallel 2

Please let me know your thoughts.

b41sh commented 1 week ago

> Hi @b41sh,
>
> I wanted to check if we could incorporate some changes in the tests/sqllogictests/ directory. The main updates are aimed at making the run_file CLI argument fully operational. This would enable us to complete the sqllogictests more efficiently at the test suite file level.
>
> Here's an example command to illustrate how the run_file argument works:
>
> clear && \
>     cargo run \
>     -p databend-sqllogictests \
>     --bin databend-sqllogictests \
>     -- \
>     --handlers "http" \
>     --run_dir functions \
>     --run_file 02_0074_function_map.test \
>     --debug \
>     --parallel 2
>
> Please let me know your thoughts.

The unit test fails; you need to modify the file src/query/functions/tests/it/scalars/testdata/function_list.txt to add map_cat to the function list, and then run make unit-test to check whether the unit tests pass.

The SQL logic tests can be run like this. First start databend-meta and databend-query, either with the make run-debug command or manually with these commands:

./target/debug/databend-meta --single --log-level=ERROR
./target/debug/databend-query -c scripts/ci/deploy/config/databend-query-node-1.toml

Then run sqllogictests with the following command

./target/debug/databend-sqllogictests --handlers mysql --run_dir query --skip_dir spill,join,window_function,issues,case_sensitivity
b41sh commented 1 week ago

Hi @shamb0, this PR is almost OK; please resolve the code conflicts with the main branch and fix the CI unit test.

shamb0 commented 1 week ago

> Hi @b41sh, I wanted to check if we could incorporate some changes in the tests/sqllogictests/ directory. The main updates are aimed at making the run_file CLI argument fully operational. This would enable us to complete the sqllogictests more efficiently at the test suite file level. Here's an example command to illustrate how the run_file argument works:
>
> clear && \
>     cargo run \
>     -p databend-sqllogictests \
>     --bin databend-sqllogictests \
>     -- \
>     --handlers "http" \
>     --run_dir functions \
>     --run_file 02_0074_function_map.test \
>     --debug \
>     --parallel 2
>
> Please let me know your thoughts.
>
> The unit test fails; you need to modify the file src/query/functions/tests/it/scalars/testdata/function_list.txt to add map_cat to the function list, and then run make unit-test to check whether the unit tests pass.
>
> The SQL logic tests can be run like this. First start databend-meta and databend-query, either with the make run-debug command or manually with these commands:
>
> ./target/debug/databend-meta --single --log-level=ERROR
> ./target/debug/databend-query -c scripts/ci/deploy/config/databend-query-node-1.toml
>
> Then run sqllogictests with the following command
>
> ./target/debug/databend-sqllogictests --handlers mysql --run_dir query --skip_dir spill,join,window_function,issues,case_sensitivity

Hi @b41sh,

I created the file src/query/functions/tests/it/scalars/testdata/function_list.txt and ran the make unit-test command as requested. However, I encountered a failure related to databend-common-openai.

Can you please advise if it's okay to ignore this failure and move forward, or if there are any additional steps I should take to resolve it?

PASS [18.870s] databend-common-meta-raft-store::it state_machine::schema_api_impl::test_meta_embedded_single
------------
Summary [25.154s] 1328/1860 tests run: 1326 passed, 2 failed, 4 skipped
FAIL [0.599s] databend-common-openai::it openai::test_openai_sql_completion
FAIL [0.586s] databend-common-openai::it openai::test_openai_text_completion
error: test run failed
make: *** [Makefile:70: unit-test] Error 100
shamb0 commented 1 week ago

> Hi @shamb0, this PR is almost OK; please resolve the code conflicts with the main branch and fix the CI unit test.

Thank you @b41sh. I am working on it. Around 50% done so far. I'll have the rest wrapped up within the next hour.

b41sh commented 1 week ago

> Hi @b41sh,
>
> I created the file src/query/functions/tests/it/scalars/testdata/function_list.txt and ran the make unit-test command as requested. However, I encountered a failure related to databend-common-openai.
>
> Can you please advise if it's okay to ignore this failure and move forward, or if there are any additional steps I should take to resolve it?
>
> PASS [18.870s] databend-common-meta-raft-store::it state_machine::schema_api_impl::test_meta_embedded_single
> ------------
> Summary [25.154s] 1328/1860 tests run: 1326 passed, 2 failed, 4 skipped
> FAIL [0.599s] databend-common-openai::it openai::test_openai_sql_completion
> FAIL [0.586s] databend-common-openai::it openai::test_openai_text_completion
> error: test run failed
> make: *** [Makefile:70: unit-test] Error 100

src/query/functions/tests/it/scalars/testdata/function_list.txt already exists; you don't need to add a new file. The result diff looks like the following; just adding the map_cat function like this is OK. The other failures may be related to the environment and can be ignored.

Differences (-left|+right):
 0 map(Array(Nothing), Array(Nothing)) :: Map(Nothing)
 1 map(Array(Nothing) NULL, Array(Nothing) NULL) :: Map(Nothing) NULL
 2 map(Array(T0), Array(T1)) :: Map(T0, T1)
 3 map(Array(T0) NULL, Array(T1) NULL) :: Map(T0, T1) NULL
+0 map_cat(Map(Nothing), Map(Nothing)) :: Map(Nothing)
+1 map_cat(Map(Nothing) NULL, Map(Nothing) NULL) :: Map(Nothing) NULL
+2 map_cat(Map(T0, T1), Map(T0, T1)) :: Map(T0, T1)
+3 map_cat(Map(T0, T1) NULL, Map(T0, T1) NULL) :: Map(T0, T1) NULL
 0 map_keys(Map(Nothing)) :: Array(Nothing)
 1 map_keys(Map(T0, T1)) :: Array(T0)
 2 map_keys(Map(T0, T1) NULL) :: Array(T0) NULL
 0 map_values(Map(Nothing)) :: Array(Nothing)
shamb0 commented 1 week ago

> Hi @b41sh, I created the file src/query/functions/tests/it/scalars/testdata/function_list.txt and ran the make unit-test command as requested. However, I encountered a failure related to databend-common-openai. Can you please advise if it's okay to ignore this failure and move forward, or if there are any additional steps I should take to resolve it?
>
> PASS [18.870s] databend-common-meta-raft-store::it state_machine::schema_api_impl::test_meta_embedded_single
> ------------
> Summary [25.154s] 1328/1860 tests run: 1326 passed, 2 failed, 4 skipped
> FAIL [0.599s] databend-common-openai::it openai::test_openai_sql_completion
> FAIL [0.586s] databend-common-openai::it openai::test_openai_text_completion
> error: test run failed
> make: *** [Makefile:70: unit-test] Error 100
>
> src/query/functions/tests/it/scalars/testdata/function_list.txt already exists; you don't need to add a new file. The result diff looks like the following; just adding the map_cat function like this is OK. The other failures may be related to the environment and can be ignored.
>
> Differences (-left|+right):
>  0 map(Array(Nothing), Array(Nothing)) :: Map(Nothing)
>  1 map(Array(Nothing) NULL, Array(Nothing) NULL) :: Map(Nothing) NULL
>  2 map(Array(T0), Array(T1)) :: Map(T0, T1)
>  3 map(Array(T0) NULL, Array(T1) NULL) :: Map(T0, T1) NULL
> +0 map_cat(Map(Nothing), Map(Nothing)) :: Map(Nothing)
> +1 map_cat(Map(Nothing) NULL, Map(Nothing) NULL) :: Map(Nothing) NULL
> +2 map_cat(Map(T0, T1), Map(T0, T1)) :: Map(T0, T1)
> +3 map_cat(Map(T0, T1) NULL, Map(T0, T1) NULL) :: Map(T0, T1) NULL
>  0 map_keys(Map(Nothing)) :: Array(Nothing)
>  1 map_keys(Map(T0, T1)) :: Array(T0)
>  2 map_keys(Map(T0, T1) NULL) :: Array(T0) NULL
>  0 map_values(Map(Nothing)) :: Array(Nothing)

I've made some changes to src/query/functions/tests/it/scalars/testdata/function_list.txt. Could you spare a few minutes to review and ensure they align with the requirements? Your quick feedback would be appreciated.

b41sh commented 1 week ago

> I've made some changes to src/query/functions/tests/it/scalars/testdata/function_list.txt. Could you spare a few minutes to review and ensure they align with the requirements? Your quick feedback would be appreciated.

The unit test is fine; now remove the temporary modification code and we can merge this PR.

shamb0 commented 1 week ago

[mysql log]

I'm running into some issues while trying to complete the sqllogictest.

  • Please find the full trace in the log.
  • Commands executed:

./target/debug/databend-meta \
  --single \
  --log-level=INFO

./target/debug/databend-query \
  -c scripts/ci/deploy/config/databend-query-node-1.toml \
  --internal-enable-sandbox-tenant \
  --log-level=INFO

cargo run \
  -p databend-sqllogictests \
  --bin databend-sqllogictests \
  -- \
  --handlers "mysql" \
  --run_dir query \
  --skip_dir spill,join,window_function,issues,case_sensitivity \
  --debug \
  --parallel 2

  • Are these known issues? Can I ignore them for this PR?
  • Thanks for your help!
b41sh commented 1 week ago

> [mysql log]
>
> I'm running into some issues while trying to complete the sqllogictest.
>
>   • Please find the full trace in the log.
>   • Commands executed:
>
> ./target/debug/databend-meta \
>   --single \
>   --log-level=INFO
>
> ./target/debug/databend-query \
>   -c scripts/ci/deploy/config/databend-query-node-1.toml \
>   --internal-enable-sandbox-tenant \
>   --log-level=INFO
>
> cargo run \
>   -p databend-sqllogictests \
>   --bin databend-sqllogictests \
>   -- \
>   --handlers "mysql" \
>   --run_dir query \
>   --skip_dir spill,join,window_function,issues,case_sensitivity \
>   --debug \
>   --parallel 2
>
>   • Are these known issues? Can I ignore them for this PR?
>   • Thanks for your help!

This is because you added tests after DROP DATABASE map_func_test. Please make sure this SQL stays at the end of the tests/sqllogictests/suites/query/functions/02_0074_function_map.test file:

statement ok
DROP DATABASE map_func_test
shamb0 commented 1 week ago

> LGTM @shamb0 Finally it works! Thank you for your constant modifications and retries over the last few days; feel free to keep contributing other code!

Thank you @b41sh,

Thanks for your guidance throughout this PR journey. I've learned a lot from working with you on this stack.

I'm eager to continue contributing. If you have any priority issues in mind, please feel free to assign them to me.

Looking forward to the next challenge!

b41sh commented 1 week ago

@shamb0 I'm glad I could help you. We still have a few more map functions to implement, so if you're interested and have the time, you can pick one and develop it. I'm sure that with this experience, the subsequent development will be easy.

shamb0 commented 1 week ago

> @shamb0 I'm glad I could help you. We still have a few more map functions to implement, so if you're interested and have the time, you can pick one and develop it. I'm sure that with this experience, the subsequent development will be easy.

Thank you @b41sh, I can take this on.