apache / kvrocks

Apache Kvrocks is a distributed key-value NoSQL database that uses RocksDB as its storage engine and is compatible with the Redis protocol.
https://kvrocks.apache.org/
Apache License 2.0

Support full-text searching in KQIR #2419

Open PragmaTwice opened 3 months ago

PragmaTwice commented 3 months ago

Motivation

This includes:

Refer to:
https://github.com/pisa-engine/pisa
https://clucene.sourceforge.net/
https://redis.io/docs/latest/develop/interact/search-and-query/query/full-text/

Solution

No response

lbihani9 commented 2 months ago

I'd like to work on this task. Can this be assigned to me?

git-hulk commented 2 months ago

@lbihani9 Assigned, thank you!

lbihani9 commented 1 month ago

@git-hulk, I’ve been going through the codebase to better understand the execution flow and added some test commands. However, I find it quite slow to rebuild everything after every update just to check the changes. Does Kvrocks support any hot-reloading features for development?

I haven’t worked on large C++ projects before, so apologies if this is a naive question! 😅

git-hulk commented 1 month ago

@lbihani9 You can run ./x.py build for the first build, and then use cd build && make -j4 for subsequent rebuilds.

lbihani9 commented 1 month ago

Thanks! It's much faster now.

lbihani9 commented 1 month ago

@git-hulk I have been going through the codebase for the past couple of days to understand the scope of the task. I've also gone through the three links in this issue's description. Are we planning to write the indexing algorithms for full-text search from scratch, or to use APIs provided by open-source libraries like CLucene?

Also, I noticed that search currently does not support the TEXT data type, so we'll need to add that as well. Since the scope of this task is large, I'm planning to create a separate PR for each subtask (this will also help me get a better hang of the codebase):

  1. Supporting the TEXT data type.
  2. Implementing word tokenization.
  3. Implementing the indexing algorithm.
  4. Integrating full-text search support with KQIR.

If I've missed any subtask, please let me know. Also, do I need to write a doc first and get the idea approved?
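
To make subtasks 2 and 3 concrete, here is a minimal sketch of word tokenization feeding an inverted index. It is not Kvrocks or KQIR code: Tokenize and InvertedIndex are hypothetical names, the index lives only in memory, and it assumes ASCII text and non-negative document IDs. A real implementation would presumably persist postings in RocksDB and may well delegate this work to an existing library.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split text into lowercase alphanumeric tokens.
// Assumes ASCII input; any other character acts as a separator.
std::vector<std::string> Tokenize(const std::string &text) {
  std::vector<std::string> tokens;
  std::string current;
  for (unsigned char ch : text) {
    if (std::isalnum(ch)) {
      current.push_back(static_cast<char>(std::tolower(ch)));
    } else if (!current.empty()) {
      tokens.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) tokens.push_back(current);
  return tokens;
}

// Hypothetical in-memory inverted index: term -> set of document IDs.
class InvertedIndex {
 public:
  // Indexes every token of `text` under `doc_id`.
  void Add(int doc_id, const std::string &text) {
    assert(doc_id >= 0);  // assumption: document IDs are non-negative
    for (const auto &term : Tokenize(text)) postings_[term].insert(doc_id);
  }

  // Returns the sorted IDs of documents that contain every query term
  // (AND semantics). An empty query matches nothing.
  std::vector<int> Search(const std::string &query) const {
    const std::vector<std::string> terms = Tokenize(query);
    if (terms.empty()) return {};
    std::set<int> result;
    bool first = true;
    for (const auto &term : terms) {
      auto it = postings_.find(term);
      if (it == postings_.end()) return {};  // a missing term rules out all docs
      if (first) {
        result = it->second;
        first = false;
        continue;
      }
      // Intersect the running result with this term's posting set.
      std::set<int> next;
      for (int id : result) {
        if (it->second.count(id) > 0) next.insert(id);
      }
      result = std::move(next);
    }
    return std::vector<int>(result.begin(), result.end());
  }

 private:
  std::map<std::string, std::set<int>> postings_;
};
```

With this sketch, indexing two documents and calling Search with a multi-word query returns only the IDs of documents that contain all of the query's tokens, regardless of case.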

git-hulk commented 1 month ago

@lbihani9 Thanks for your efforts.

The KQIR module was created by @PragmaTwice, so I would like to hear suggestions from him.

> I want to understand if we're planning to write the indexing algorithms for full text search from scratch or use apis provided by open-source libraries like clucene?

I don't think we need to write the indexing algorithms from scratch. It's fine to use an open-source library whose license complies with the ASF requirements [1].

[1] https://www.apache.org/legal/resolved.html#category-a

dmazzella commented 1 week ago

@PragmaTwice Please add Apache Lucene++ as a reference.