(Team 2) Design: Query Rewriter

chenlica commented 8 years ago

Team 2:

Please do the following for your task.

Put the design on https://github.com/TextDB/textdb/wiki/CS290-2016S-Task:-Character-based-Fuzzy-Search
Use this issue to keep track of the progress.

Add @kishore-narendran to this issue.

chenlica commented 8 years ago

Per our discussion in the lecture today, please do the following:

Model the features as operators.
For each operator, come up with a few test cases.
Put your initial design (including the presentation) on your wiki page https://github.com/TextDB/textdb/wiki/CS290-2016S-Task:-Character-based-Fuzzy-Search.
Update this issue with your progress.

chenlica commented 8 years ago

@shiladityasen and @kishore-narendran: Please refine your design to come up with an interface for the classes. We can schedule a meeting this week when you are ready.

JavierJia commented 8 years ago

Here is the Chinese tokenizer code in the SRCH2 codebase: https://github.com/SRCH2/srch2-ngn/blob/master/src/core/analyzer/ChineseTokenizer.cpp#L197

ss3n commented 8 years ago

Thank you Professor,

We will go through the literature to find out how to incorporate that into our module. For now, we are going to define the interfaces, write some tests, and build a very naive version of the algorithm in the sandbox.

On Wed, Apr 20, 2016 at 8:45 PM, Jianfeng Jia notifications@github.com wrote:

Here is the Chinese tokenizer code in SRCH2 codebase: https://github.com/SRCH2/srch2-ngn/blob/master/src/core/analyzer/ChineseTokenizer.cpp#L197

kishore-narendran commented 8 years ago

The current status of the issue is as follows:

The interface to the QueryRewriter has been set up.
QueryRewriter internally uses FuzzyTokenizer to find the permutations of search strings needed.
A few test cases have also been written in QueryRewriterTest.java.
A new type of IField was written to reflect the type of ITuple that the QueryRewriter will return, i.e. a list of Strings.

The above changes were submitted in the PR https://github.com/TextDB/textdb/pull/75

This is what is up next:

We will write a naive version of the FuzzyTokenizer and test it in textdb-sandbox
The naive version will then be ported into textdb-dataflow to the actual FuzzyTokenizer class
We will then optimize and improve the FuzzyTokenizer algorithm, and iterate
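
To make the naive step concrete, here is a rough sketch of what such a first version could look like, assuming the tokenization task is to split a term with missing spaces into dictionary words. It tries every split point recursively and keeps the splits whose pieces are all in the dictionary. The class name `NaiveFuzzyTokenizerSketch`, the method `segment`, and the tiny in-line dictionary are hypothetical and are not taken from the textdb code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the FuzzyTokenizer class from textdb.
class NaiveFuzzyTokenizerSketch {

    // Returns every way to split `query` into words that all occur in `dictionary`.
    // Exponential in the worst case, which is why this is only a starting point.
    static List<String> segment(String query, Set<String> dictionary) {
        List<String> results = new ArrayList<>();
        if (query.isEmpty()) {
            results.add(""); // exactly one way to segment the empty string
            return results;
        }
        for (int end = 1; end <= query.length(); end++) {
            String prefix = query.substring(0, end);
            if (!dictionary.contains(prefix)) {
                continue;
            }
            // Combine this prefix with every segmentation of the remainder.
            for (String rest : segment(query.substring(end), dictionary)) {
                results.add(rest.isEmpty() ? prefix : prefix + " " + rest);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Assumed sample dictionary, for illustration only.
        Set<String> dictionary = new HashSet<>(Arrays.asList("new", "york", "city", "newyork"));
        System.out.println(segment("newyorkcity", dictionary));
    }
}
```

With the sample dictionary above, the term "newyorkcity" has two valid splits, "new york city" and "newyork city"; a term with no valid split yields an empty list.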

@chenlica @JavierJia

This is currently the status of the issue!

ss3n commented 8 years ago

QueryRewriter will require certain dictionaries to be present (or provided by the user), such as a dictionary of English words. Which directory would be ideal for storing this dictionary? @chenlica @JavierJia @rajesh9625 @sandeepreddy602

sandeepreddy602 commented 8 years ago

@shiladityasen... You can create one more source folder src/test/resources and add your directories there.

chenlica commented 8 years ago

@sandeepreddy602 : This English word dictionary is part of the operator, so it should NOT be part of the test folder, right? Can each operator have its own local "resources" directory for such files?

sandeepreddy602 commented 8 years ago

@chenlica.. In that case we can create the source folder src/main/resources and create a directory for each operator. We can add the operator-related data files to the corresponding directory.

chenlica commented 8 years ago

@sandeepreddy602 : Agreed. We can use this PR to create this folder structure. I hope it's easy for each operator to locate that file using a relative path.
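
As a hedged illustration of the relative-path lookup: files under src/main/resources end up on the classpath in a standard Maven build, so an operator can usually read them with `Class.getResourceAsStream`. The sketch below assumes a one-word-per-line file; the class name `DictionaryLoaderSketch` and the resource path in the usage note are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the textdb code base.
class DictionaryLoaderSketch {

    // Reads one word per line, lower-cases it, and skips blank lines.
    static Set<String> loadWords(InputStream in) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // A leading "/" makes the path relative to the classpath root,
    // which is where Maven copies the contents of src/main/resources.
    static Set<String> loadFromClasspath(String resourcePath) throws IOException {
        InputStream in = DictionaryLoaderSketch.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("Resource not found on classpath: " + resourcePath);
        }
        return loadWords(in);
    }
}
```

A call such as `DictionaryLoaderSketch.loadFromClasspath("/someoperator/words.txt")` (an assumed path) would then work the same way from tests and from the packaged jar, without depending on the working directory.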

ss3n commented 8 years ago

@chenlica @sandeepreddy602 : Just to clarify, we are deciding to create a separate package src/main/resources for every project (such as textdb-dataflow)? Then every operator implemented in the project can access the files in src/main/resources.

Is this correct?

sandeepreddy602 commented 8 years ago

@shiladityasen .. I sent an email regarding this. Please refer to that.

ss3n commented 8 years ago

@sandeepreddy602 : Thank you. It is clear now.

ss3n commented 8 years ago

@chenlica : Professor, by our standard protocol for uploading files to GitHub, we have decided not to stage data files in git. However, for FuzzyTokenizer to work, it needs access to a file containing all English words (a word knowledge base), which it uses to tokenize a single term.

Without this file it cannot work, so all tests would fail and the Travis build would fail with them.

Am I allowed to push this word knowledge base file as part of the src/main/resources folder?

chenlica commented 8 years ago

This English dictionary is a critical part of your "FuzzyTokenizer," and should be part of the package on git. How big is it?

@sandeepreddy602 : what's a good folder for this file?

ss3n commented 8 years ago

The dictionary we are using now is 1.2 MB.

On Fri, Apr 29, 2016 at 8:50 AM Chen Li notifications@github.com wrote:

This English dictionary is critical part of your "FuzzyTokenizer," and should be part of the package on git. How big is it?

@sandeepreddy602 https://github.com/sandeepreddy602 : what's a good folder for this file?

chenlica commented 8 years ago

Then it's OK to include it in git. As extra credit, see if we can zip it and let the program "unzip" it. If not, never mind, since it's not a big deal.

Let's wait for the answer from @sandeepreddy602 about the location for this file.
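
For the "unzip" idea, one possible approach is to store the word list gzip-compressed and decompress it while reading, so nothing has to be written to disk. This is only a sketch under that assumption: it uses gzip from `java.util.zip` rather than a .zip archive, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; not part of the textdb code base.
class CompressedDictionarySketch {

    // Decompresses a gzip stream on the fly and returns its lines.
    static List<String> readGzippedLines(InputStream compressed) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Helper used only to build sample input for the demo below.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("new\nyork\ncity\n");
        System.out.println(readGzippedLines(new ByteArrayInputStream(data)));
    }
}
```

In a real operator the compressed stream would come from the classpath resource instead of the in-memory byte array used here.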

sandeepreddy602 commented 8 years ago

@shiladityasen.. You can include that file in the src/main/resources/fuzzymatcher folder. By default, Maven packages the files present inside the src/main/resources directory.

ss3n commented 8 years ago

@sandeepreddy602 : Understood... I will be placing it in the src/main/resources/queryrewriter folder.

@chenlica : I will try to find a good way to unzip the file into memory on the fly.

chenlica commented 8 years ago

@shiladityasen : don't spend more than 1 hour on the "zip" issue since it's NOT important.

kishore-narendran commented 8 years ago

This is the current status of this issue:

An interface and tests for QueryRewriter were completed in https://github.com/TextDB/textdb/pull/75
A naive implementation of the QueryRewriter was completed in https://github.com/TextDB/textdb/pull/83

This is what is up next:

An optimized implementation will follow in an upcoming PR. It will use either @JavierJia's dynamic programming approach or an algorithm that performs the tokenization with a Trie data structure, which we are also looking at.
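
To illustrate the Trie option, here is a minimal sketch, not the design we have settled on: the dictionary is stored in a trie, and a single walk from a given offset finds every dictionary word that starts there, without building and hashing each substring. The names `TrieSketch`, `insert`, and `wordEnds` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of the textdb code base.
class TrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Adds a dictionary word, creating nodes along its path as needed.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns the end offsets (exclusive) of all dictionary words in `text`
    // that begin at `start`. Stops as soon as no word can continue.
    List<Integer> wordEnds(String text, int start) {
        List<Integer> ends = new ArrayList<>();
        Node node = root;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isWord) {
                ends.add(i + 1);
            }
        }
        return ends;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String w : new String[] {"new", "york", "city", "newyork"}) {
            trie.insert(w);
        }
        System.out.println(trie.wordEnds("newyorkcity", 0));
    }
}
```

A segmenter can call `wordEnds` at each offset to get its candidate words, so the trie can be combined with either the recursive or the dynamic programming approach.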

chenlica commented 8 years ago

Team 2: Any update on your task? Thanks for agreeing to present the dynamic programming algorithm today. It will be good to create a Google presentation for your work.

ss3n commented 8 years ago

Hello Professor,

We are deciding on the basic layout of the presentation and on some examples that illustrate how it works and where it fits in our application.

chenlica commented 8 years ago

@shiladityasen and @kishore-narendran when can you raise a PR to finish it?

ss3n commented 8 years ago

@chenlica : Professor, we intend to raise a PR later tonight or by tomorrow morning.

kishore-narendran commented 8 years ago

https://github.com/TextDB/textdb/pull/120

The above PR:

Implements the dynamic programming algorithm for performing tokenization with the highest likelihood.
Modifies the test cases.
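
For readers who want the general idea, the following is a sketch of highest-likelihood segmentation by dynamic programming, under the assumption of a simple unigram model where each word has an independent probability. It is not a copy of the code in the PR; the class name, method name, and the probabilities in the usage note are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the implementation from the PR.
class LikelihoodSegmenterSketch {

    // Returns the split of `query` into known words that maximizes the sum of
    // log-probabilities, or null if no split into known words exists.
    static String bestSegmentation(String query, Map<String, Double> wordProbability) {
        int n = query.length();
        double[] best = new double[n + 1]; // best[i]: best log-likelihood of query[0..i)
        int[] back = new int[n + 1];       // back[i]: start of the last word in that split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Double p = wordProbability.get(query.substring(start, end));
                if (p == null || best[start] == Double.NEGATIVE_INFINITY) {
                    continue; // unknown word, or the prefix cannot be segmented
                }
                double score = best[start] + Math.log(p);
                if (score > best[end]) {
                    best[end] = score;
                    back[end] = start;
                }
            }
        }
        if (best[n] == Double.NEGATIVE_INFINITY) {
            return null;
        }
        // Follow the back pointers from the end to recover the words in order.
        Deque<String> words = new ArrayDeque<>();
        for (int end = n; end > 0; end = back[end]) {
            words.addFirst(query.substring(back[end], end));
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        Map<String, Double> p = new HashMap<>(); // assumed probabilities
        p.put("new", 0.4);
        p.put("york", 0.3);
        p.put("city", 0.2);
        p.put("newyork", 0.1);
        System.out.println(bestSegmentation("newyorkcity", p));
    }
}
```

With the assumed probabilities in `main`, "new york city" scores 0.4 × 0.3 × 0.2 = 0.024 and "newyork city" scores 0.1 × 0.2 = 0.02, so the first split is chosen. The table has one entry per prefix and each entry considers every earlier split point, so the number of candidate words examined is quadratic in the query length instead of exponential.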

kishore-narendran commented 8 years ago

@chenlica We completed the documentation, and this can be found at https://github.com/TextDB/textdb/wiki/CS290-2016S-Task:-Query-Rewriter. Do you think we can proceed to close this issue?

chenlica commented 8 years ago

I will review it first. Thanks.

chenlica commented 8 years ago

I reviewed and polished your wiki page. Now you can close the issue.

Congratulations!

ss3n commented 8 years ago

Thank you so much, Professor :)

On Tue, Jun 7, 2016 at 5:58 PM, Chen Li notifications@github.com wrote:

I reviewed and polished your wiki page. Now you can close the issue.

Congratulations!

(Team 2) Design: Query Rewriter #29