Closed chenlica closed 8 years ago
Per our discussion in the lecture today, please do the following:
@shiladityasen and @kishore-narendran: Please refine your design to come up with an interface for the classes. We can schedule a meeting this week when you are ready.
Here is the Chinese tokenizer code in SRCH2 codebase: https://github.com/SRCH2/srch2-ngn/blob/master/src/core/analyzer/ChineseTokenizer.cpp#L197
Thank you Professor,
We will go through the literature to find how to incorporate that in our module. For now we are going to define the interfaces, write some tests and build a very naive algorithmic version in sandbox.
The current status of the issue is as follows:
- QueryRewriter has been set up.
- QueryRewriter internally uses FuzzyTokenizer to find the permutations of search strings needed.
- QueryRewriterTest.java
- IField was written to reflect the type of ITuple that the QueryRewriter will return, i.e. a list of Strings.
The above changes were submitted in the PR https://github.com/TextDB/textdb/pull/75
This is what is up next:
- FuzzyTokenizer and test it in textdb-sandbox
- textdb-dataflow to the actual FuzzyTokenizer class
- FuzzyTokenizer algorithm, and iterate
@chenlica @JavierJia
This is currently the status of the issue!
QueryRewriter will require certain dictionaries to be present (or provided by the user), such as a dictionary of English words. Which directory would be the ideal place to store this dictionary? @chenlica @JavierJia @rajesh9625 @sandeepreddy602
@shiladityasen: You can create one more source folder, src/test/resources, and add your dictionaries there.
@sandeepreddy602 : This English word dictionary is part of the operator, so it should NOT be part of the test folder, right? Can each operator have its own local "resources" directory for such files?
@chenlica: In that case we can create the source folder src/main/resources and create a directory for each operator. We can add the operator-related data files to the corresponding directory.
@sandeepreddy602 : Agreed. We can use this PR to create this folder structure. I hope it's easy for each operator to locate that file using a relative path.
@chenlica @sandeepreddy602 : Just to clarify, we are deciding to create a separate package src/main/resources for every project (such as textdb-dataflow)? Then every operator implemented in the project can access the files in src/main/resources. Is this correct?
@shiladityasen: I sent an email regarding this. Please refer to that.
@sandeepreddy602 : Thank you. It is clear now.
@chenlica : Professor, by our standard protocols for uploading files on Github, we have decided to leave out data files from the git stage. However for FuzzyTokenizer to work, it needs to access a file containing all English words (a word knowledge base) based on which it can perform tokenization of a single term.
Without this file, it cannot work and hence all tests would fail resulting in failure of Travis build.
Am I allowed to push this word knowledge base file as part of src/main/resources folder?
This English dictionary is a critical part of your "FuzzyTokenizer," and should be part of the package on git. How big is it?
@sandeepreddy602 : what's a good folder for this file?
The dictionary we are using now is 1.2 MB.
Then it's OK to include it in git. As extra credit, see if we can zip it and let the program "unzip" it. If not, never mind, since it's not a big deal.
Let's wait for the answer from @sandeepreddy602 about the location for this file.
@shiladityasen: You can include that file in the src/main/resources/fuzzymatcher folder. By default, Maven packages the files present inside the src/main/resources directory.
@sandeepreddy602 : Understood. I will be placing it in the src/main/resources/queryrewriter folder.
@chenlica : I will try to find a good way to unzip the file into memory on the fly.
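For readers following the folder discussion: because Maven copies src/main/resources onto the classpath, an operator can read its dictionary as a classpath resource rather than through a file-system path. The following is only a minimal sketch under assumptions. The class name DictionaryLoader, the resource name /queryrewriter/english_words.txt, and the one-word-per-line file format are all hypothetical and do not come from the TextDB codebase.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class DictionaryLoader {

    // Hypothetical resource path; the real folder and file name may differ.
    static final String EXAMPLE_RESOURCE = "/queryrewriter/english_words.txt";

    /** Reads one word per line (assumed format), trimming and lower-casing each. */
    static Set<String> readWords(InputStream in) {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    /** Looks the file up on the classpath, where Maven places src/main/resources. */
    static Set<String> loadFromClasspath(String resourcePath) {
        InputStream in = DictionaryLoader.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IllegalArgumentException("Resource not found on classpath: " + resourcePath);
        }
        return readWords(in);
    }
}
```

With this shape, a call such as DictionaryLoader.loadFromClasspath(DictionaryLoader.EXAMPLE_RESOURCE) would work the same way from the IDE, from a Maven test run, and from a packaged jar, which is the main reason to prefer a classpath lookup over a relative file path.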
@shiladityasen : don't spend more than 1 hour on the "zip" issue since it's NOT important.
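As a rough illustration of the "unzip on the fly" idea, the sketch below decompresses a dictionary while reading it, so no uncompressed copy is written to disk. It uses gzip (java.util.zip.GZIPInputStream) rather than a .zip archive, which is a substitution made here because gzip suits a single file. The class name GzipDictionaryReader and the one-word-per-line format are assumptions, not something taken from the thread or the TextDB code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipDictionaryReader {

    /** Decompresses a gzip stream while reading and returns its non-empty lines. */
    static List<String> readLines(InputStream gzipped) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Helper for demos and tests: gzip-compresses text into a byte array. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed at this point, so the trailer has been written.
        return bytes.toByteArray();
    }
}
```

The input stream passed to readLines could come from a classpath lookup of a compressed resource, so the only change for callers would be the file extension and one extra stream wrapper.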
This is the current status of this issue:
- QueryRewriter was completed in https://github.com/TextDB/textdb/pull/75
- QueryRewriter was completed in https://github.com/TextDB/textdb/pull/83
This is what is up next:
Team 2: Any update on your task? Thanks for agreeing to present the dynamic programming algorithm today. It will be good to create a Google presentation for your work.
Hello Professor,
We are deciding the basic layout of the presentation and some examples to illustrate the working and where it fits in our application.
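The thread does not spell out the dynamic programming algorithm, so the following is only a generic sketch of dictionary-based word segmentation (the classic "word break" formulation), not the team's actual FuzzyTokenizer. The class name WordBreakSketch, the method segment, and the sample words are hypothetical. It returns a single segmentation, whereas the real module may enumerate several.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class WordBreakSketch {

    /**
     * Splits a term with no spaces into dictionary words, or returns an
     * empty list if the whole term cannot be covered by dictionary words.
     */
    static List<String> segment(String term, Set<String> dictionary) {
        int n = term.length();
        // prev[i] is the start index of the last word in a valid split of
        // the prefix term[0, i), or -1 if that prefix cannot be split.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially splittable

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (prev[start] != -1 && dictionary.contains(term.substring(start, end))) {
                    prev[end] = start;
                    break; // one valid split per prefix is enough for this sketch
                }
            }
        }
        if (n == 0 || prev[n] == -1) {
            return Collections.emptyList();
        }

        // Walk the back-pointers from the end of the term to recover the words.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(term.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }
}
```

For an illustrative dictionary containing "new", "york", and "city", calling segment on "newyorkcity" yields the three words in order. The table has one entry per prefix, so the sketch runs in time quadratic in the term length (ignoring substring costs), which is acceptable for single search terms.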
@shiladityasen and @kishore-narendran when can you raise a PR to finish it?
@chenlica : Professor, we intend to raise a PR later tonight or by tomorrow morning.
https://github.com/TextDB/textdb/pull/120
The above PR implements:
@chenlica We completed the documentation, and this can be found at https://github.com/TextDB/textdb/wiki/CS290-2016S-Task:-Query-Rewriter. Do you think we can proceed to close this issue?
I will review it first. Thanks.
I reviewed and polished your wiki page. Now you can close the issue.
Congratulations!
Thank you so much, Professor :)
Team 2:
Please do the following for your task.
Add @kishore-narendran to this issue.