br1ghtyang / asterixdb

Automatically exported from code.google.com/p/asterixdb

FuzzyJoin only accepts variables as arguments, no function can be applied to its inputs #619

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
FuzzyJoinRule does not fire for the following test case:

use dataverse fuzzyjoin;

set simthreshold '.5f';

for $dblp in dataset('DBLP')
for $dblp2 in dataset('DBLP')
where word-tokens($dblp.title) ~= word-tokens($dblp2.title) and $dblp.id < $dblp2.id
order by $dblp.id, $dblp2.id
return {'dblp': $dblp, 'dblp2': $dblp2}

The current FuzzyJoinRule bails out because of the "word-tokens" function. The
rule automatically injects a tokenizer when an input is a string variable, and
it works correctly when an input is a list of strings. This query does not work
because the rule bails out whenever an input is anything other than a variable;
in this particular case each input is a "Scalar Function" call.

Original issue reported on code.google.com by icetin...@gmail.com on 23 Aug 2013 at 7:57

GoogleCodeExporter commented 8 years ago
I managed to run this query using FuzzyJoinRule instead of Nested Loop Join. 
However, it returns the following extra record, which is not in the expected 
result set:

{ "dblp": { "id": 21,
            "dblpid": "books/acm/kim95/MengY95",
            "title": "Query Processing in Multidatabase Systems.",
            "authors": "Weiyi Meng Clement T. Yu",
            "misc": "2002-01-03 551-572 1995 Modern Database Systems db/books/collections/kim95.html#MengY95" },
  "dblp2": { "id": 24,
             "dblpid": "books/acm/kim95/OzsuB95",
             "title": "Query Processing in Object-Oriented Database Systems.",
             "authors": "M. Tamer Özsu José A. Blakeley",
             "misc": "2002-01-03 146-174 1995 Modern Database Systems db/books/collections/kim95.html#OzsuB95" } }

I think this record should be in the result set, since the similarity 
(intersection size / union size) of the two titles' word tokens is
  |[Query, Processing, in, Systems]|
  / |[Query, Processing, in, Multidatabase, Systems, Object, Oriented, Database]|
  = 4 / 8 = 0.5,
which meets the simthreshold of .5f set in the query.
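
The arithmetic above can be checked with a small Python sketch. It is only an 
illustration, not AsterixDB's implementation: the word_tokens helper here is a 
hypothetical stand-in that lowercases the title and splits on any 
non-alphanumeric character, which is an assumption about how "word-tokens" 
treats the hyphen and the trailing period.

```python
import re


def word_tokens(text):
    # Hypothetical tokenizer (an assumption, not AsterixDB's word-tokens):
    # lowercase the text and keep runs of letters and digits as tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    # Jaccard similarity: intersection size divided by union size.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


t1 = word_tokens("Query Processing in Multidatabase Systems.")
t2 = word_tokens("Query Processing in Object-Oriented Database Systems.")
sim = jaccard(t1, t2)

print(sorted(t1 & t2))  # the shared tokens
print(len(t1 | t2))     # size of the union
print(sim)              # similarity of the two titles
```

Under these assumptions the pair shares 4 of 8 distinct tokens, so the 
similarity is exactly 0.5. Whether the pair qualifies then depends on whether 
the threshold comparison is inclusive (>=), which the thread does not state.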

The test did not complain when the query ran as an NL join. Does that mean we 
have a problem in both the expected results and the NL join, or am I missing 
something?

Original comment by icetin...@gmail.com on 13 Sep 2013 at 9:42

GoogleCodeExporter commented 8 years ago
Sounds like a problem, indeed!  The NL join uses functions that the compiler 
passes to it to do the joining, so there may be a related problem there as 
well.  Apparently the fuzzy functions used there are inconsistent with those 
used in this case?  BUG!

Original comment by dtab...@gmail.com on 13 Sep 2013 at 5:12

GoogleCodeExporter commented 8 years ago

Original comment by icetin...@gmail.com on 26 Sep 2013 at 12:30