Join produces incorrect result.

dieutth commented 6 years ago

Current join produces incorrect result with duplication. In particular for LEFT join:

n>1 samples in REF and m>1 samples in EXP then each region in result is duplicated m times
n>1 samples in REF and m=1 samples in EXP then the result is arbitrary (at least, I haven't seen the "pattern" yet)
n = 1 samples in REF and m>=1samples in EXP: correct result

Here is an example Input: REF dataset contains 2 samples, each sample contain 1 region (without values). Sample0: (chr1, 1, 3, ) Sample1: (chr1, 1, 3, ) EXP dataset contains 2 samples, each sample contain 1 region (without values). Sample0: (chr1, 2, 5, ) Sample1: (chr1, 2, 5, ) Code snippet to produce wrong result.

    val server = new GmqlServer(new GMQLSparkExecutor(sc=sc,outputFormat = GMQLSchemaFormat.TAB))
    val dsFilePath1 = "/.../tmp/ref/"
    val dsFilePath2 = "/.../tmp/exp" 

    val ds1 = server READ(dsFilePath1 ) USING (new CustomParser().setSchema(dsFilePath1))
    val ds2 = server READ(dsFilePath2 ) USING (new CustomParser().setSchema(dsFilePath2))
    var join = ds1.JOIN(None, List(new JoinQuadruple(Some(DistLess(10)))), RegionBuilder.LEFT, ds2)
    server setOutputPath("/.../result/current_system_join/") MATERIALIZE(join)
    server.run()

Expected Result: Sample0: (chr1, 1, 3, ) Sample1: (chr1, 1, 3, ) Sample2: (chr1, 1, 3, ) Sample3: (chr1, 1, 3, ) Actual Result: each sample contains 2 regions instead of 1. Sample0: (chr1, 1, 3, ), (chr1, 1, 3, ) Sample1: (chr1, 1, 3, ), (chr1, 1, 3, ) Sample2: (chr1, 1, 3, ), (chr1, 1, 3, ) Sample3: (chr1, 1, 3, ), (chr1, 1, 3, ) See attached for sample input dataset. incorrect_join.tar.gz

pp86 commented 6 years ago

Thanks @dieutth for reporting this.

It looks like there is a problem with one of the steps of the join implementation, more precisely the key of the join at line 95 only contains the id of one of the two operands.

pp86 commented 6 years ago

Hello @dieutth , I have assigned you to this issue along a mail from @akaitoua . But, first, do you agree to take care of this? If so, when you think you can deliver the fix?

I am asking, because in ten days we need to freeze this version and make it public. If you think that you won't have time to work on this now or simply don't agree in doing it :) , can you just let us know asap?

dieutth commented 6 years ago

Hi @pp86 , Sorry I missed the previous email assigned the task for me. I thought Abdo is working on it... Yes, I'm OK to work on this issue. However, I'm not sure I can fix it in time. How about I'm working on this, and if I can't fix it in X days, I'll let you know? I want X=7 but if you need a smaller X, let me know. Or if you think someone else can fix it on time for the delivery, that'd be great too. In any case, kindly let me know. Thank you very much,

pp86 commented 6 years ago

Hi @dieutth, thanks a lot for your kind and fast reply, really appreciated.

Our problem is that we need to close this week; it is not mandatory to have this fix, if you were going to deliver the changes soon we could wait for it, otherwise we can just postpone to next release; so I propose we postpone it to the next release, and, if you agree, you will help us when you'll have time. Is it ok for you?

Thank you again!

dieutth commented 6 years ago

Yeah, sure. That is okay for me :D

akaitoua commented 6 years ago

The bug is fixed. commit [a6a9cb3]

dieutth commented 6 years ago

Great :D. Can you push the latest code @akaitoua ? I already have several test cases and can do a quick check. Also, will be great for my benchmarking. Thanks,

akaitoua commented 6 years ago

@dieutth , I already pushed the commit. [a6a9cb3]

dieutth commented 6 years ago

Ah yeah, my bad. Just saw that.

dieutth commented 6 years ago

@akaitoua : the new code didn't pass the tests. The generated ID in the sample result isn't equal to the hashValue of ID from REF and EXP samples ID. Maybe you can check on this? All other values are correct, duplication is fixed correctly. It's good enough for benchmarking though. Thanks.

akaitoua commented 6 years ago

@dieutth, what tests the code did not pass? your tests? I changed the ref id to be the hash of the reference id. So the output id is hash(hash(refID),expID)

dieutth commented 6 years ago

Yeah, I have several test cases. REF samples S0 = "S_00000.gdm", S1 = "S_00001.gdm" EXP samples S0 = "S_00000.gdm", S1 = "S_00001.gdm"

Expected is S00 = hash(hash(S0), hash(S1)), but yours produces different result. Based on your last comment, it is understandable because you generate: S00 = hash(hash(hash(S0)), hash(S1)) Your change doesn't conform the specification exactly. For complex queries, I think it could make the process of tracing back input data with corresponding output a bit more complicated, but probably it is not a big problem.

acanakoglu commented 6 years ago

We test it. Currently, the output is empty. Could you please check the issue?

You can test on http://genomic.elet.polimi.it/gmql-rest-test-perf2/

akaitoua commented 6 years ago

@dieutth, Now the IDs confirms to the old way of generating IDs. resID = Hash(refID, ExpID)

dieutth commented 6 years ago

@akaitoua : yeah, the new code passed the test this time :v. GJ. @acanakoglu : Before, the result sample ID is different than the ID we receive by using STORERD, so it produces empty output dataset. I think the new code should fix the problem now, as it generates correct ID. Could you please confirm?

acanakoglu commented 6 years ago

Thank you @dieutth for your test to find this issue and then helping us to resolve the bug. @akaitoua confirmed that it is working correctly on the server and I quickly check and it is populating the result by using your example datasets.

DEIB-GECO / GMQL

Join produces incorrect result. #110