Closed dieutth closed 6 years ago
Thanks @dieutth for reporting this.
It looks like there is a problem with one of the steps of the join implementation, more precisely the key of the join at line 95 only contains the id of one of the two operands.
Hello @dieutth , I have assigned you to this issue along a mail from @akaitoua . But, first, do you agree to take care of this? If so, when you think you can deliver the fix?
I am asking, because in ten days we need to freeze this version and make it public. If you think that you won't have time to work on this now or simply don't agree in doing it :) , can you just let us know asap?
Hi @pp86 , Sorry I missed the previous email assigned the task for me. I thought Abdo is working on it... Yes, I'm OK to work on this issue. However, I'm not sure I can fix it in time. How about I'm working on this, and if I can't fix it in X days, I'll let you know? I want X=7 but if you need a smaller X, let me know. Or if you think someone else can fix it on time for the delivery, that'd be great too. In any case, kindly let me know. Thank you very much,
Hi @dieutth, thanks a lot for your kind and fast reply, really appreciated.
Our problem is that we need to close this week; it is not mandatory to have this fix, if you were going to deliver the changes soon we could wait for it, otherwise we can just postpone to next release; so I propose we postpone it to the next release, and, if you agree, you will help us when you'll have time. Is it ok for you?
Thank you again!
Yeah, sure. That is okay for me :D
The bug is fixed. commit [a6a9cb3]
Great :D. Can you push the latest code @akaitoua ? I already have several test cases and can do a quick check. Also, will be great for my benchmarking. Thanks,
@dieutth , I already pushed the commit. [a6a9cb3]
Ah yeah, my bad. Just saw that.
@akaitoua : the new code didn't pass the tests. The generated ID in the sample result isn't equal to the hashValue of ID from REF and EXP samples ID. Maybe you can check on this? All other values are correct, duplication is fixed correctly. It's good enough for benchmarking though. Thanks.
@dieutth, what tests the code did not pass? your tests? I changed the ref id to be the hash of the reference id. So the output id is hash(hash(refID),expID)
Yeah, I have several test cases. REF samples S0 = "S_00000.gdm", S1 = "S_00001.gdm" EXP samples S0 = "S_00000.gdm", S1 = "S_00001.gdm"
Expected is S00 = hash(hash(S0), hash(S1)), but yours produces different result. Based on your last comment, it is understandable because you generate: S00 = hash(hash(hash(S0)), hash(S1)) Your change doesn't conform the specification exactly. For complex queries, I think it could make the process of tracing back input data with corresponding output a bit more complicated, but probably it is not a big problem.
We test it. Currently, the output is empty. Could you please check the issue?
You can test on http://genomic.elet.polimi.it/gmql-rest-test-perf2/
@dieutth, Now the IDs confirms to the old way of generating IDs. resID = Hash(refID, ExpID)
@akaitoua : yeah, the new code passed the test this time :v. GJ. @acanakoglu : Before, the result sample ID is different than the ID we receive by using STORERD, so it produces empty output dataset. I think the new code should fix the problem now, as it generates correct ID. Could you please confirm?
Thank you @dieutth for your test to find this issue and then helping us to resolve the bug. @akaitoua confirmed that it is working correctly on the server and I quickly check and it is populating the result by using your example datasets.
Current join produces incorrect result with duplication. In particular for LEFT join:
Here is an example Input: REF dataset contains 2 samples, each sample contain 1 region (without values). Sample0: (chr1, 1, 3, ) Sample1: (chr1, 1, 3, ) EXP dataset contains 2 samples, each sample contain 1 region (without values). Sample0: (chr1, 2, 5, ) Sample1: (chr1, 2, 5, ) Code snippet to produce wrong result.
Expected Result: Sample0: (chr1, 1, 3, ) Sample1: (chr1, 1, 3, ) Sample2: (chr1, 1, 3, ) Sample3: (chr1, 1, 3, ) Actual Result: each sample contains 2 regions instead of 1. Sample0: (chr1, 1, 3, ), (chr1, 1, 3, ) Sample1: (chr1, 1, 3, ), (chr1, 1, 3, ) Sample2: (chr1, 1, 3, ), (chr1, 1, 3, ) Sample3: (chr1, 1, 3, ), (chr1, 1, 3, ) See attached for sample input dataset. incorrect_join.tar.gz