DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

COVER empty results #61

Closed Erlaad closed 7 years ago

Erlaad commented 7 years ago

During testing for the resolution of issue #14, the user ran the attached queries (compacted for convenience; the commented-out commands are alternatives to each other). The results are empty samples, which is inconsistent with the query: the input samples are non-empty, so COVER(1,ANY) should report some results, no matter how scattered.
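To make the expectation concrete, here is a minimal Python sketch of what COVER(1,ANY) is expected to do in a simplified model: merge all input regions into maximal runs covered by at least one region. This is not GMQL's implementation; the function name `cover_1_any`, the tuple layout, and the 0-based half-open convention are assumptions made for illustration.

```python
from collections import defaultdict


def cover_1_any(samples):
    """Simplified model of COVER(1, ANY): union of all regions per chromosome.

    Each sample is a list of (chrom, start, stop) tuples, assumed to use
    0-based, half-open coordinates, so a region spans stop - start bases.
    """
    by_chrom = defaultdict(list)
    for sample in samples:
        for chrom, start, stop in sample:
            if stop > start:  # a region with stop == start spans no bases
                by_chrom[chrom].append((start, stop))

    result = []
    for chrom in sorted(by_chrom):
        merged = []
        for start, stop in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching: extend the current run.
                merged[-1][1] = max(merged[-1][1], stop)
            else:
                merged.append([start, stop])
        result.extend((chrom, s, e) for s, e in merged)
    return result


# Hypothetical input: two samples with overlapping regions on chr1.
example_samples = [
    [("chr1", 100, 200)],
    [("chr1", 150, 300), ("chr2", 10, 20)],
]
covered = cover_1_any(example_samples)
```

Under this model, any input region that spans at least one base yields at least one output region, which is why an empty result looks wrong.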

Note that the query completed successfully, that the user tried both output formats (likely irrelevant, but included for completeness' sake), and that the user also tried with and without the aggregate command.

Attachment: 20170718_null_average_tests.gmql.txt

akaitoua commented 7 years ago

I tested the data on my local machine and the result was complete; it took less than a minute to finish. The problem is in the Spark configuration on the server. From the log I can see that an RPC connection was closed because of limited resources for the job. @Erlaad, please try changing the configuration and test again.

WARN [Dispatcher] Message RemoteProcessDisconnected(131.175.120.18:33012) dropped. RpcEnv already stopped.

akaitoua commented 7 years ago

The main cause of the error has been identified: the data that Perna is using is "one-based" data stored in BED format. We automatically treat all BED files as zero-based, and GTF and VCF files as one-based. So when we reach the COVER operation, it produces no result: our system is zero-based, and the data below has start and stop at the same position, so each region cancels itself.

Possible solution: select the data, store it as GTF, then load the GTF as your dataset and work on it. The bug should also be solved in the dataset schema by adding the numbering system to the schema (Olya is working on this).

42 chr19 11665451 11665451 - cg03649060 null ELOF1 84337
43 chr19 59055127 59055127 + cg03651573 null TRIM28 10155
44 chr10 69835287 69835287 - cg03652343 0.0305373103969918 HERC4 26091
45 chr2 172967014 172967014 - cg03657766 null DLX2 1746
46 chr3 184081400 184081400 + cg00971396 0.0116462679338981 POLR2H 5437
47 chr7 994666 994666 - cg00969446 0.552249816759621 ADAP1 11033
48 chr9 117093052 117093052 + cg00969271 0.831308275444459 ORM2 5005
49 chrX 101975738 101975738 + cg00968475 0.0708482862260768 BHLHB9 80823
50 chr7 113558794 113558794 - cg00967316 0.559054576745751 PPP1R3A 5506
51 chr15 74753785 74753785 - cg00964109 0.0122457713801143 UBL7 84993
52 chrX 52651887 52651887 + cg00962799 0.8454607615888 SSX8 280659
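The effect can be sketched in a few lines of Python. This is an illustration only, assuming the zero-based reading is half-open (start inclusive, stop exclusive) and the one-based reading is closed (both ends inclusive); the helper names are hypothetical and are not part of GMQL.

```python
def length_zero_based_half_open(start, stop):
    """Number of bases in [start, stop) under a 0-based, half-open reading."""
    return max(0, stop - start)


def one_based_closed_to_zero_based_half_open(start, stop):
    """Convert a 1-based, closed region [start, stop] to 0-based, half-open."""
    return start - 1, stop


# Coordinates taken from the chr19 / cg03649060 row above.
start, stop = 11665451, 11665451

# Read as zero-based, the region spans no bases.
misread_length = length_zero_based_half_open(start, stop)

# Read as one-based and converted first, it spans exactly one base.
converted = one_based_closed_to_zero_based_half_open(start, stop)
converted_length = length_zero_based_half_open(*converted)

print(misread_length, converted, converted_length)
```

Under these assumptions, every row above is a single-base probe position that collapses to an empty region when its coordinates are read as zero-based.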

Erlaad commented 7 years ago

The problem is still happening (tested today on TCGA data), so how is it closed? I'll try the workaround now.

Test query:

RAW = SELECT(clinical_follow_up__tumor_status == "with tumor" AND clinical_follow_up__new_tumor_event_type == "distant metastasis" AND clinical_patient__metastatic_site == "lung") HG19_TCGA_dnamethylation;
TEST = COVER(1,ANY; aggregate: new_beta_value AS AVG(beta_value)) RAW;

MATERIALIZE RAW into raw;
MATERIALIZE TEST into test;

Edit: still not working after the workaround. I MATERIALIZEd into GTF and tried the query again, with the same result. I suggest reopening the bug.

akaitoua commented 7 years ago

I checked the data generated by your query. The GTF data produced by the workaround has a major error (start is greater than stop). This is related to [Issue #27]; once that is solved, this issue should be resolved as well and can be closed.

OlgaGorlova commented 7 years ago

Issue #27 is fixed now. Currently, the following workaround should work: select the data, store it as TAB, then open the schema file and change the value of the coordinate_system attribute to "1-based". Load your dataset and work on it.
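For anyone who prefers to script the schema edit, a minimal Python sketch follows. Only the attribute name coordinate_system and the value "1-based" come from this thread; the element names in `EXAMPLE_SCHEMA` are hypothetical placeholders, so check them against your actual schema file. The function updates whichever elements carry the attribute, so it does not depend on those names.

```python
import xml.etree.ElementTree as ET


def set_coordinate_system(schema_xml, value="1-based"):
    """Return schema_xml with every coordinate_system attribute set to value."""
    root = ET.fromstring(schema_xml)
    changed = 0
    for elem in root.iter():
        if "coordinate_system" in elem.attrib:
            elem.set("coordinate_system", value)
            changed += 1
    if changed == 0:
        raise ValueError("no element with a coordinate_system attribute found")
    return ET.tostring(root, encoding="unicode")


# Hypothetical schema content; the real file layout may differ.
EXAMPLE_SCHEMA = (
    '<schemaCollection>'
    '<schema type="tab" coordinate_system="0-based">'
    '<field type="STRING">name</field>'
    '</schema>'
    '</schemaCollection>'
)

updated = set_coordinate_system(EXAMPLE_SCHEMA)
print(updated)
```

If the real schema file declares an XML namespace, ElementTree may rewrite the namespace prefixes on output, so compare the result with the original before loading the dataset.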

acanakoglu commented 7 years ago

I also updated the XSD file that controls the XML for the web interface, so you don't need the trick anymore. @OlgaGorlova, could you please test it?

OlgaGorlova commented 7 years ago

Thanks, @acanakoglu. It is working now: the schema is correctly recognised when loading the dataset.

Erlaad commented 7 years ago

I tested with the query that failed in #14:

RAW = SELECT(clinical_follow_up__tumor_status == 'with tumor' AND manually_curated__dataType == 'dnamethylation27' AND clinical_follow_up__new_tumor_event_type == 'distant metastasis') HG19_TCGA_dnamethylation;
TEST = COVER(1,ANY; aggregate: new_beta_value AS AVG(beta_value)) RAW;

MATERIALIZE RAW into raw;
MATERIALIZE TEST into test;

The suggested workaround seems to work. The problem now seems to be in the way the original dataset is handled (it is a tab-delimited dataset that is actually 0-based). This bug can be closed from my side.

To be tested further once issue GMQL-web #53 is closed.

Erlaad commented 7 years ago

Tested again today on the same query, issue seems resolved.

~ Stefano