DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Hadoop dead datanode #65

Status: Closed (acanakoglu closed this issue 7 years ago)

acanakoglu commented 7 years ago

Some operations cause Hadoop data nodes to fail.

akaitoua commented 7 years ago

I cannot fix this unless the error can be reproduced.

marcomass commented 7 years ago

@akaitoua Below are two different queries that recently crashed the system, on both Genomics and Cineca. Please dedicate some time to fixing the issue, which gives a very bad impression of the system.

Query I:

CNV = SELECT(manually_curated__dataType == 'cnv' AND manually_curated__tumor_tag == 'luad') HG19_TCGA_cnv;
MIRNA_GENE = SELECT(manually_curated__dataType == 'mirnaseq' AND manually_curated__tumor_tag == 'luad') HG19_TCGA_mirnaseq_mirna;
CNV_GENE_0 = MAP(mirna_genes AS BAG(mirna_id); joinby: biospecimen_sample__bcr_sample_barcode) CNV MIRNA_GENE;
CNV_GENE = SELECT(region: count_CNV_MIRNA_GENE > 0) CNV_GENE_0;
MATERIALIZE CNV_GENE INTO CNV_GENE;

Query II:

# Select all ChIP-seq samples of cell lines which include at least one sample regarding the antibody_target TEAD4

# Selection from newer ENCODE datasets

HM_TF_rep_broad = SELECT(project == 'ENCODE' AND assembly == 'hg19' AND assay == 'ChIP-seq'
    AND (output_type == "conservative idr thresholded peaks" OR output_type == "optimal idr thresholded peaks"
      OR output_type == "peaks" OR output_type == "pseudoreplicated idr thresholded peaks"
      OR output_type == "replicated peaks" OR output_type == "stable peaks")
    AND biosample_term_name == 'HepG2') HG19_ENCODE_BROAD_NOV_2016;
HM_TF_rep_narrow = SELECT(project == 'ENCODE' AND assembly == 'hg19' AND assay == 'ChIP-seq'
    AND (output_type == "conservative idr thresholded peaks" OR output_type == "optimal idr thresholded peaks"
      OR output_type == "peaks" OR output_type == "pseudoreplicated idr thresholded peaks"
      OR output_type == "replicated peaks" OR output_type == "stable peaks")
    AND biosample_term_name == 'HepG2') HG19_ENCODE_NARROW_NOV_2016;
HM_TF_rep = UNION() HM_TF_rep_narrow HM_TF_rep_broad; ## narrowpeak left to keep peak attribute in output
MATERIALIZE HM_TF_rep into HM_TF_rep;

HM_TF_rep_good_0 = SELECT(NOT (biosample_treatments == '') AND NOT (audit_error == '') ) HM_TF_rep;

# Count the regions in each sample (with the function COUNT()) and add the value to the metadata

HM_TF_rep_good = EXTEND(_Region_number AS COUNT()) HM_TF_rep_good_0;

HM_TF_0 = COVER(1, ANY; groupby: biosample_term_name, experiment_target; aggregate: avg_signal AS AVG(signal)) HM_TF_rep_good;
HM_TF = EXTEND(_Region_number_cover AS COUNT()) HM_TF_0;
MATERIALIZE HM_TF into HM_TF;

akaitoua commented 7 years ago

@marcomass, the queries above ran to completion on the Genomic server and produced output. The error may have been resolved by the increase in resources that came with the resource levels @andreagulino added to the system.

marcomass commented 7 years ago

akaitoua tried to fix the Spark bug by:

andreagulino commented 7 years ago

@marcomass On Cineca it no longer happens since we changed the cluster manager to standalone. On Genomics it still happened after the same change, but we have now copied over the configuration used on Cineca, so Genomics should work as well.

acanakoglu commented 7 years ago

We still have the problem on the genomic server.

akaitoua commented 7 years ago

The configurations were not adequate. They have now been corrected.