broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

Issue uploading VCF file with vep in step 5 #911

Closed Toseph closed 5 years ago

Toseph commented 5 years ago

During the upload phase for step 5, I am running into an error with VEP while it attempts to process the file. Below are the steps I took to run the command; the second block shows where the error appears.

GENOME_VERSION="37"        # should be "37" or "38"
SAMPLE_TYPE="WES"          # can be "WES" or "WGS"
DATASET_TYPE="VARIANTS"    # can be "VARIANTS" (for GATK VCFs) or "SV" (for Manta VCFs)
PROJECT_GUID="R0001_project1"   # should match the ID in the url of the project page
INPUT_VCF="test.germline.vcf.gz"    # local path of VCF file

python2.7 gcloud_dataproc/submit.py --run-locally hail_scripts/v01/load_dataset_to_es.py  --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation  --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --use-nested-objects-for-vep --use-nested-objects-for-genotypes $INPUT_VCF
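
I'm assuming Hail expects the .vcf.gz to be block-gzipped (bgzip) rather than plain gzip; a quick way to check that, with htslib's htsfile and bgzip on the PATH:

# report the exact compression/format of the input VCF
htsfile test.germline.vcf.gz

# if it reports plain gzip rather than BGZF, recompress with bgzip first
zcat test.germline.vcf.gz | bgzip -c > test.germline.bgz.vcf.gz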

The error appears during Stage 15:

/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit \
    --driver-memory 5G \
    --executor-memory 5G \
    --num-executors 4 \
    --conf spark.driver.extraJavaOptions=-Xss4M \
    --conf spark.executor.extraJavaOptions=-Xss4M \
    --conf spark.executor.memoryOverhead=5g \
    --conf spark.driver.maxResultSize=30g \
    --conf spark.kryoserializer.buffer.max=1g \
    --conf spark.memory.fraction=0.1 \
    --conf spark.default.parallelism=1 \
    --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar \
    --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar \
    --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar \
    --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip \
    "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "test.germline.vcf.gz" \
    --username 'root' --directory 'seqr02.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines'

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Requirement already satisfied: elasticsearch in /usr/lib/python2.7/site-packages (7.0.1)
Requirement already satisfied: urllib3>=1.21.1 in /usr/lib/python2.7/site-packages (from elasticsearch) (1.24.3)
2019-06-18 23:04:04,598 INFO     Index name: r0001_project1__wes__grch37__variants__20190618
2019-06-18 23:04:04,598 INFO     Command args:
/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py --index r0001_project1__wes__grch37__variants__20190618
2019-06-18 23:04:04,602 INFO     Parsed args:
{'cpu_limit': None,
 'create_snapshot': False,
 'dataset_type': 'VARIANTS',
 'directory': 'seqr02.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines',
 'discard_missing_genotypes': False,
 'dont_delete_intermediate_vds_files': False,
 'dont_update_operations_log': False,
 'es_block_size': 10,
 'exclude_1kg': False,
 'exclude_cadd': False,
 'exclude_clinvar': False,
 'exclude_dbnsfp': False,
 'exclude_eigen': False,
 'exclude_exac': False,
 'exclude_gene_constraint': False,
 'exclude_gnomad': False,
 'exclude_gnomad_coverage': False,
 'exclude_hgmd': True,
 'exclude_mpc': False,
 'exclude_omim': False,
 'exclude_primate_ai': False,
 'exclude_splice_ai': False,
 'exclude_topmed': False,
 'exclude_vcf_info_field': False,
 'export_vcf': False,
 'fam_file': None,
 'family_id': None,
 'filter_interval': '1-MT',
 'genome_version': '37',
 'host': 'localhost',
 'ignore_extra_sample_ids_in_tables': False,
 'ignore_extra_sample_ids_in_vds': False,
 'index': 'r0001_project1__wes__grch37__variants__20190618',
 'individual_id': None,
 'input_dataset': 'test.germline.vcf.gz',
 'max_samples_per_index': 250,
 'not_gatk_genotypes': False,
 'num_shards': 1,
 'only_export_to_elasticsearch_at_the_end': False,
 'output_vds': None,
 'port': '9200',
 'project_guid': 'R0001_project1',
 'remap_sample_ids': None,
 'sample_type': 'WES',
 'skip_annotations': False,
 'skip_validation': True,
 'skip_vep': False,
 'skip_writing_intermediate_vds': False,
 'start_with_sample_group': 0,
 'start_with_step': 0,
 'stop_after_step': 1000,
 'subset_samples': None,
 'use_child_docs_for_genotypes': False,
 'use_nested_objects_for_genotypes': True,
 'use_nested_objects_for_vep': True,
 'use_temp_loading_nodes': False,
 'username': 'root',
 'vep_block_size': 100}
2019-06-18 23:04:04,602 INFO
==> create HailContext
Running on Apache Spark version 2.0.2
SparkUI available at http://10.1.27.222:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-105a497
2019-06-18 23:04:06,880 INFO     is_running_locally = True
2019-06-18 23:04:06,880 INFO

=============================== pipeline - step 0 - run vep ===============================
2019-06-18 23:04:06,880 INFO
==> import: test.germline.vcf.gz
[Stage 0:======================================================>(224 + 2) / 226]2019-06-18 23:04:25 Hail: INFO: Multiallelic variants detected. Some methods require splitting or filtering multiallelics first.
[Stage 1:======================================================>(223 + 3) / 226]2019-06-18 23:04:30 Hail: INFO: Ordering unsorted dataset with network shuffle
[Stage 5:=====================================================>(998 + 2) / 1000]2019-06-18 23:07:11,077 INFO
==> set filter interval to: 1-MT
2019-06-18 23:07:11 Hail: INFO: interval filter loaded 987 of 1000 partitions
2019-06-18 23:07:11,109 INFO     Callset stats:
[Stage 10:=====================================================>(984 + 3) / 987]Summary(samples=1, variants=4909438, call_rate=1.000000, contigs=['X', '12', '8', '19', '4', '15', '11', '9', 'Y', '22', '13', '16', '5', '10', '21', '6', '1', '17', '14', 'MT', '20', '2', '18', '7', '3'], multiallelics=0, snps=3991177, mnps=0, insertions=446046, deletions=472215, complex=0, star=0, max_alleles=2)
2019-06-18 23:10:37,284 INFO
==> total variants: 4909438
[Stage 15:>                                                       (0 + 4) / 987]Use of uninitialized value in hash element at /usr/local/seqr/seqr/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4255, <VARS> line 1.
(the same "Use of uninitialized value in hash element" warning repeats many more times)
^X^CTraceback (most recent call last):
  File "gcloud_dataproc/submit.py", line 99, in <module>
    subprocess.check_call(command, shell=True)
  File "/usr/lib64/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib64/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
[root@seqr02 hail_elasticsearch_pipelines]# 2019-06-18 23:10:42,509 INFO     Error while receiving.
Traceback (most recent call last):
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1028, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib64/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 222, in signal_handler
    self.cancelAllJobs()
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 903, in cancelAllJobs
    self._jsc.sc().cancelAllJobs()
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o82.cancelAllJobs.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:

org.apache.spark.SparkContext.<init>(SparkContext.scala:77)
is.hail.HailContext$.configureAndCreateSparkContext(HailContext.scala:96)
is.hail.HailContext$.apply(HailContext.scala:166)
is.hail.HailContext.apply(HailContext.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:280)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)

The currently active SparkContext was created at:

org.apache.spark.SparkContext.<init>(SparkContext.scala:77)
is.hail.HailContext$.configureAndCreateSparkContext(HailContext.scala:96)
is.hail.HailContext$.apply(HailContext.scala:166)
is.hail.HailContext.apply(HailContext.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:280)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)

        at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:101)
        at org.apache.spark.SparkContext.cancelAllJobs(SparkContext.scala:2012)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)

2019-06-18 23:10:42,515 ERROR    Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/usr/local/seqr/bin/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 900, in <module>
    run_pipeline()
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 867, in run_pipeline
    hc, vds = step0_init_and_run_vep(hc, vds, args)
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 162, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 431, in step0_init_and_run_vep
    vds = run_vep(vds, genome_version=args.genome_version, block_size=args.vep_block_size)
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/utils/vds_utils.py", line 136, in run_vep
    vds = vds.vep(config="/vep/vep-gcloud-grch{}.properties".format(genome_version), root=root, block_size=block_size)
  File "<decorator-gen-464>", line 2, in vep
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_builds/v01/hail-v01-10-8-2018-90c855449.zip/hail/java.py", line 127, in handle_py4j
hail.java.FatalError: An error occurred while calling into JVM, probably due to invalid parameter types.

Java stack trace:
An error occurred while calling o81.vep
Hail version: 0.1-105a497
Error summary: An error occurred while calling into JVM, probably due to invalid parameter types.
Use of uninitialized value in hash element at /usr/local/seqr/seqr/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4255, <VARS> line 1.
^C
bw2 commented 5 years ago

Hi @Toseph, please ignore the "Use of uninitialized value in hash element" message; VEP should still complete successfully. Let me know if you're also seeing a different error.
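
If the repeated warning makes the console output hard to follow, one option (purely for readability, with the same flags as in your command above in place of the "..."):

# filter the noisy VEP.pm warnings out of the run output and keep a copy of the filtered log
python2.7 gcloud_dataproc/submit.py --run-locally hail_scripts/v01/load_dataset_to_es.py ... 2>&1 \
    | grep -v 'Use of uninitialized value in hash element' \
    | tee load_dataset.log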

Toseph commented 5 years ago

How long should a single VCF upload typically take? My machine has 8 cores and 32 GB of memory, and my VCF file is only 226 MB.

Stage 15 seems to finish and it moves on to Stage 20, but then it goes back to Stage 6 and then Stage 9.

See below:

VARS> line 1.
[Stage 15:=====================================================>(981 + 6) / 987]Use of uninitialized value in hash element at /usr/local/seqr/seqr/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4255, <VARS> line 1.
[Stage 15:=====================================================>(982 + 5) / 987]Use of uninitialized value in hash element at /usr/local/seqr/seqr/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4255, <VARS> line 1.
[Stage 15:=====================================================>(983 + 4) / 987]Use of uninitialized value in hash element at /usr/local/seqr/seqr/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4255, <VARS> line 1.
[Stage 15:=====================================================>(985 + 2) / 987]Use of uninitialized value in hash element at /usr/local/seqr/seqr/vep/ensembl-tools-release-85/scripts/variant_effect_predictor/Bio/EnsEMBL/Variation/Utils/VEP.pm line 4255, <VARS> line 1.
[Stage 15:=====================================================>(986 + 1) / 987]2019-06-26 13:35:04 Hail: INFO: vep: annotated 4909438 variants
[Stage 20:>                                                       (0 + 8) / 987]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[Stage 20 runs through (986 + 1) / 987, emitting many org.apache.parquet.hadoop.ColumnChunkPageWriteStore INFO lines for the vep annotation columns; the lines are truncated here because the Spark progress bar overwrites the console]
[Stage 3:=====================================================>(991 + 8) / 1000]2019-06-26 13:41:52 Hail: WARN: called redundant split on an already split VDS
[Stage 6:=====================================================>(999 + 1) / 1000]Struct{
[Stage 9:==>                                                    (49 + 8) / 1000]
Toseph commented 5 years ago

My run errored out after running for over 24 hours, after it reached pipeline step 2, the export to Elasticsearch.

Should I have tabix available in the PATH to avoid the <VARS> warning? I'm not really sure where to look here, but I have a 56 MB log of the upload and the steps it tried to take.

Just checked and tabix is in my PATH as a result of the install steps for seqr.
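
For reference, a check along these lines should be enough (tabix here is the htslib binary the seqr install puts on the PATH):

# confirm which tabix binary is being picked up
command -v tabix

# rebuild the .tbi index for the bgzipped VCF, just in case
tabix -f -p vcf test.germline.vcf.gz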

Toseph commented 5 years ago

This is the current error I am getting after step 2 begins:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 166, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=request_headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 344, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python2.7/httplib.py", line 1041, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1075, in _send_request
    self.endheaders(body)
  File "/usr/lib64/python2.7/httplib.py", line 1037, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 881, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 843, in send
    self.connect()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 181, in connect
    conn = self._new_conn()
  File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe7c1174810>: Failed to establish a new connection: [Errno 111] Connection refused
2019-06-28 08:00:31,943 WARNING  GET http://localhost:9200/ [status:N/A request:0.000s]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 166, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=request_headers, **kw)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/urllib3/util/retry.py", line 344, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib64/python2.7/httplib.py", line 1041, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1075, in _send_request
Toseph commented 5 years ago

Hey, just wanted to say thanks for all the input on this ticket. We have seqr working on a new VM; it turned out to be a mix of Python dependency and Elasticsearch issues that had to be worked through (the ES version was too high, and certain Python packages had to be installed at whatever version was available rather than the specific pinned ones).

I ultimately think the issue was that I tried to deploy seqr as root instead of as a regular user, on top of yum-installing ES v7, which caused issues with the Elasticsearch deployment that running it as a service did not resolve.
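
In case it helps the next person, the installed version is easy to confirm against the running node (in our case this showed the ES 7.x that yum pulled in):

# print the version of the Elasticsearch node the pipeline talks to
curl -s http://localhost:9200/ | python2.7 -c 'import json,sys; print(json.load(sys.stdin)["version"]["number"])'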