broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

No Valid Schema during submission of a Created ES Index #1037

Closed Toseph closed 4 years ago

Toseph commented 4 years ago

Hello, I need some assistance figuring out why 3 of my created indexes will not submit properly to the seqr web interface. The upload process works as far as generating an index and creating the related files, but when I attempt to edit the database and add that index, I get "it does not have a valid schema" even though the family and individual IDs seem to match.

Is there a way I can ensure the individual IDs/schema match between my VCF file and what I see on seqr's web UI? Is there an example of what test individual_id and family_id CSV files should look like, so I can use that as a template for mine? I can also open up the VCF with bcftools or samtools if that will help uncover the mismatches.
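For reference, the sample IDs embedded in a VCF header can be listed with bcftools, which should make comparing against the individual IDs shown in seqr straightforward (the file name below is just a placeholder for one of mine):

```bash
# print one sample ID per line from the VCF header
bcftools query -l my.vcf.gz
```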

[Screenshot attached: Screen Shot 2019-09-25 at 1.47.15 PM]
hanars commented 4 years ago

Can you please send me the command you ran to generate that index?

Toseph commented 4 years ago

Hi Hanars, I generated the index with the usual parameters but upped the driver memory and executor memory. This worked successfully for 3 of my other vcf files, but not the final 3.

python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --use-nested-objects-for-vep --cpu-limit 8 $INPUT_VCF

As an example from one of my capture logs:

[thasan@seqr03 hail_elasticsearch_pipelines]$ cat GEN14-05M-D.log | grep creating
2019-09-13 07:16:41,141 INFO ==> creating elasticsearch index r0001_project1__wes__grch37__variants__20190912
[thasan@seqr03 hail_elasticsearch_pipelines]$

Toseph commented 4 years ago

The index dated 20190912 and the one in the original screenshot dated 20190909 are both throwing the same error.

hanars commented 4 years ago

You also need to include --use-nested-objects-for-genotypes when you run the pipeline.
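For example, the command from your earlier comment would just gain the extra flag next to the VEP one:

```bash
python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G \
  hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME \
  --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID \
  --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE \
  --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 \
  --num-shards 1 --hail-version 0.1 \
  --use-nested-objects-for-vep --use-nested-objects-for-genotypes \
  --cpu-limit 8 $INPUT_VCF
```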

Toseph commented 4 years ago

Hi @hanars, sorry for the delay between replies. That flag did fix adding GEN14-04-D.genotyped.gatk3.5.vcf.gz, but should I remove it to upload the parent data from GEN14-04M-D.genotyped.gatk3.5.vcf.gz and GEN14-04F-D.genotyped.gatk3.5.vcf.gz, or would it help to use "--use-child-docs-for-genotypes" in this case on all three and re-upload them?

[Screenshots attached: Screen Shot 2019-10-15 at 1.14.47 PM, 1.15.25 PM, 1.36.32 PM]
Toseph commented 4 years ago

I realize the sample names in the screenshots don't match up, but that's the error I get regardless of which parent (04 or 05) I upload against their child's data.

hanars commented 4 years ago

All individuals in a family need to be included in the same index. You can use a comma-separated list of files as your input if you have multiple VCFs representing one family. We joint call all our samples and start with a single VCF containing all samples.
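If re-running joint calling isn't practical, one rough way to get all three samples into a single VCF is a bcftools merge (not a substitute for true joint calling, and the file names here are placeholders):

```bash
# bcftools merge needs each input to be bgzipped and indexed
tabix -p vcf file1.vcf.gz
tabix -p vcf file2.vcf.gz
tabix -p vcf file3.vcf.gz

# combine the single-sample VCFs into one multi-sample VCF
bcftools merge -Oz -o family.vcf.gz file1.vcf.gz file2.vcf.gz file3.vcf.gz
tabix -p vcf family.vcf.gz
```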

Toseph commented 4 years ago

@hanars, where does this comma-separated list come into play? Would it be during the upload on the CLI,

e.g.

python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --use-nested-objects-for-vep --use-nested-objects-for-genotypes --cpu-limit 8 file1.vcf, file2.vcf file3.vcf

or should I place the three file names in a file such as "input.list" and pass input.list as a parameter?

Alternatively, is this what the ID Mapping File Path section on the web GUI is used for? I know this ticket has gone beyond the scope of the original issue, but I haven't really found documentation clarifying how to upload multiple VCFs belonging to one family.

hanars commented 4 years ago

That example looks correct, except you wouldn't put spaces between the files, so it would be file1.vcf,file2.vcf,file3.vcf
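For example, as a shell variable that you then pass as the input (file names are placeholders):

```bash
INPUT_VCF="file1.vcf,file2.vcf,file3.vcf"   # no spaces around the commas
```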

The ID mapping file is for cases where the IDs in your VCF don't match the ones in seqr.
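I don't remember the exact column headers offhand, but it's essentially a two-column, tab-separated mapping from the sample ID in your VCF to the individual ID in seqr, one pair per line (both columns below are placeholders):

```
VCF_SAMPLE_ID_1	SEQR_INDIVIDUAL_ID_1
VCF_SAMPLE_ID_2	SEQR_INDIVIDUAL_ID_2
```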

Toseph commented 4 years ago

@hanars I'm not sure what I'm doing wrong here. I can pass in a single file as an argument while running in local mode, but it won't accept multiple files on the command line. Just to mask the names for GitHub, I've copied the VCF file names over as file1.vcf.gz, file2.vcf.gz, file3.vcf.gz.

The error I am getting now is "HailException: arguments refer to no files", even when I use the full path to each vcf.gz file. When I pass the files individually (just one file at a time) it works, but when I pass in multiple comma-separated files (which also requires setting a prefix with "--output-vds"), I get the "refers to no files" error.

See my bash script below. I tried running it not as a script and it fails the same way.

#!/bin/bash

GENOME_VERSION="37"             # should be "37" or "38"
SAMPLE_TYPE="WES"               # can be "WES" or "WGS"
DATASET_TYPE="VARIANTS"         # can be "VARIANTS" (for GATK VCFs) or "SV" (for Manta VCFs)
PROJECT_GUID="R0001_project1"   # should match the ID in the url of the project page
INPUT_VCF=file1.vcf.gz,file2.vcf.gz,file3.vcf.gz

python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --output-vds "GEN14" --use-nested-objects-for-vep --use-nested-objects-for-genotypes --cpu-limit 8 $INPUT_VCF

Then it fails as follows:

[thasan@seqr03 hail_elasticsearch_pipelines]$ ./multi-upload.sh
/usr/local/seqr/seqr/../bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master local[8] --driver-memory 40G --executor-memory 40G --num-executors 10 --conf spark.driver.extraJavaOptions=-Xss4M --conf spark.executor.extraJavaOptions=-Xss4M --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=30g --conf spark.kryoserializer.buffer.max=1g --conf spark.memory.fraction=0.1 --conf spark.default.parallelism=1 --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--output-vds" "GEN14" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "file1.vcf.gz,file2.vcf.gz,file3.vcf.gz" --username 'thasan' --directory 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines'

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Requirement already satisfied: elasticsearch in /usr/lib/python2.7/site-packages (7.0.4)
Requirement already satisfied: urllib3>=1.21.1 in /usr/lib/python2.7/site-packages (from elasticsearch) (1.25.3)
WARNING: You are using pip version 19.2.3, however version 19.3.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command.
2019-10-24 13:25:44,312 INFO Index name: r0001_project1__wes__grch37__variants__20191024
2019-10-24 13:25:44,312 INFO Command args: /usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py --index r0001_project1__wes__grch37__variants__20191024
2019-10-24 13:25:44,315 INFO Parsed args: {'cpu_limit': None, 'create_snapshot': False, 'dataset_type': 'VARIANTS', 'directory': 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines', 'discard_missing_genotypes': False, 'dont_delete_intermediate_vds_files': False, 'dont_update_operations_log': False, 'es_block_size': 10, 'exclude_1kg': False, 'exclude_cadd': False, 'exclude_clinvar': False, 'exclude_dbnsfp': False, 'exclude_eigen': False, 'exclude_exac': False, 'exclude_gene_constraint': False, 'exclude_gnomad': False, 'exclude_gnomad_coverage': False, 'exclude_hgmd': True, 'exclude_mpc': False, 'exclude_omim': False, 'exclude_primate_ai': False, 'exclude_splice_ai': False, 'exclude_topmed': False, 'exclude_vcf_info_field': False, 'export_vcf': False, 'fam_file': None, 'family_id': None, 'filter_interval': '1-MT', 'genome_version': '37', 'host': 'localhost', 'ignore_extra_sample_ids_in_tables': False, 'ignore_extra_sample_ids_in_vds': False, 'index': 'r0001_project1__wes__grch37__variants__20191024', 'individual_id': None, 'input_dataset': 'file1.vcf.gz,file2.vcf.gz,file3.vcf.gz', 'max_samples_per_index': 250, 'not_gatk_genotypes': False, 'num_shards': 1, 'only_export_to_elasticsearch_at_the_end': False, 'output_vds': 'GEN14', 'port': '9200', 'project_guid': 'R0001_project1', 'remap_sample_ids': None, 'sample_type': 'WES', 'skip_annotations': False, 'skip_validation': True, 'skip_vep': False, 'skip_writing_intermediate_vds': False, 'start_with_sample_group': 0, 'start_with_step': 0, 'stop_after_step': 1000, 'subset_samples': None, 'use_child_docs_for_genotypes': False, 'use_nested_objects_for_genotypes': True, 'use_nested_objects_for_vep': True, 'use_temp_loading_nodes': False, 'username': 'thasan', 'vep_block_size': 100}
2019-10-24 13:25:44,315 INFO ==> create HailContext
Running on Apache Spark version 2.0.2
SparkUI available at http://10.1.27.167:4040
Welcome to Hail version 0.1-105a497
2019-10-24 13:25:46,449 INFO is_running_locally = True
2019-10-24 13:25:46,449 INFO

=============================== pipeline - step 0 - run vep ===============================
2019-10-24 13:25:46,449 INFO ==> import: file1.vcf.gz,file2.vcf.gz,file3.vcf.gz
2019-10-24 13:25:46 Hail: WARN: 'file1.vcf.gz,file2.vcf.gz,file3.vcf.gz' refers to no files
Traceback (most recent call last):
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 900, in <module>
    run_pipeline()
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 867, in run_pipeline
    hc, vds = step0_init_and_run_vep(hc, vds, args)
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 162, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 419, in step0_init_and_run_vep
    not_gatk_genotypes=args.not_gatk_genotypes,
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/utils/vds_utils.py", line 67, in read_in_dataset
    vds = hc.import_vcf(input_path, force_bgz=True, min_partitions=10000, generic=not_gatk_genotypes)
  File "", line 2, in import_vcf
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_builds/v01/hail-v01-10-8-2018-90c855449.zip/hail/java.py", line 121, in handle_py4j
hail.java.FatalError: HailException: arguments refer to no files

Java stack trace:
is.hail.utils.HailException: arguments refer to no files
  at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
  at is.hail.utils.package$.fatal(package.scala:27)
  at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:105)
  at is.hail.HailContext.importVCFs(HailContext.scala:544)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:280)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Thread.java:748)

Hail version: 0.1-105a497
Error summary: HailException: arguments refer to no files
Traceback (most recent call last):
  File "gcloud_dataproc/submit.py", line 99, in <module>
    subprocess.check_call(command, shell=True)
  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/local/seqr/seqr/../bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master local[8] --driver-memory 40G --executor-memory 40G --num-executors 10 --conf spark.driver.extraJavaOptions=-Xss4M --conf spark.executor.extraJavaOptions=-Xss4M --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=30g --conf spark.kryoserializer.buffer.max=1g --conf spark.memory.fraction=0.1 --conf spark.default.parallelism=1 --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--output-vds" "GEN14" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "file1.vcf.gz,file2.vcf.gz,file3.vcf.gz" --username 'thasan' --directory 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines' ' returned non-zero exit status 1
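As a quick sanity check (just a throwaway loop, not part of the pipeline), I can split the comma-separated list and confirm each path actually resolves on disk:

```bash
# split $INPUT_VCF on commas and stat each file
for f in $(echo "$INPUT_VCF" | tr ',' ' '); do
  ls -lh "$f"
done
```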

hanars commented 4 years ago

Hmm try wildcard syntax? I.e. "file*.vcf.gz"
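e.g. keep the quotes so your shell doesn't expand the glob into separate arguments before it reaches the pipeline (just a guess at one thing that could trip this up):

```bash
INPUT_VCF="file*.vcf.gz"
```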

Toseph commented 4 years ago

@hanars, it started running with the wildcard input, but input_dataset only refers to one file at the moment: 'input_dataset': 'file1.vcf.gz'

I'll let it run tonight and get back to you. Thanks as always for keeping an eye on this!

Toseph commented 4 years ago

Hey hanars, while I was able to get seqr to accept the wildcard input, the Elasticsearch index that was generated still seemed to only contain data from a single VCF.

To test this another way, are there publicly available family VCF datasets I can try downloading and testing on our deployment?

Alternatively, is there a way I can test uploading this to the Broad's seqr portal to make sure it isn't our deployment that's at fault?

I'm also wondering if Hail 0.2 will have better support for multi-VCF uploads, even though it doesn't support local installs currently (but I could be wrong about that).

hanars commented 4 years ago

We have gs://seqr-reference-data/test-projects/1kg.vcf.gz as a publicly available dataset. I would recommend you make a new test project, and then you can get the pedigree info set up by using "Bulk upload individual" with this file: gs://seqr-reference-data/test-projects/1kg.ped. Running the pipeline on that VCF should be fine, so that is a good way to test whether things are working. There is no way I can grant you access to the Broad's seqr, so unfortunately that's not an option.
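If your machine can't read gs:// paths directly, you can copy both files down locally first with gsutil (part of the Google Cloud SDK):

```bash
gsutil cp gs://seqr-reference-data/test-projects/1kg.vcf.gz .
gsutil cp gs://seqr-reference-data/test-projects/1kg.ped .
```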

Hail 0.2 should have better support for multiple VCFs per project. If you want to be the guinea pig, it may work. There's a branch here that has the updates you should need, but fair warning, it's still in progress: https://github.com/macarthur-lab/seqr/tree/local_hail_v02. To use it you would need to pull from that branch, re-run deploy/install_local.step8.install_pipeline_runner.sh from that branch, and then the command you would use to run the pipeline is

python3 -u gcloud_dataproc/submit.py --run-locally --hail-version 0.2  luigi_pipeline/seqr_loading.py SeqrVCFToMTTask --local-scheduler --genome-version $GENOME_VERSION --sample-type $SAMPLE_TYPE --source-paths  $INPUT_VCF --dest-path test.mt

And then I think the way you would set up the multiple VCFs is by setting

INPUT_VCF=["file1.vcf.gz","file2.vcf.gz","file3.vcf.gz"]
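Putting that together (untested, and the quoting is just my best guess so the shell doesn't mangle the brackets and inner quotes):

```bash
GENOME_VERSION="37"
SAMPLE_TYPE="WES"
INPUT_VCF='["file1.vcf.gz","file2.vcf.gz","file3.vcf.gz"]'

python3 -u gcloud_dataproc/submit.py --run-locally --hail-version 0.2 \
  luigi_pipeline/seqr_loading.py SeqrVCFToMTTask --local-scheduler \
  --genome-version $GENOME_VERSION --sample-type $SAMPLE_TYPE \
  --source-paths "$INPUT_VCF" --dest-path test.mt
```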