Toseph closed this issue 4 years ago.
Can you please send me the command you ran to generate that index?
Hi Hanars,
I generated the index with the usual parameters but upped the driver memory and executor memory. This worked successfully for 3 of my other vcf files, but not the final 3.
python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --use-nested-objects-for-vep --cpu-limit 8 $INPUT_VCF
As an example from one of my capture logs,
[thasan@seqr03 hail_elasticsearch_pipelines]$ cat GEN14-05M-D.log | grep creating
2019-09-13 07:16:41,141 INFO ==> creating elasticsearch index r0001_project1__wes__grch37__variants__20190912
The index dated 20190912 and the one in the original screenshot dated 20190909 are both throwing the same error.
You need to also include --use-nested-objects-for-genotypes when you run the pipeline.
Hi @hanars, sorry for the delay between replies. That line did fix adding GEN14-04-D.genotyped.gatk3.5.vcf.gz, but should I remove it to upload parent data from GEN14-04M-D.genotyped.gatk3.5.vcf.gz and GEN14-04F-D.genotyped.gatk3.5.vcf.gz or would it help to use "--use-child-docs-for-genotypes" in this case on all three and re-upload them?
I realize the sample names in the screenshots don't match up, but that's the error I get regardless of which parent in 04 or 05 I upload against their child's data.
All individuals in a family need to be included in the same index. You can use a comma-separated list of files as your input if you have multiple VCFs representing one family. We joint call all our samples and start with a single VCF with all samples.
@hanars, where does this comma-separated list come into play? Would it be during the upload on CLI,
e.g.
python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --use-nested-objects-for-vep --use-nested-objects-for-genotypes --cpu-limit 8 file1.vcf, file2.vcf file3.vcf
or should I place the three file names in a file such as "input.list" and pass input.list as a parameter?
Alternatively, is this what the ID Mapping File Path section is used for on the web GUI? I know this ticket has gone beyond the scope of the original issue, but I haven't really found documentation clarifying how to upload multiple VCFs belonging to one family.
that example looks correct, except you wouldn't put spaces between the files so it would be file1.vcf,file2.vcf,file3.vcf
The ID mapping file is if you have a mismatch between the IDs in your VCF and the ones in seqr
@hanars I'm not sure what I'm doing wrong here. I can pass in a single file as an argument while running in local-mode, but it won't accept multiple files on the line. Just to mask it better for git, I copied the VCF file names over as file1.vcf.gz, file2.vcf.gz, file3.vcf.gz
The error I am getting now is "HailException: arguments refer to no files", even when I use the full path to each vcf.gz file. Passing a single file as the argument works, but when I pass multiple comma-separated files I also have to set a prefix for "--output-vds", and that run hits the no-files error.
See my bash script below. I tried running it not as a script and it fails the same way.
```shell
#!/bin/bash
GENOME_VERSION="37"            # should be "37" or "38"
SAMPLE_TYPE="WES"              # can be "WES" or "WGS"
DATASET_TYPE="VARIANTS"        # can be "VARIANTS" (for GATK VCFs) or "SV" (for Manta VCFs)
PROJECT_GUID="R0001_project1"  # should match the ID in the url of the project page
INPUT_VCF=file1.vcf.gz,file2.vcf.gz,file3.vcf.gz

python2.7 gcloud_dataproc/submit.py --run-locally --driver-memory 40G --executor-memory 40G hail_scripts/v01/load_dataset_to_es.py --spark-home $SPARK_HOME --genome-version $GENOME_VERSION --project-guid $PROJECT_GUID --sample-type $SAMPLE_TYPE --dataset-type $DATASET_TYPE --skip-validation --exclude-hgmd --vep-block-size 100 --es-block-size 10 --num-shards 1 --hail-version 0.1 --output-vds "GEN14" --use-nested-objects-for-vep --use-nested-objects-for-genotypes --cpu-limit 8 $INPUT_VCF
```
Then it fails out as such
```
[thasan@seqr03 hail_elasticsearch_pipelines]$ ./multi-upload.sh
/usr/local/seqr/seqr/../bin/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master local[8] --driver-memory 40G --executor-memory 40G --num-executors 10 --conf spark.driver.extraJavaOptions=-Xss4M --conf spark.executor.extraJavaOptions=-Xss4M --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=30g --conf spark.kryoserializer.buffer.max=1g --conf spark.memory.fraction=0.1 --conf spark.default.parallelism=1 --jars hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.driver.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --conf spark.executor.extraClassPath=hail_builds/v01/hail-v01-10-8-2018-90c855449.jar --py-files hail_builds/v01/hail-v01-10-8-2018-90c855449.zip "hail_scripts/v01/load_dataset_to_es.py" "--genome-version" "37" "--project-guid" "R0001_project1" "--sample-type" "WES" "--dataset-type" "VARIANTS" "--skip-validation" "--exclude-hgmd" "--vep-block-size" "100" "--es-block-size" "10" "--num-shards" "1" "--output-vds" "GEN14" "--use-nested-objects-for-vep" "--use-nested-objects-for-genotypes" "file1.vcf.gz,file2.vcf.gz,file3.vcf.gz" --username 'thasan' --directory 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines'
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Requirement already satisfied: elasticsearch in /usr/lib/python2.7/site-packages (7.0.4)
Requirement already satisfied: urllib3>=1.21.1 in /usr/lib/python2.7/site-packages (from elasticsearch) (1.25.3)
WARNING: You are using pip version 19.2.3, however version 19.3.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command.
2019-10-24 13:25:44,312 INFO Index name: r0001_project1__wes__grch37__variants__20191024
2019-10-24 13:25:44,312 INFO Command args: /usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py --index r0001_project1__wes__grch37__variants__20191024
2019-10-24 13:25:44,315 INFO Parsed args:
{'cpu_limit': None, 'create_snapshot': False, 'dataset_type': 'VARIANTS', 'directory': 'seqr03.nygenome.org:/usr/local/seqr/seqr/hail_elasticsearch_pipelines', 'discard_missing_genotypes': False, 'dont_delete_intermediate_vds_files': False, 'dont_update_operations_log': False, 'es_block_size': 10, 'exclude_1kg': False, 'exclude_cadd': False, 'exclude_clinvar': False, 'exclude_dbnsfp': False, 'exclude_eigen': False, 'exclude_exac': False, 'exclude_gene_constraint': False, 'exclude_gnomad': False, 'exclude_gnomad_coverage': False, 'exclude_hgmd': True, 'exclude_mpc': False, 'exclude_omim': False, 'exclude_primate_ai': False, 'exclude_splice_ai': False, 'exclude_topmed': False, 'exclude_vcf_info_field': False, 'export_vcf': False, 'fam_file': None, 'family_id': None, 'filter_interval': '1-MT', 'genome_version': '37', 'host': 'localhost', 'ignore_extra_sample_ids_in_tables': False, 'ignore_extra_sample_ids_in_vds': False, 'index': 'r0001_project1__wes__grch37__variants__20191024', 'individual_id': None, 'input_dataset': 'file1.vcf.gz,file2.vcf.gz,file3.vcf.gz', 'max_samples_per_index': 250, 'not_gatk_genotypes': False, 'num_shards': 1, 'only_export_to_elasticsearch_at_the_end': False, 'output_vds': 'GEN14', 'port': '9200', 'project_guid': 'R0001_project1', 'remap_sample_ids': None, 'sample_type': 'WES', 'skip_annotations': False, 'skip_validation': True, 'skip_vep': False, 'skip_writing_intermediate_vds': False, 'start_with_sample_group': 0, 'start_with_step': 0, 'stop_after_step': 1000, 'subset_samples': None, 'use_child_docs_for_genotypes': False, 'use_nested_objects_for_genotypes': True, 'use_nested_objects_for_vep': True, 'use_temp_loading_nodes': False, 'username': 'thasan', 'vep_block_size': 100}
2019-10-24 13:25:44,315 INFO ==> create HailContext
Running on Apache Spark version 2.0.2
SparkUI available at http://10.1.27.167:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-105a497
2019-10-24 13:25:46,449 INFO is_running_locally = True
2019-10-24 13:25:46,449 INFO
=============================== pipeline - step 0 - run vep ===============================
2019-10-24 13:25:46,449 INFO ==> import: file1.vcf.gz,file2.vcf.gz,file3.vcf.gz
2019-10-24 13:25:46 Hail: WARN: 'file1.vcf.gz,file2.vcf.gz,file3.vcf.gz' refers to no files
Traceback (most recent call last):
  File "/usr/local/seqr/seqr/hail_elasticsearch_pipelines/hail_scripts/v01/load_dataset_to_es.py", line 900, in <module>
Java stack trace: is.hail.utils.HailException: arguments refer to no files
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
	at is.hail.utils.package$.fatal(package.scala:27)
	at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:105)
	at is.hail.HailContext.importVCFs(HailContext.scala:544)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Hail version: 0.1-105a497
Error summary: HailException: arguments refer to no files
Traceback (most recent call last):
  File "gcloud_dataproc/submit.py", line 99, in <module>
```
Hmm try wildcard syntax? I.e. "file*.vcf.gz"
@hanars, I started it with the wildcard, but input_dataset only refers to one file at the moment: 'input_dataset': 'file1.vcf.gz'
I'll let it run tonight and get back to you. Thanks as always for keeping an eye on this!
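One way a wildcard can collapse to a single file like that (an assumption, not confirmed in this thread): if the pattern is unquoted, the shell expands it before submit.py ever sees it, so only the first match may land in input_dataset. Quoting the pattern passes it through literally for Hail to glob. A quick demonstration with placeholder files:

```shell
# Create placeholder files, then compare unquoted vs quoted glob arguments.
touch file1.vcf.gz file2.vcf.gz file3.vcf.gz

set -- file*.vcf.gz        # unquoted: the shell expands the pattern itself
unquoted_count=$#
first_arg=$1

set -- "file*.vcf.gz"      # quoted: the literal pattern is passed through
quoted_count=$#
quoted_arg=$1

echo "unquoted: $unquoted_count args, first=$first_arg"
echo "quoted:   $quoted_count arg, value=$quoted_arg"
```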
Hey hanars, while I was able to get seqr to accept the wildcard input, the generated ES search index still seemed to cover only a single VCF.
To test this another way, are there publicly available family VCF datasets I can download and test on our deployment?
Alternatively, is there a way I can test uploading this to broad's seqr portal to ensure it isn't our deployment at fault?
I'm also wondering whether hail 0.2 will have better support for multi-VCF upload, even though it doesn't currently support local installs (but I could be wrong about that).
We have gs://seqr-reference-data/test-projects/1kg.vcf.gz as a publicly available dataset. I would recommend you make a new test project; you can then get the pedigree info set by using "Bulk upload individual" with this file: gs://seqr-reference-data/test-projects/1kg.ped. Running the pipeline on that VCF should be fine, so that is a good way to test whether things are working. There is no way I can grant you access to the Broad's seqr, so unfortunately that's not an option.
Hail 0.2 should have better support for multi-project VCFs. If you want to be the guinea pig, it may work. There's a branch here that has the updates you should need, but full warning, it's still in progress: https://github.com/macarthur-lab/seqr/tree/local_hail_v02. To use it you would need to pull from that branch, re-run deploy/install_local.step8.install_pipeline_runner.sh from that branch, and then the command you would use to run the pipeline is
python3 -u gcloud_dataproc/submit.py --run-locally --hail-version 0.2 luigi_pipeline/seqr_loading.py SeqrVCFToMTTask --local-scheduler --genome-version $GENOME_VERSION --sample-type $SAMPLE_TYPE --source-paths $INPUT_VCF --dest-path test.mt
And then I think the way you would set up the multiple VCFS is by setting
INPUT_VCF=["file1.vcf.gz","file2.vcf.gz","file3.vcf.gz"]
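If the task really does expect a Python-style list (hanars says "I think" above, so treat this as an assumption), the value would need single quotes in bash so the brackets and inner double quotes reach the script intact:

```shell
# Single-quote the whole value so bash passes it through verbatim;
# whether seqr_loading.py accepts this form is an assumption from the thread.
INPUT_VCF='["file1.vcf.gz","file2.vcf.gz","file3.vcf.gz"]'
echo "$INPUT_VCF"
```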
Hello, I need some assistance figuring out why 3 of my created indexes will not submit properly to the seqr web interface. The upload process works as far as generating an index and creating the related files, but when I attempt to edit the database and add that index, I see "it does not have a valid schema", even though the family and individual IDs seem to match.
Is there a way I can ensure the individual ID/schema match between my VCF file and what I see on seqr's web UI? Is there an example of what test individual_id and family_id CSV files should look like, so I can use that as a model for mine? I can also open up the VCF with bcftools or samtools if that will help uncover the mismatches.
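For the ID side of that match, the sample IDs the pipeline reads are just the columns after FORMAT on the VCF's #CHROM header line; `bcftools query -l` prints the same list. A self-contained sketch with a made-up header (the sample names here are hypothetical):

```shell
# Build a tiny gzipped VCF for illustration; the sample names are made up.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGEN14-04-D\tGEN14-04M-D\n' | gzip > demo.vcf.gz

# Sample IDs are columns 10 onward of the #CHROM header line.
# On a real file, `bcftools query -l demo.vcf.gz` gives the same list.
samples=$(gzip -dc demo.vcf.gz | grep -m1 '^#CHROM' | cut -f10- | tr '\t' '\n')
echo "$samples"
```

These are the names that need to line up with the Individual IDs in seqr's web UI (or be translated via the ID mapping file).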