Illumina / hap.py

Haplotype VCF comparison tools
Other
402 stars 122 forks source link

segmentation fault at gvcf2bed #154

Open meredith705 opened 2 years ago

meredith705 commented 2 years ago

Hello,

I'm running the hap.py v0.3.12 docker using the Genome In A Bottle HG002 vs GRCh38 truth set using the files below and getting a segmentation fault error in the gvcf2bed program.

The GIAB bottle truth set is here https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/

The vcf I'm using as truth is: HG002_GRCh38_1_22_v4.2.1_benchmark_phased_MHCassembly_StrandSeqANDTrio.vcf from their supplementary files here

High confidence bed file: HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed file from the same directory as above

The GRCh38 reference is: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz

The query vcf I'm using is created by running dipcall on two haploid assemblies of chr11 ( a small test assembly ) using the same GRCh38 assembly as a reference: paternal.dip.vcf.gz

Here is my command:

sudo docker run -it -u `id -u $USER`:`id -g $USER` -v `pwd`:/data jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/data/hg38/GIAB_HG002_GRCh38_1_22_v4.2.1_phased.vcf \
/data/wdl_scripts/paternal.dip.vcf.gz \
-f /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz \
-o /data/happy/chr11 \
--pass-only --no-roc --no-json --engine=vcfeval \
--threads=10

I get a segmentation fault error after about 2 minutes of run time.

2022-03-28 18:17:44,433 WARNING  No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Hap.py 
[W] overlapping records at chr6:29747433 for sample 0
[W] Variants that overlap on the reference allele: 5
[I] Total VCF records:         1759115
[I] Non-reference VCF records: 1759115
Segmentation fault
2022-03-28 18:19:58,760 ERROR    Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139
2022-03-28 18:19:58,763 ERROR    Traceback (most recent call last):
2022-03-28 18:19:58,763 ERROR      File "/opt/hap.py/bin/hap.py", line 529, in <module>
2022-03-28 18:19:58,764 ERROR        main()
2022-03-28 18:19:58,764 ERROR      File "/opt/hap.py/bin/hap.py", line 314, in main
2022-03-28 18:19:58,765 ERROR        conf_temp = Haplo.gvcf2bed.gvcf2bed(args.vcf1, args.ref, args.fp_bedfile, args.scratch_prefix)
2022-03-28 18:19:58,765 ERROR      File "/opt/hap.py/lib/python27/Haplo/gvcf2bed.py", line 39, in gvcf2bed
2022-03-28 18:19:58,765 ERROR        subprocess.check_call(cmdline, shell=True)
2022-03-28 18:19:58,765 ERROR      File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
2022-03-28 18:19:58,776 ERROR        raise CalledProcessError(retcode, cmd)
2022-03-28 18:19:58,776 ERROR    CalledProcessError: Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139

I really appreciate your help figuring out this issue, please let me know what other information I can provide.

Thank you in advance, Melissa

YuanfengZhang commented 2 years ago

Using fasta file from alternative sources will help you get rid of this.

KatharineME commented 2 years ago

I am also getting this error. But what is it about the fasta file thats causing this error? Any more information that can help debug this?

Thanks.

KatharineME commented 2 years ago

Did you figure this out @meredith705? Looks like we are doing something very similar.

meredith705 commented 2 years ago

@KatharineME @YuanfengZhang I have tried using a different version of GRCh38 and have still gotten the same error.

The other version is https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz

YuanfengZhang commented 2 years ago

@KatharineME @YuanfengZhang I have tried using a different version of GRCh38 and have still gotten the same error.

The other version is https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz

(1) Try GDC/BWA/GATK/GA4GH/other sources. (2) Someone knowing cpp is need to handle this.

KatharineME commented 2 years ago

Hi All, I got hap.py working without changing the reference genome. I did the following. I'm assuming that you are using rtg as the vcf comparison engine:

1) Create genome sdf with rtg

rtg format -o path_to_sdf path_to_decompressed_reference_genome`

2) Get the pkrusche/hap.py docker container with latest tag

3) Run the hap.py docker container like so

docker run --interactive --detach --tty --user root -v local_dir_with_all_needed_files:dir_in_container pkrusche/hap.py bash

4) Calling hap.py like so. Notice that the reference genome passed is decompressed. Remember that all paths in the command are paths in the container.

docker exec --interactive docker_container_id bash -c "/opt/hap.py/bin/hap.py path_to_truth_vcf path_to_query_vcf -f path_to_confident_regions_bed -r path_to_decompressed_reference_genome -o output_dir --engine-vcfeval-path path_to_rtg_folder --engine-vcfeval-template path_to_genome_sdf" 

If this doesn't work for you, I recommend trying to run the examples from the readme in the docker container.

barslmn commented 1 year ago

I had the same issue. Decompressing the reference fastq.gz and reindexing it with samtools faidx solved it for me. My other steps are here.