Segfault when importing VCF

jackgoldsmith4 commented 7 years ago

Hi. I'm working on the Hail Team at the Broad Institute, and I was trying to import a VCF into GenomicsDB, but it caused a segfault. Here is the VCF file that I tried to import. Attached are the three JSON files for this VCF. The error message that I got is below:

ubuntu@ip-172-31-23-20:~/build_dir/tools$ vcf2tiledb /home/ubuntu/build_dir/jsonFiles/loader_config_file.json

[[19760,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:

Module: OpenFabrics (openib) Host: ip-172-31-23-20

Another transport will be used instead, although this may result in lower performance.

[ip-172-31-23-20:04729] Process received signal [ip-172-31-23-20:04729] Signal: Segmentation fault (11) [ip-172-31-23-20:04729] Signal code: Address not mapped (1) [ip-172-31-23-20:04729] Failing at address: 0x1488780 [ip-172-31-23-20:04729] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4582f36390] [ip-172-31-23-20:04729] [ 1] vcf2tiledb[0x507b29] [ip-172-31-23-20:04729] [ 2] vcf2tiledb[0x4d9fd2] [ip-172-31-23-20:04729] [ 3] vcf2tiledb[0x4eca80] [ip-172-31-23-20:04729] [ 4] vcf2tiledb[0x4eddd8] [ip-172-31-23-20:04729] [ 5] vcf2tiledb[0x48d704] [ip-172-31-23-20:04729] [ 6] vcf2tiledb[0x490c7f] [ip-172-31-23-20:04729] [ 7] /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0x16dfe)[0x7f458336fdfe] [ip-172-31-23-20:04729] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f4582f2c6ba] [ip-172-31-23-20:04729] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f4582c623dd] [ip-172-31-23-20:04729] End of error message Segmentation fault (core dumped)

JSON: callsets.txt vid_mapping_file.txt loader_config_file.txt

kgururaj commented 7 years ago

Ran the loader and query with the input files you provided. Made some corrections to the vid JSON. hail_vid.txt

However, there are issues with the sample VCF.

The header does not have contig "20". Added a line in the header (consistent with the vid file).
Normalized the VCF
The 3rd issue is harder to fix. Some of the lines in the VCF are similar to:

20 1000 A T GT:PL ./.:.

In a "correct" VCF for GenomicsDB, the number of entries in the PL field should be equal to the number of genotypes (3 in the above example). If the PL field is missing, then the line should look like:

20 1000 A T GT:PL ./.:

The issue arises because of the ambiguity in the VCF spec in determining what's a missing field. For example, if a PL field has .,.,., is it a missing field or a valid field with missing/unknown values?

jackgoldsmith4 commented 7 years ago

I fixed the first two issues with the VCF, and I fixed the vid file. The import worked, thanks!

kgururaj commented 7 years ago

The query/read fails because of issue 3 in the imported VCF.

I noticed that you import a VCF with multiple samples into GenomicsDB. Is that the expected mode of operation for Hail i.e. data will be imported from VCF file(s) where each file contains many (>1K) samples? Or is the common mode that you will have multiple VCF files each containing data for a single sample?

danking commented 7 years ago

Hail won't generate genomics db files directly. The Data Sciences Data Engineering group at the Broad plans to deliver genomics db files to the Hail team instead of VCFs. Currently, they deliver VCFs containing a great number of samples.

Hail will import the genomics db file into our in-memory representation on which our users can write execute their analytical pipelines.

kgururaj commented 7 years ago

Can this issue be closed?

Intel-HLS / GenomicsDB

Segfault when importing VCF #123