jackgoldsmith4 closed this issue 7 years ago
I ran the loader and query with the input files you provided and made some corrections to the vid JSON: hail_vid.txt
However, there are issues with the sample VCF.
The third issue is harder to fix. Some of the lines in the VCF look like:
20 1000 A T GT:PL ./.:.
In a "correct" VCF for GenomicsDB, the number of entries in the PL field should be equal to the number of genotypes (3 in the above example). If the PL field is missing, then the line should look like:
20 1000 A T GT:PL ./.:
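The expected PL length follows from the VCF spec's rule for Number=G fields: one entry per possible genotype, i.e. C(num_alleles + ploidy - 1, ploidy) unordered genotypes. A small sketch of that check (the function name is mine, not part of GenomicsDB or Hail):

```python
from math import comb

def expected_pl_length(num_alt_alleles: int, ploidy: int = 2) -> int:
    """Number of genotypes for a 'Number=G' field (e.g. PL) per the VCF spec:
    C(num_alleles + ploidy - 1, ploidy), counting unordered genotypes."""
    num_alleles = num_alt_alleles + 1  # REF plus the ALT alleles
    return comb(num_alleles + ploidy - 1, ploidy)

# One ALT allele, as in the line above (REF=A, ALT=T): 3 PL entries expected.
print(expected_pl_length(1))  # 3
```

With two ALT alleles the same formula gives 6 entries, which is why a validator cannot just hard-code the count.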
The issue arises because of an ambiguity in the VCF spec about what counts as a missing field. For example, if a PL field contains .,.,. , is it a missing field, or a valid field whose values are missing/unknown?
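One way to sidestep the ambiguity is to normalize such entries before import: collapse an all-missing multi-value entry like .,.,. to a single ., which GenomicsDB reads unambiguously as a missing field. A minimal sketch (the helper name is mine, and this is an illustration, not GenomicsDB's actual behavior):

```python
def normalize_missing(sample_field: str) -> str:
    """Collapse an all-missing entry such as '.,.,.' to '.', so it is
    unambiguously a missing field rather than a field of missing values.
    Leaves any entry with at least one real value untouched."""
    values = sample_field.split(",")
    return "." if all(v == "." for v in values) else sample_field

print(normalize_missing(".,.,."))     # .
print(normalize_missing("0,30,300"))  # 0,30,300
```

Applying this per sample column while rewriting the VCF would turn the problematic lines into the "correct" form shown above.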
I fixed the first two issues with the VCF, and I fixed the vid file. The import worked, thanks!
The query/read fails because of issue 3 in the imported VCF.
I noticed that you import a VCF with multiple samples into GenomicsDB. Is that the expected mode of operation for Hail, i.e., data will be imported from VCF file(s) where each file contains many (>1K) samples? Or is the common mode multiple VCF files, each containing data for a single sample?
Hail won't generate GenomicsDB files directly. The Data Sciences Data Engineering group at the Broad plans to deliver GenomicsDB files to the Hail team instead of VCFs. Currently, they deliver VCFs containing a large number of samples.
Hail will import the GenomicsDB file into our in-memory representation, on which our users can write and execute their analytical pipelines.
Can this issue be closed?
Hi. I'm working on the Hail Team at the Broad Institute, and I was trying to import a VCF into GenomicsDB, but it caused a segfault. Here is the VCF file that I tried to import. Attached are the three JSON files for this VCF. The error message that I got is below:
JSON: callsets.txt vid_mapping_file.txt loader_config_file.txt