Intel-HLS / GenomicsDB


reference genome file in query.json #137

Closed jackgoldsmith4 closed 7 years ago

jackgoldsmith4 commented 7 years ago

Hi, I am wondering how the .fasta reference genome file is used in GenomicsDB. It is a large file, and the docs list it as a mandatory parameter. However, my tests pass without this file, and all I get is a warning that it could not be opened. What is the purpose of this file in GenomicsDB?

kgururaj commented 7 years ago

It's used to get the reference base when a genomic interval 'breaks' when combining VCFs (or gVCFs). Example:

File t0.vcf

chr1 100   END=500

File t1.vcf

chr1 100   END=150
chr1 600   END=700

The combined VCF records will look like:

chr1 100   END=150
chr1 151   END=500
chr1 600   END=700

Now, the reference base at position chr1:151 is unknown and can be obtained only from the reference genome. Such a scenario might occur for deletions as well (spanning deletions).
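The splitting in the example above can be sketched in a few lines of Python. This is only an illustration of the idea, not GenomicsDB's actual implementation: the function names `combine_intervals` and `starts_needing_reference` are hypothetical, intervals are assumed to be 1-based and inclusive on a single contig, and a start position is treated as "unknown" whenever no input record begins there.

```python
def combine_intervals(samples):
    """Split overlapping intervals from several files at every boundary.

    samples: list of per-file lists of (start, end) tuples,
             1-based and inclusive, all on the same contig (assumption).
    """
    intervals = [iv for sample in samples for iv in sample]
    # Every record start, and every position just past a record end, is a breakpoint.
    breakpoints = sorted({s for s, _ in intervals} | {e + 1 for _, e in intervals})
    combined = []
    for a, b in zip(breakpoints, breakpoints[1:]):
        # Keep the segment [a, b-1] only if at least one input record covers it.
        if any(s <= a and b - 1 <= e for s, e in intervals):
            combined.append((a, b - 1))
    return combined


def starts_needing_reference(samples):
    """Return combined-record starts at which no input record begins.

    No input file supplies a REF base at these positions, so it would
    have to be read from the reference genome.
    """
    known_starts = {s for sample in samples for s, _ in sample}
    return [a for a, _ in combine_intervals(samples) if a not in known_starts]


t0 = [(100, 500)]                # chr1 100 END=500
t1 = [(100, 150), (600, 700)]    # chr1 100 END=150, chr1 600 END=700

print(combine_intervals([t0, t1]))         # [(100, 150), (151, 500), (600, 700)]
print(starts_needing_reference([t0, t1]))  # [151]
```

A test set in which every combined record starts where some input record starts would return an empty list here, which is one way a run could succeed without the reference file.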

I'm surprised the program didn't fail - perhaps the test case doesn't hit the scenario described above.

jackgoldsmith4 commented 7 years ago

This makes sense. However, Hail is not planning to use GenomicsDB to combine VCFs. Thanks!