Illumina / akt

Ancestry and Kinship Tools
GNU General Public License v3.0

Problem opening input.bcf -- No debug log #30

Open George3d6 opened 4 years ago

George3d6 commented 4 years ago

I converted a VCF to BCF and tried running your tool with the following command:

./akt pca -W data/wgs.grch37.vcf.gz input.bcf

The only logs I get are:

Input: input.bcf
Using file data/wgs.grch37.vcf.gz for PCA weights
Problem opening input.bcf

I see no debug option to make this message more verbose and figure out what the issue is. Does a flag for verbose output exist?

I don't believe the problem is permission related; here's the stat output for input.bcf:

  File: input.bcf
  Size: 557747163       Blocks: 1089360    IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 10640983    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  george)   Gid: ( 1000/  george)
Access: 2020-02-10 15:23:13.371180294 +0200
Modify: 2020-02-10 15:22:44.071013086 +0200
Change: 2020-02-10 15:22:44.071013086 +0200
 Birth:

George3d6 commented 4 years ago

Note, I tried the same thing with a .vcf file and I get the exact same issue with the same amount of logs.

George3d6 commented 4 years ago

Trying this command instead:

./akt kin --force -M 1 input.bcf > kinship.txt

I now get this error message:

No frequency VCF provided (-F). Allele frequencies will be estimated from the data.
Problem opening input.bcf
Input file not found.

Which is even weirder, since the input.bcf file is most certainly present. Using absolute paths doesn't seem to help.

jaredo commented 4 years ago

Apologies for the confusing error message. AKT requires indexed files, so if you run bcftools index input.bcf these problems should go away.
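Since the tool reports only "Problem opening", it can help to check for the index yourself before running it. Below is a minimal Python sketch, assuming the index sits next to the data file under the names bcftools index normally writes (.csi by default, .tbi with -t). The helper names `find_index` and `explain_open_failure` are made up for illustration and are not part of AKT.

```python
from pathlib import Path
from typing import Optional

# Suffixes written by `bcftools index`: .csi by default, .tbi with -t.
INDEX_SUFFIXES = (".csi", ".tbi")


def find_index(variant_file: str) -> Optional[Path]:
    """Return the path of an index next to variant_file, or None if absent."""
    for suffix in INDEX_SUFFIXES:
        candidate = Path(variant_file + suffix)
        if candidate.is_file():
            return candidate
    return None


def explain_open_failure(variant_file: str) -> str:
    """Give a more specific hint than a bare 'Problem opening' message."""
    if not Path(variant_file).is_file():
        return f"{variant_file}: file does not exist"
    if find_index(variant_file) is None:
        return (
            f"{variant_file}: no .csi/.tbi index found; "
            f"try `bcftools index {variant_file}`"
        )
    return f"{variant_file}: file and index are both present"
```

Calling `explain_open_failure("input.bcf")` before `akt` separates a missing file from a missing index, which the original message does not do.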

best,

Jared

George3d6 commented 4 years ago

Hmh,

It might be that I converted to BCF poorly, since I got a different error after doing that.

However, I tried converting my original file (56001801065146A.snp.vcf) into an appropriate format via:

  1. bgzip 56001801065146A.snp.vcf
  2. bcftools index 56001801065146A.snp.vcf.gz
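The two steps above rely on bgzip rather than plain gzip: bcftools index needs BGZF-compressed input, and a file compressed with ordinary gzip cannot be indexed. The following Python sketch checks whether a file starts with a BGZF block by looking for the "BC" extra subfield in the gzip header. The function name `is_bgzf` is hypothetical, and the check only inspects the first block.

```python
import struct


def is_bgzf(path: str) -> bool:
    """Return True if the file begins with a BGZF block (as written by bgzip)."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # gzip magic (1f 8b), deflate method (8), and the FEXTRA flag (0x04).
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        if not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for SI1='B', SI2='C' with a 2-byte payload.
    offset = 0
    while offset + 4 <= len(extra):
        si1, si2 = extra[offset], extra[offset + 1]
        (slen,) = struct.unpack("<H", extra[offset + 2:offset + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        offset += 4 + slen
    return False
```

If `is_bgzf("56001801065146A.snp.vcf.gz")` returns False, the file was most likely compressed with plain gzip and should be recompressed with bgzip before indexing.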

However, upon running ./akt pca -W data/wgs.grch37.vcf.gz 56001801065146A.snp.vcf.gz I now get the error:

Input: 56001801065146A.snp.vcf.gz
Using file data/wgs.grch37.vcf.gz for PCA weights
1 samples
Using 20 PCs from input file.
0/17491 of sites were in 56001801065146A.snp.vcf.gz
ERROR: less that 90% of sites in data/wgs.grch37.vcf.gz were NOT in data/wgs.grch37.vcf.gz

(Same issue if I use --assume-homref)

Is this to be expected if my VCF file only contains full genome sequence data and not mitochondrial DNA data?

It does contain 150 or so SNPs that are Y-chromosome haplogroup related, so I assumed this would be correct.

Or might there be something wrong with the way I did my indexing?

jaredo commented 4 years ago

wgs.grch37.vcf.gz contains 17,491 common autosomal variants that should be detected in any high coverage whole genome sequenced human (excluding homozygous reference). It won't matter if MT/X/Y variants are in your VCF; they will just be ignored.

What reference genome are you using? It needs to be consistent with the version used by the -W VCF. There are loading VCFs included for hg19 and hg38 (both with and without the chr prefix).
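One way to check the build and the chr-prefix convention is to read the VCF header. The Python sketch below looks for a `##reference=` line and for the chromosome 1 `##contig` entry, then guesses the build from its length (249250621 in GRCh37/hg19, 248956422 in GRCh38/hg38). The names `read_header`, `summarize_build`, and `CHR1_LENGTHS` are illustrative only. Many VCFs omit these header lines, in which case the sketch simply returns None for those fields.

```python
import gzip
import re
from typing import Dict, List, Optional

# Length of chromosome 1 in the two common human builds.
CHR1_LENGTHS = {249250621: "GRCh37/hg19", 248956422: "GRCh38/hg38"}


def read_header(path: str) -> List[str]:
    """Collect the '##' meta lines from a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    lines = []
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break
            lines.append(line.rstrip("\n"))
    return lines


def summarize_build(header_lines: List[str]) -> Dict[str, Optional[object]]:
    """Extract the reference string, chr-prefix usage, and a build guess."""
    summary = {"reference": None, "chr_prefix": None, "build_guess": None}
    for line in header_lines:
        if line.startswith("##reference="):
            summary["reference"] = line.split("=", 1)[1]
            continue
        match = re.match(r"##contig=<(.*)>", line)
        if not match:
            continue
        fields = dict(
            item.split("=", 1) for item in match.group(1).split(",") if "=" in item
        )
        contig = fields.get("ID", "")
        if contig in ("1", "chr1"):
            summary["chr_prefix"] = contig.startswith("chr")
            length = fields.get("length", "")
            if length.isdigit():
                summary["build_guess"] = CHR1_LENGTHS.get(int(length))
    return summary
```

`summarize_build(read_header("56001801065146A.snp.vcf.gz"))` then indicates which of the bundled -W files is the likely match. If the header carries no contig information, comparing site overlap against each candidate file is the fallback.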

George3d6 commented 4 years ago

I am using a VCF I got from Dante Labs ~10 months ago. Is there a standard way to check the "versioning" on those? I'm not too familiar with the file format to be honest; every time I think I understand how it works, something pops up and I realize I don't.

George3d6 commented 4 years ago

I seem to have gotten matches on some sites (130) with data/wgs.hg38.vcf.gz and on 9960/17491 with data/wgs.hg19.vcf.gz.

Do you have any further documentation that explains the difference between the files and why matches might be found only on some of them?

Anyway, thanks for all the help, hopefully I can handle the rest from here :)
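For anyone comparing candidate weight files the same way, the match counts can be roughly approximated outside AKT by counting how many sites in the -W file also appear in the input. The Python sketch below matches on chromosome name and position only and reads text VCFs (plain or compressed), not BCF. This is an assumption made for illustration, not AKT's own matching logic, so the numbers may differ from what the tool prints. The names `iter_sites` and `site_overlap` are hypothetical.

```python
import gzip
from typing import Iterator, Tuple

Site = Tuple[str, int]


def iter_sites(path: str) -> Iterator[Site]:
    """Yield (chromosome, position) for each record in a text VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos = line.split("\t", 2)[:2]
            yield chrom, int(pos)


def site_overlap(weights_path: str, input_path: str) -> Tuple[int, int]:
    """Return (matched, total): weight-file sites that also occur in the input."""
    # The weight file is small, so hold it in memory and stream the input.
    weight_sites = set(iter_sites(weights_path))
    found = {site for site in iter_sites(input_path) if site in weight_sites}
    return len(found), len(weight_sites)
```

Because the chromosome name is compared as a string, a file using "chr1" never matches one using "1", and positions differ between builds. A low count against one weight file and a much higher count against another therefore points to a build or naming mismatch rather than missing data.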