cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

vcf to hdf5 tabix warning #292

Closed Homap closed 4 years ago

Homap commented 4 years ago

Hello,

Thank you for the wonderful package.

When trying to convert vcf to hdf5, using the following:

import allel

# Chromosomes to convert; each one is written to its own HDF5 group.
chr_l = ['chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10',
'chr11','chr12','chr13','chr14','chr15','chr17','chr18','chr19','chr20','chr21',
'chr22','chr23','chr24','chr25','chr26','chr27','chr28','chr1A','chrLGE22','chrZ']
for element in chr_l:
    allel.vcf_to_hdf5('../data/file.vcf.gz', '../data/file.h5.gz', compression='gzip',
                      fields='*', overwrite=True, region=element, group=element)

I get the following warning:

/pathto/envs/popgene/lib/python3.7/site-packages/allel/io/vcf_read.py:1051: UserWarning: tabix not found, falling back to scanning to region
  warnings.warn('tabix not found, falling back to scanning to region')

What does this mean, and how can I solve it?

Thank you!

alimanfoo commented 4 years ago

Hi @Homap,

UserWarning: tabix not found, falling back to scanning to region
  warnings.warn('tabix not found, falling back to scanning to region')

This means that tabix is not installed on the system you are running on, so scikit-allel has to scan through the VCF to find the data for a given chromosome, which can be slow. If you are able to install tabix, it will be faster, because scikit-allel can use tabix to jump straight to the location in the VCF file where the data for a given chromosome are stored.

If you are on Windows then there is no way to install tabix. If you are on Linux or Mac then you can install tabix in various ways, e.g., via bioconda (I think it's part of the htslib package).
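To check whether tabix is visible from the Python environment you are running in, something like the following should work. This is only a sketch: `find_tabix` is a made-up helper name, and it assumes scikit-allel looks up the `tabix` executable on your `PATH`.

```python
import shutil


def find_tabix(executable="tabix"):
    """Return the full path to the executable, or None if it is not on PATH.

    `find_tabix` is a hypothetical helper for illustration, not part of scikit-allel.
    """
    return shutil.which(executable)


path = find_tabix()
if path is None:
    print("tabix not found on PATH; region queries will fall back to scanning")
else:
    print("tabix found at", path)
```

If this reports that tabix is not found even though you installed it, the conda environment containing it is probably not the one that is active.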

Hth.

Homap commented 4 years ago

Hi,

Thank you so much. I installed tabix via bioconda and it's now under:

/pathto/envs/popgene/bin/tabix

I tried again, and I got almost the same warning:

/pathto/envs/popgene/lib/python3.7/site-packages/allel/io/vcf_read.py:1057: UserWarning: error occurred attempting tabix ([tabix] failed to load the index file.); falling back to scanning to region
  'scanning to region' % e)

Thanks a lot for your help! Homa

alimanfoo commented 4 years ago

Hi Homa, now you need to tabix index your VCF file. E.g., if I remember rightly, run:

tabix -p vcf /path/to/your/file.vcf.gz

That should create an index file. Then the scikit-allel warning should go away.
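If you would rather do the indexing from Python before the conversion loop, a rough sketch is below. `ensure_tabix_index` is a made-up helper name, not a scikit-allel function. It assumes the index sits next to the VCF with a `.tbi` suffix (the tabix default) and that the VCF was compressed with bgzip, since tabix cannot index a plain gzip file.

```python
import os
import shutil
import subprocess


def ensure_tabix_index(vcf_path, tabix="tabix"):
    """Create a tabix index for a bgzipped VCF if one does not already exist.

    Hypothetical helper for illustration. Returns the path to the .tbi index.
    """
    index_path = vcf_path + ".tbi"  # default index name written by tabix
    if os.path.exists(index_path):
        return index_path
    if shutil.which(tabix) is None:
        raise FileNotFoundError("tabix executable not found: %s" % tabix)
    # Same command as above: tabix -p vcf /path/to/your/file.vcf.gz
    subprocess.run([tabix, "-p", "vcf", vcf_path], check=True)
    return index_path


# Example usage (path taken from the original post):
# ensure_tabix_index('../data/file.vcf.gz')
```

The function only calls tabix when no index is present, so it is safe to run every time before converting.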
