brentp / cyvcf2

cython + htslib == fast VCF and BCF processing
MIT License
366 stars 71 forks source link

Set a custom index file path #154

Closed IsmailM closed 3 years ago

IsmailM commented 4 years ago

Hi,

Just a quick question. is it possible to set a path to the Tabix index file?

I have tried the following:

from cyvcf2 import VCF
variant_file = VCF("vcf_file.vcf.gz")
variant_file.set_index(index_path = "anothername.tbi")
for v in variant_file('11:435345-556565'):
  print(str(v))

This gives the following message:

[E::idx_find_and_load] Could not retrieve index file for 'vcf_file.vcf.gz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cyvcf2/cyvcf2.pyx", line 442, in __call__
AssertionError: error loading tabix index for b'vcf_file.vcf.gz'

Specifically, the variant_file.set_index() completes successfully - but it seems that when calling variant_file(...), it still attempts to look for index file using the original file name.

Many thanks, Ismail Moghul

IsmailM commented 3 years ago

Hi,

Just following up on this.

Any ideas on this / recommendation on where to start debugging (I have been playing about with the VCF class code but to no avail)?

If this is fixed, do you think it would be possible to use CyVCF2 with Remote VCF files (e.g. on S3) as that is my end-goal? i.e., using different signed URLs for both the VCF and index...

brentp commented 3 years ago

you should be able use vcf.set_index($path) as described in the first comment in this issue. if that's not the case, send code to reproduce and i will have a look.

IsmailM commented 3 years ago

Hi @brentp,

Thanks for the reply.

I am able to reproduce the issue with the following:

# copy exemplar files from cyvcf tests
git clone https://github.com/brentp/cyvcf2
mkdir tmp;
cp cyvcf2/cyvcf2/tests/test.vcf.gz tmp;
cp cyvcf2/cyvcf2/tests/test.vcf.gz.tbi tmp;
cp cyvcf2/cyvcf2/tests/test.snpeff.bcf tmp;
cp cyvcf2/cyvcf2/tests/test-diff.csi tmp;

cd tmp;

python3
from cyvcf2 import VCF
import os

os.listdir('.')
# ['test.vcf.gz.tbi', 'test.vcf.gz', 'test-diff.csi', 'test.snpeff.bcf']

v1 = VCF('test.vcf.gz')
i = 0
for r in v1("1:1500-15715"):
  i += 1

print(i)
# 113

os.rename('test.vcf.gz.tbi', 'anothername.tbi')
os.listdir('.')
# ['anothername.tbi', 'test.vcf.gz', 'test-diff.csi', 'test.snpeff.bcf']

v2 = VCF('test.vcf.gz')
v2.set_index(index_path = "anothername.tbi")
j = 0
for r in v2("1:1500-15715"):
  j += 1

# [E::idx_find_and_load] Could not retrieve index file for 'test.vcf.gz'
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "cyvcf2/cyvcf2.pyx", line 436, in __call__
# AssertionError: error loading tabix index for b'test.vcf.gz'
print(j)
# 0

# Use the BCF and index in the tests:
os.listdir('.')
# ['anothername.tbi', 'test.vcf.gz', 'test-diff.csi', 'test.snpeff.bcf']

b = VCF('test.snpeff.bcf')
b.set_index(index_path = "test-diff.csi")
s = 0
for r in b("chr1:69427-69429"):
    s += 1

print (s)
# 1

Interestingly, the set_index method works with the BCF and CSI index, but not with the VCF/TBI index.

Any ideas?

brentp commented 3 years ago

I just pushed a fix for this. Thanks for the test-case.