bihealth / vcfpy

Python 3 library with good support for both reading and writing VCF
MIT License
93 stars 20 forks source link

Help subsetting samples #134

Open oyhel opened 5 years ago

oyhel commented 5 years ago

Description

I need to subset samples from a large VCF with approximately 100.000 samples. As the extraction is the only part of the workflow outside python it would be very nice to be able to use vcfpy for this. I read the docs and also tried to deduce how the package works from the source. However, I am coming up short and would very much appreciate some help understanding how the subsetting of samples works.

What I Did

import vcfpy
reader = vcfpy.Reader.from_path()

subset_samples = sel = ['sample123', 'sample124']
reader = vcfpy.Reader.from_path('myfile.vcf.gz', parsed_samples=subset_samples)

reader.parsed_samples returns the samples chosen to be parsed in the input: ['sample123', 'sample124']

When iterating over the records in the reader object: for record in reader: record

I get:

<vcfpy.record.UnparsedCall object at 0x7f065fd30f98>

So I am assuming that the parser respects my list of samples to be parsed. However I am struggling with getting the calls for only these samples.

My goal is to extract all markers, but only for a few samples.

Any help would be very much appreciated!

kylec commented 4 years ago

I have the same problem. I set parsed_samples for a subset of samples like you did but reader.calls still return calls for all the samples.

I got around it by finding the index of my samples.

sample_index = vcf_reader.header.samples.name_to_idx[subset_samples]
for row in reader:
  row.calls[sample_index]
holtgrewe commented 4 years ago

Hi, I think that you are looking for Record.call_for_sample.

for record in reader:
    for sample in sel:
        call = reader.call_for_sample[sample]