Open gregorgorjanc opened 11 months ago
The script that we would like to change eventually is
to what Ben Jeffrey and the team uses based on sgkit
Using sgkit is very simple and nicely explained at https://pystatgen.github.io/sgkit/latest/vcf.html#download-the-data - there they download some public VCF files and show how to convert these to ZARR files which sgkit works with.
Looking at https://github.com/benjeffery/tsinfer-snakemake/blob/main/steps.py, we see the code like the one below, where tsinfer.SgkitSampleData(input[0].replace("variant_mask", ""))
is called, where we guess the input is either a ZARR file and the tsinfer.SgkitSampleData()
might be doing all the magic.
HOWEVER, the pipeline at https://github.com/benjeffery/tsinfer-snakemake is quite involved so we bette talk with Ben to discuss how to take this onwards
def generate_ancestors(input, output, wildcards, config, threads): # noqa: A002
import tsinfer
import logging
import os
from pathlib import Path
logging.basicConfig(level=logging.INFO)
data_dir = Path(config["data_dir"])
sample_data = tsinfer.SgkitSampleData(input[0].replace("variant_mask", ""))
There are two parts here.
Part 1: Check sgkit documentation how to import a VCF file
This is sgkit:
We need to learn how to use sgkit, so best to check how to import a VCF file into Python and create an sgkit object - how we do this...
Part 2: How we will move towards using sgkit in this tree seq pipeline
Once we master the above part, we should be looking into how to do what Ben Jeffrey is doing at https://github.com/benjeffery/tsinfer-snakemake/tree/main