HighlanderLab / tree_seq_pipeline

Pipeline to infer tree sequences with different datasets
MIT License
3 stars 7 forks source link

Check sgkit documentation how to import a VCF file & how we will move towards using sgkit in this tree seq pipeline #55

Open gregorgorjanc opened 11 months ago

gregorgorjanc commented 11 months ago

There are two parts here.

Part 1: Check sgkit documentation how to import a VCF file

This is sgkit:

We need to learn how to use sgkit, so best to check how to import a VCF file into Python and create an sgkit object - how we do this...

Part 2: How we will move towards using sgkit in this tree seq pipeline

Once we master the above part, we should be looking into how to do what Ben Jeffrey is doing at https://github.com/benjeffery/tsinfer-snakemake/tree/main

AprilYUZhang commented 11 months ago

The script that we would like to change eventually is

https://github.com/HighlanderLab/tree_seq_pipeline/blob/main/Snakemake/workflow/scripts/Prepare_tsinfer_sample_file.py

to what Ben Jeffrey and the team uses based on sgkit

AprilYUZhang commented 11 months ago

Using sgkit is very simple and nicely explained at https://pystatgen.github.io/sgkit/latest/vcf.html#download-the-data - there they download some public VCF files and show how to convert these to ZARR files which sgkit works with.

AprilYUZhang commented 11 months ago

Looking at https://github.com/benjeffery/tsinfer-snakemake/blob/main/steps.py, we see the code like the one below, where tsinfer.SgkitSampleData(input[0].replace("variant_mask", "")) is called, where we guess the input is either a ZARR file and the tsinfer.SgkitSampleData() might be doing all the magic.

HOWEVER, the pipeline at https://github.com/benjeffery/tsinfer-snakemake is quite involved so we bette talk with Ben to discuss how to take this onwards

def generate_ancestors(input, output, wildcards, config, threads):  # noqa: A002
    import tsinfer
    import logging
    import os
    from pathlib import Path

    logging.basicConfig(level=logging.INFO)
    data_dir = Path(config["data_dir"])
    sample_data = tsinfer.SgkitSampleData(input[0].replace("variant_mask", ""))