brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
262 stars 35 forks source link

merging somalier files for different chromosomes #97

Closed burakalver closed 2 years ago

burakalver commented 2 years ago

Hi, I have a large joint calling result that is represented in multiple vcfs, one file per each chromosome. I would like to get a whole genome somalier extract result for them. I could merge the vcfs and run extract. But it would be much nicer if I could either provide multiple vcf files or merge the extracted somalier files. Is there a way to do either? Or maybe a utility to convert the somalier file back and forth to a text representation, which could have other uses also?

Thanks in advance!

burakalver commented 2 years ago

It did not turn out too difficult to put something together based on scripts/ancestry-predict.py.

I can add some more error handling if you think this is useful. But I'll leave it for now in case you don't think the use-case is common enough or you prefer to do it in nim.

from pathlib import Path
import numpy as np

def read_somalier(path):
    """
    take a path to a single .somalier file and return a simple datastructure
    containing the sample information
    """
    data = Path(path).read_bytes()
    version = int.from_bytes(data[:1], byteorder="little")
    assert version == 2, ("bad version for:", path)
    data = data[1:]
    sample_strlen = int.from_bytes(data[:1], byteorder="little")
    data = data[1:]
    sample = data[:sample_strlen].decode()
    data = data[sample_strlen:]

    nsites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]
    nxsites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]
    nysites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]

    sites = np.frombuffer(data[:nsites * 3 * 4], dtype=np.uint32).reshape((nsites, 3))
    data = data[nsites * 3 * 4:]
    x_sites = np.frombuffer(data[:nxsites * 3 * 4], dtype=np.uint32).reshape((nxsites, 3))
    data = data[nxsites * 3 * 4:]
    y_sites = np.frombuffer(data[:nysites * 3 * 4], dtype=np.uint32).reshape((nysites, 3))

    return dict(sample=sample, sites=sites, x_sites=x_sites, y_sites=y_sites)

def write_somalier(data, path):
    file = open(path, "wb")
    file.write(int(2).to_bytes(1, byteorder='little'))
    sample = data["sample"]
    file.write(len(sample).to_bytes(1, byteorder='little'))
    file.write(sample.encode())

    sites = data["sites"]
    file.write(len(sites).to_bytes(2, byteorder='little'))
    x_sites = data["x_sites"]
    file.write(len(x_sites).to_bytes(2, byteorder='little'))
    y_sites = data["y_sites"]
    file.write(len(y_sites).to_bytes(2, byteorder='little'))

    file.write(sites.tobytes())
    file.write(x_sites.tobytes())
    file.write(y_sites.tobytes())
    file.close()

def merge_somaliers(in_paths, out_path):
    data = read_somalier(in_paths[0])
    sample = data["sample"]
    sites = data["sites"].copy()
    sites.fill(0)
    x_sites = data["x_sites"].copy()
    x_sites.fill(0)
    y_sites = data["y_sites"].copy()
    y_sites.fill(0)
    for path in in_paths:
        data = read_somalier(path)
        if sample != data["sample"]:
            raise ValueError("sample in different files are different. Code intends to merge one sample")
        sites += data["sites"]
        x_sites += data["x_sites"]
        y_sites += data["y_sites"]
    data = dict(sample=sample, sites=sites, x_sites=x_sites, y_sites=y_sites)
    write_somalier(data, out_path)
brentp commented 2 years ago

nice job figuring this out! this is very useful. If you don't mind, I might add it to the wiki.

brentp commented 2 years ago

cheers: https://github.com/brentp/somalier/wiki/Frequently-Asked-Questions

Shrishtee-kandoi commented 6 months ago

Hi @brentp and @burakalver,

Thank you for sharing this! It has been quite helpful. After merging the files, I noticed that the size remains same for all the merged.somalier files. This is my first time working with somalier files. Do you know if there a way to verify that the merging was successfully completed?