Closed burakalver closed 2 years ago
It did not turn out too difficult to put something together based on `scripts/ancestry-predict.py`.

I can add more error handling if you think this is useful, but I'll leave it for now in case you don't think the use case is common enough, or you'd prefer to do it in Nim.
```python
from pathlib import Path

import numpy as np


def read_somalier(path):
    """
    Read a single .somalier file and return a dict with the sample name and
    the autosomal, X, and Y site-count arrays.
    """
    data = Path(path).read_bytes()
    version = int.from_bytes(data[:1], byteorder="little")
    assert version == 2, f"bad version for: {path}"
    data = data[1:]
    sample_strlen = int.from_bytes(data[:1], byteorder="little")
    data = data[1:]
    sample = data[:sample_strlen].decode()
    data = data[sample_strlen:]
    nsites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]
    nxsites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]
    nysites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]
    # each site is a (nref, nalt, nother) triple of uint32 counts
    sites = np.frombuffer(data[:nsites * 3 * 4], dtype=np.uint32).reshape((nsites, 3))
    data = data[nsites * 3 * 4:]
    x_sites = np.frombuffer(data[:nxsites * 3 * 4], dtype=np.uint32).reshape((nxsites, 3))
    data = data[nxsites * 3 * 4:]
    y_sites = np.frombuffer(data[:nysites * 3 * 4], dtype=np.uint32).reshape((nysites, 3))
    return dict(sample=sample, sites=sites, x_sites=x_sites, y_sites=y_sites)


def write_somalier(data, path):
    with open(path, "wb") as file:
        file.write(int(2).to_bytes(1, byteorder="little"))
        sample = data["sample"]
        file.write(len(sample).to_bytes(1, byteorder="little"))
        file.write(sample.encode())
        sites = data["sites"]
        file.write(len(sites).to_bytes(2, byteorder="little"))
        x_sites = data["x_sites"]
        file.write(len(x_sites).to_bytes(2, byteorder="little"))
        y_sites = data["y_sites"]
        file.write(len(y_sites).to_bytes(2, byteorder="little"))
        file.write(sites.tobytes())
        file.write(x_sites.tobytes())
        file.write(y_sites.tobytes())


def merge_somaliers(in_paths, out_path):
    data = read_somalier(in_paths[0])
    sample = data["sample"]
    # zero-filled accumulators shaped like the first file's arrays
    sites = np.zeros_like(data["sites"])
    x_sites = np.zeros_like(data["x_sites"])
    y_sites = np.zeros_like(data["y_sites"])
    for path in in_paths:
        data = read_somalier(path)
        if sample != data["sample"]:
            raise ValueError("samples differ between files; this code merges shards of a single sample")
        sites += data["sites"]
        x_sites += data["x_sites"]
        y_sites += data["y_sites"]
    data = dict(sample=sample, sites=sites, x_sites=x_sites, y_sites=y_sites)
    write_somalier(data, out_path)
```
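As a quick sanity check of the layout the script assumes (version byte, name-length byte, name, three little-endian uint16 site counts, then uint32 triples), one can build a tiny synthetic payload and parse its header with `struct`. This is a sketch against the snippet's assumed field order and widths, not the somalier source itself:

```python
import struct
import numpy as np

# Build a minimal synthetic payload following the assumed layout:
# version (uint8), name length (uint8), name, three uint16 counts,
# then one (nref, nalt, nother) uint32 triple per site.
payload = b"".join([
    struct.pack("<B", 2),                # format version
    struct.pack("<B", 7),                # sample-name length
    b"sampleA",                          # sample name
    struct.pack("<HHH", 2, 0, 0),        # nsites, nxsites, nysites
    np.array([[10, 5, 0], [4, 4, 1]], dtype=np.uint32).tobytes(),  # 2 autosomal sites
])

# Parse it back the same way read_somalier walks the buffer.
version, name_len = struct.unpack_from("<BB", payload, 0)
name = payload[2:2 + name_len].decode()
nsites, nx, ny = struct.unpack_from("<HHH", payload, 2 + name_len)
sites = np.frombuffer(payload, dtype=np.uint32,
                      count=nsites * 3, offset=2 + name_len + 6).reshape(nsites, 3)

print(version, name, nsites, int(sites.sum()))  # → 2 sampleA 2 24
```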
Nice job figuring this out! This is very useful. If you don't mind, I might add it to the wiki.
Hi @brentp and @burakalver,
Thank you for sharing this! It has been quite helpful.
After merging the files, I noticed that the size remains the same for all the merged `.somalier` files. This is my first time working with somalier files. Do you know if there is a way to verify that the merging completed successfully?
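The unchanged file size is expected with the script above: the site counts and per-site record widths are fixed, so merging only changes the count values, not the number of bytes. One hedged way to verify the merge (assuming the `merge_somaliers` snippet above, with synthetic arrays standing in for parsed `sites` arrays here) is to check that the merged array equals the element-wise sum of the shards:

```python
import numpy as np

# Synthetic stand-ins for data["sites"] parsed from two per-chromosome shards.
shard1 = np.array([[3, 1, 0], [0, 2, 0]], dtype=np.uint32)
shard2 = np.array([[2, 0, 1], [1, 1, 0]], dtype=np.uint32)

# merge_somaliers sums counts element-wise, so the merged array should be
# exactly shard1 + shard2; in practice, read back the merged .somalier file
# and compare its arrays against the sum of the shard arrays.
merged = shard1 + shard2

assert np.array_equal(merged, np.array([[5, 1, 1], [1, 3, 0]], dtype=np.uint32))
print("merge check passed")
```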
Hi, I have a large joint-calling result that is split across multiple VCFs, one per chromosome. I would like to get a whole-genome `somalier extract` result for them. I could merge the VCFs and then run extract, but it would be much nicer if I could either provide multiple VCF files or merge the extracted `.somalier` files. Is there a way to do either? Or perhaps a utility to convert a `.somalier` file to and from a text representation, which could have other uses as well?

Thanks in advance!