CAST-genomics / haptools

Ancestry and haplotype aware simulation of genotypes and phenotypes for complex trait analysis
https://haptools.readthedocs.io
MIT License

fix: convert `samples` argument in `Genotypes.read` into a set and fix `tr_harmonizer` bug arising when TRTools is also installed #225

Closed mlamkin7 closed 8 months ago

mlamkin7 commented 11 months ago

Added a `lazy` option to all `Genotypes` classes which, when `False` (the default), sorts the output genotypes in the same order as the list of samples given (if one was given).

Also updated the `tr_harmonizer` dependency so it won't error when TRTools is also installed.

aryarm commented 11 months ago

@mlamkin7 and @gymreklab, I just thought of something that might be important.

According to the numpy docs, Genotypes.subset() will trigger a copy of the entire genotype matrix to be made in memory because we have to use advanced indexing to reorder the rows of the genotype matrix:

Advanced indexing always returns a copy of the data

https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
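The copy-vs-view distinction can be demonstrated in a few lines of plain NumPy, independent of haptools (the matrix here is just a stand-in for a genotype matrix):

```python
import numpy as np

# a small stand-in for a genotype matrix: 4 samples x 3 variants
gt = np.arange(12).reshape(4, 3)

# basic (slice) indexing returns a view that shares memory with gt
view = gt[1:3]
assert np.shares_memory(gt, view)

# advanced (integer-array) indexing, as needed to reorder samples,
# always returns a copy of the indexed data
reordered = gt[[2, 0, 3, 1]]
assert not np.shares_memory(gt, reordered)
```

So reordering all rows at once necessarily allocates a second full-size matrix, at least transiently.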

So we should probably also profile the memory usage of read() before and after this change? I can look into a way to do that, if you'd like. I've been meaning to do it for something else anyway (see here, for ex)

If it's true that the memory gets doubled, then my preference would be to make the samples argument a set instead of a list, since I think the default behavior should be as quick and as lightweight as possible. Or, at least, maybe we should default the new parameter to not perform the sorting?
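For context on the set suggestion: membership tests against a Python set are O(1) on average, while a list is scanned linearly for every lookup. A minimal sketch of the kind of filtering step a reader might do per sample (the `wanted` and `kept` names are hypothetical, not haptools API):

```python
# hypothetical sample-filtering step inside a streaming reader
all_samples = [f"sample{i}" for i in range(100_000)]

# the `samples` argument, stored as a set for O(1) membership tests
wanted = set(all_samples[::1000])

# one constant-time lookup per sample, instead of an O(len(samples))
# list scan repeated for each of the 100,000 samples in the file
kept = [s for s in all_samples if s in wanted]
```

A set also drops any implied ordering, which matches the idea that the reader keeps samples in file order rather than reordering them.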

mlamkin7 commented 11 months ago

A couple ideas:

  1. Now that I'm thinking about it, naming it lazy might be confusing, since lazy is also a parameter of cyvcf2 and people might think this is that parameter. What do you think about calling it something like reorder_samples and defaulting it to True?
  2. Can you add type hints to be consistent? So lazy=False would become lazy: bool = False, for example
  3. The Genotypes classes also have an __iter__() method that is supposed to work similarly to the read() method, except it reads things line by line to reduce memory use. Unfortunately, we won't be able to use subset() to reorder samples there. What should we do in that case, do you think? I'd prefer the behavior of __iter__() to continue to align with read() as much as possible

Updated 1 and 2, but 3 does pose an issue since I believe it returns one variant at a time? We may have to sort each time.
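One way to keep `__iter__()` aligned with `read()` without a full-matrix copy would be to compute the reordering index once, before iteration, and apply it to each variant's genotype vector as it is yielded. A sketch under that assumption (the names here are illustrative, not the actual haptools API):

```python
import numpy as np

file_order = ["s1", "s2", "s3", "s4"]   # sample order on disk
requested = ["s3", "s1", "s4", "s2"]    # order the caller asked for

# computed once, before iteration begins
idx = np.array([file_order.index(s) for s in requested])

def iter_variants(rows):
    """Yield one variant at a time, reordered to match `requested`."""
    for row in rows:
        # the per-variant copy is small: one row, not the whole matrix
        yield row[idx]

rows = np.arange(8).reshape(2, 4)       # 2 variants x 4 samples
reordered = list(iter_variants(rows))
```

The cost is one small advanced-indexing copy per variant, so peak memory stays at roughly one row above the streaming baseline.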

aryarm commented 11 months ago

Confirming now that running subset() together with read() does indeed double the peak memory usage :(

results

Running read() and then subset() gives us a peak memory usage of 8.8 MiB. Judging by the width and color intensity of the flamegraph bars for the read() and subset() functions, roughly half of the 8.8 MiB is due to read() while the other half is due to subset().

[flamegraph: peak-memory]

Running just read() alone gives us a peak memory usage of 5.1 MiB. Most of this is due to the initial allocation of memory for the genotype matrix by read().

[flamegraph: peak-memory]

For more info on how to interpret the flamegraphs, take a look at the filprofiler docs

reproducing this

In case we need it later, here are the steps to reproduce this. Using commit 56dddf46b193f38e23cf40809745873f4031bdb7, run this command

fil-profile run tests/bench_genotypes_mem.py

with this script

#!/usr/bin/env python

from filprofiler.api import profile

from haptools.data import GenotypesPLINK

# two identical objects so each benchmark starts from a clean state
gt = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)
gt2 = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)

def read():
    # baseline: just load the genotype matrix
    gt.read()

def read_and_subset():
    # load, then reorder the samples in place via advanced indexing
    gt2.read()
    gt2.subset(samples=gt2.samples, inplace=True)

# write one flamegraph per benchmark
profile(read, "fil-result/bench_read_GenotypesPLINK")
profile(read_and_subset, "fil-result/bench_read_subset_GenotypesPLINK")

using this file