@mlamkin7 and @gymreklab, I just thought of something that might be super important.
According to the numpy docs, `Genotypes.subset()` will trigger a copy of the entire genotype matrix in memory, because we have to use advanced indexing to reorder the rows of the genotype matrix:

> Advanced indexing always returns a copy of the data

https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
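Here's a quick self-contained demonstration of the difference (toy matrix, not haptools code):

```python
import numpy as np

# toy "genotype matrix": one row per sample
matrix = np.zeros((1000, 500), dtype=np.uint8)

# basic slicing returns a view -- no new allocation
view = matrix[10:20]
print(np.shares_memory(matrix, view))       # True

# advanced (integer-array) indexing returns a copy -- a second full allocation
order = np.random.permutation(matrix.shape[0])
reordered = matrix[order]
print(np.shares_memory(matrix, reordered))  # False
```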
So we should probably also profile the memory of `read()` before and after this change? I can look into a way to do that, if you'd like. I've been meaning to do it for something else anyway (see here, for ex)
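One lightweight option might be `tracemalloc` (a sketch with a hypothetical input path, assuming a recent numpy that reports its array allocations to tracemalloc):

```python
import tracemalloc

from haptools.data import GenotypesPLINK

# hypothetical input path; recent numpy versions register array
# allocations with tracemalloc, so the genotype matrix shows up in `peak`
gt = GenotypesPLINK("genotypes.pgen")

tracemalloc.start()
gt.read()
gt.subset(samples=gt.samples, inplace=True)
current, peak = tracemalloc.get_traced_memory()
print(f"peak memory: {peak / 2**20:.1f} MiB")
tracemalloc.stop()
```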
If it's true that the memory gets doubled, then my preference would be to make the `samples` argument a set instead of a list, since I think the default behavior should be as quick and as lightweight as possible. Or, at least, maybe we default the new parameter to not performing the sorting?
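To make the set idea concrete, here's a rough sketch (toy data, not haptools' actual code) of why it stays lightweight: membership checks are O(1), and the file's own sample order is preserved, so no full-matrix reordering copy is ever needed.

```python
import numpy as np

# toy data, not haptools' actual implementation
file_samples = ["HG00096", "HG00097", "HG00099", "HG00100"]  # order in the file
samples = {"HG00100", "HG00096"}                             # requested, as a set

# O(1) membership tests; `keep` follows the file's order, not the caller's
keep = [i for i, s in enumerate(file_samples) if s in samples]

rows = []
for _ in range(3):                            # stand-in for a per-variant read loop
    variant_gts = np.zeros((4, 2), np.uint8)  # toy genotypes for one variant
    rows.append(variant_gts[keep])            # small per-variant selection
matrix = np.array(rows)                       # final matrix, built once
```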
A few ideas:

- Now that I'm thinking about it, naming it `lazy` might be confusing, since `lazy` is also a parameter of `cyvcf2` and people might think that this is that parameter. What do you think about calling it something like `reorder_samples` and defaulting it to True?
- Can you add type hints to be consistent? So `lazy=False` would become `lazy: bool = False`, for example (see the sketch after this list).
- The Genotypes classes also have an `__iter__()` method that is supposed to work similarly to the `read()` method, except that it reads things line by line to reduce memory use. Unfortunately, we won't be able to use `subset()` to reorder samples for that. What should we do in that case, do you think? I'd prefer the behavior of `__iter__()` to continue to align with `read()` as much as possible.
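For 1 and 2 together, the signature might end up looking something like this (a sketch only; the real `read()` takes other parameters too):

```python
# sketch of the renamed, type-hinted parameter from ideas 1 and 2;
# not the actual haptools signature
class Genotypes:
    def read(self, samples: list = None, reorder_samples: bool = True):
        ...
```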
Updated 1 and 2, but 3 does pose an issue, since I believe `__iter__()` returns one variant at a time? We may have to sort each time.
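If we do end up sorting per-variant, it could look roughly like this (a sketch with toy data, assuming each iteration yields one variant's genotypes across all samples):

```python
import numpy as np

# hypothetical sketch of per-variant reordering inside __iter__():
# compute the index order once up front, then apply it to each yielded
# record -- a small per-variant copy instead of a second full matrix
order = np.array([2, 0, 1])  # toy mapping from requested order to file order

def iter_reordered(records):
    for gts in records:      # gts: genotypes for one variant, all samples
        yield gts[order]

records = [np.arange(6).reshape(3, 2) for _ in range(2)]  # toy per-variant data
for gts in iter_reordered(records):
    print(gts)
```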
Confirming now that running `subset()` together with `read()` does indeed double the peak memory usage :(

Running `read()` and then `subset()` gives us a peak memory usage of 8.8 MiB. Judging by the width and color intensity of the flamegraph bars for the `read()` and `subset()` functions, roughly half of the 8.8 MiB is due to `read()` while the other half is due to `subset()`.
Running just `read()` alone gives a peak memory usage of 5.1 MiB. Most of this is due to the initial allocation of memory for the genotype matrix by `read()`.
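As a back-of-the-envelope sanity check (toy dimensions, not the actual shape of the test file; assuming a uint8 genotype matrix), the baseline allocation scales as samples × variants × ploidy bytes:

```python
import numpy as np

# toy dimensions, NOT the real shape of the 1000G test file
n_samples, n_variants, ploidy = 2500, 1000, 2
matrix = np.empty((n_samples, n_variants, ploidy), dtype=np.uint8)
print(f"{matrix.nbytes / 2**20:.1f} MiB")  # ~4.8 MiB for these toy numbers
```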
For more info on how to interpret the flamegraphs, take a look at the filprofiler docs.
In case we need it later, here are the steps to reproduce this. Using commit 56dddf46b193f38e23cf40809745873f4031bdb7, run this command:

```
fil-profile run tests/bench_genotypes_mem.py
```

with this script:
```python
#!/usr/bin/env python
from filprofiler.api import profile

from haptools.data import GenotypesPLINK

# two separate objects so the second benchmark isn't skewed by the first
gt = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)
gt2 = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)

def read():
    # baseline: just load the genotype matrix
    gt.read()

def read_and_subset():
    # load, then reorder via subset() -- the case suspected of doubling memory
    gt2.read()
    gt2.subset(samples=gt2.samples, inplace=True)

profile(read, "fil-result/bench_read_GenotypesPLINK")
profile(read_and_subset, "fil-result/bench_read_subset_GenotypesPLINK")
```
using this file
Added a `lazy` option to all Genotypes classes which, when False (the default), sorts the output genotypes in the same order as the list of samples given (if provided).

Also updated the tr_harmonizer dependency so it won't error when TRTools is installed.
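For reference, the non-lazy path conceptually boils down to something like this (a sketch with toy data, not the exact haptools code):

```python
import numpy as np

# sketch of what the non-lazy default does conceptually: map each requested
# sample name to its row in the file, then fancy-index -- a full copy
file_samples = ["s3", "s1", "s2"]            # order in the file
requested = ["s1", "s2", "s3"]               # order the caller asked for
row_of = {name: i for i, name in enumerate(file_samples)}
indices = np.array([row_of[s] for s in requested])

matrix = np.arange(6).reshape(3, 2)          # toy matrix, one row per sample
reordered = matrix[indices]                  # advanced indexing: new allocation
```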