apcamargo / pycoverm

Simple Python interface to CoverM's fast coverage estimation functions
GNU General Public License v3.0
7 stars 2 forks source link

Optimise mem usage, disallowing random ref order #10

Closed jakobnissen closed 1 year ago

jakobnissen commented 1 year ago

Currently, pycoverm is not particularly memory efficient. It keeps coverages in an name=>Vec IndexMap, then creates a matrix and fills the matrxi from the map. This allows the order of reference sequences in the BAM files to be in different order, but it consumes a lot of memory. Further, it's not very common to have a situation where BAM files are mapped to the exact same set of references, but in a different order, as BAM reference order is deterministic, and reflects the order of the reference FASTA file.

This commit checks that all BAM files have the same headers in the same order. It then computes an old_refindex => new_refindex map, which can be stored as a single Vec. The coverage matrix can then be preallocated and filled in. This approach is nearly perfectly memory efficient.

The only remaining issue is that the headers might take up a lot of memory, which is unnecessary in the case that the users already know the headers. This might seem trivial, but if there are 10 million sequences that each has to be heap allocated, it can take some time. A solution to this, which I employ in Vamb, is to pass a hash of the headers to the coverage computation. The headers can then be verified without consuming any memory.

apcamargo commented 1 year ago

Thanks a lot, @jakobnissen! This looks great!

Just found one possible bug:

fn index_map(headers: &[String], names: &HashSet<&str>) -> Vec<u32> {
    // Make sure there are no names in names which are not in headers.
    let mut n_seen: usize = 0;
    let mut result = vec![u32::MAX; names.len()]; // ← should be `let mut result = vec![u32::MAX; headers.len()];`, right?
    let mut new_index: u32 = 0;
    for (i, header) in headers.iter().enumerate() {
        if names.contains(&header.as_str()) {
            result[i] = new_index;
            new_index += 1;
            n_seen += 1
        }
    }
    if n_seen != names.len() {
        panic!("Some names in `names` were not found in headers")
    }
    result
}

Also, I think it would be better if the error message was "Some names in contig_set were not found in headers", since contig_set is the parameter name that is exposed to the user.

jakobnissen commented 1 year ago

Good points - you're right. Fixed in this commit.

apcamargo commented 1 year ago

Thanks! I'll merge this and get a new version out as soon as possible