apcamargo / pycoverm

Simple Python interface to CoverM's fast coverage estimation functions
GNU General Public License v3.0

PyCoverM not memory efficient for lots of files #9

Closed jakobnissen closed 1 year ago

jakobnissen commented 1 year ago

We sometimes want to get abundances for, say, 1000 BAM files mapped to 10 million contigs. With the current construction of pycoverm, this is not feasible since it will exhaust all RAM. Hence, we have to resort to another depth estimator for large projects, which is a shame as I rather like this package.

In my own contig coverage calculator, I computed the coverage of each file independently, then serialized the resulting depth into a file, making sure they were all the same length (i.e. each BAM file had the same number of references). After all files had been processed, I made one single matrix in memory, then deserialized the files one at a time and filled in the columns in my large matrix.

Perhaps a similar strategy could be done for pycoverm?
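The per-file strategy described above can be sketched in Python. Everything here is hypothetical: `compute_depths` is a stand-in for any per-BAM depth estimator (it is not a pycoverm function), and the sketch assumes every BAM file shares the same reference headers.

```python
import os
import tempfile

import numpy as np


def compute_depths(bam_path, n_contigs):
    """Hypothetical per-file depth estimator: one float per contig."""
    rng = np.random.default_rng(abs(hash(bam_path)) % (2**32))
    return rng.random(n_contigs, dtype=np.float32)


def depths_matrix(bam_paths, n_contigs, workdir):
    # Phase 1: process each BAM independently and serialize its column,
    # so only one file's depths are in memory at a time.
    column_files = []
    for bam in bam_paths:
        depths = compute_depths(bam, n_contigs)
        # All files must have the same number of references.
        assert len(depths) == n_contigs
        path = os.path.join(workdir, os.path.basename(bam) + ".npy")
        np.save(path, depths)
        column_files.append(path)
    # Phase 2: allocate the full matrix once, then fill it column by column.
    matrix = np.empty((n_contigs, len(bam_paths)), dtype=np.float32)
    for j, path in enumerate(column_files):
        matrix[:, j] = np.load(path)
    return matrix


with tempfile.TemporaryDirectory() as tmp:
    m = depths_matrix(["a.bam", "b.bam", "c.bam"], n_contigs=5, workdir=tmp)
    print(m.shape)  # (5, 3)
```

Peak memory is then roughly one column plus the final matrix, rather than several intermediate copies of everything.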

apcamargo commented 1 year ago

So, you're suggesting an interface to process the BAM files sequentially? Could you achieve that by looping through the files, processing them one by one with pycoverm, and writing out the serialized matrices? Then you would have to make sure that the rows are the same, and in the same order, for all matrices.

Sorry if I misunderstood your suggestion.

jakobnissen commented 1 year ago

Oh right. Of course, I could just do it in whatever tool I call pycoverm from instead of having the logic in pycoverm. Do you have a hunch about what the memory bottleneck of pycoverm might be?

apcamargo commented 1 year ago

I never profiled memory usage, but my guess is the following:

This part of the code reads the BAM files sequentially and stores the coverage values in an IndexMap whose keys are contig names and whose values are vectors holding the coverages:

    // Accumulate coverages: one IndexMap entry per contig name, whose vector
    // holds that contig's coverage in each BAM file, in order.
    let mut contig_coverages = IndexMap::new();
    match &estimators_and_taker.taker {
        CoverageTakerType::CachedSingleFloatCoverageTaker {
            stoit_names: _,
            entry_names,
            coverages,
            current_stoit_index: _,
            current_entry_index: _,
            num_coverages: _,
        } => {
            for input_bam in coverages {
                for coverage_entry in input_bam {
                    if let Some(contig_name) = &entry_names[coverage_entry.entry_index] {
                        // Append this file's coverage to the contig's vector,
                        // creating an empty vector on first sight.
                        let contig_coverage_vector =
                            contig_coverages.entry(contig_name).or_insert(vec![]);
                        contig_coverage_vector.push(coverage_entry.coverage);
                    }
                }
            }
        }
        _ => unreachable!(),
    }

Then, I create two more vectors that store the names and the coverages of all contigs:

    let mut coverage_vector: std::vec::Vec<f32> = Vec::new();
    let mut contig_names_vector = Vec::new();

    if filter_contigs {
        for (contig_name, contig_coverage_vector) in contig_coverages.iter() {
            if contig_set.contains(contig_name.as_str()) {
                coverage_vector.extend(contig_coverage_vector);
                contig_names_vector.push(*contig_name);
            }
        }
    } else {
        for (contig_name, contig_coverage_vector) in contig_coverages.iter() {
            coverage_vector.extend(contig_coverage_vector);
            contig_names_vector.push(*contig_name);
        }
    }

This means that, at some point, the coverages are stored in two different places in memory. I think I did this at the time to allow filtering contigs, but it's certainly not the ideal solution.
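A rough Python analogue of the two phases above may make the duplication easier to see. The data and the filter set here are made up; the real code is the Rust shown earlier.

```python
# Phase 1: accumulate per-contig coverages in an insertion-ordered dict
# (Python dicts preserve insertion order, like Rust's IndexMap).
contig_coverages = {}
entries = [("contig_1", 2.0), ("contig_2", 1.5), ("contig_1", 3.0)]
for name, cov in entries:
    contig_coverages.setdefault(name, []).append(cov)

# Phase 2: flatten into two parallel vectors, optionally filtering contigs.
contig_set = {"contig_1"}  # hypothetical filter set
filter_contigs = True
coverage_vector, contig_names_vector = [], []
for name, covs in contig_coverages.items():
    if not filter_contigs or name in contig_set:
        coverage_vector.extend(covs)
        contig_names_vector.append(name)

# Both `contig_coverages` and `coverage_vector` are alive here, so the
# coverages briefly exist in two places in memory at once.
print(contig_names_vector, coverage_vector)  # ['contig_1'] [2.0, 3.0]
```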

jakobnissen commented 1 year ago

Why have the IndexMap in the first place? Is it to support reading two BAM files that contain the same sequences but in a different order? If not, perhaps the data could be stored in simply a single vector of names and a single vector of coverages (interpreted as a 2D array)? Of course, it would need to check that all files actually have the same headers.
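The flat-vector idea can be sketched with numpy: one names vector plus one coverages buffer viewed as a `(n_contigs, n_samples)` array, after verifying that all files share identical headers. `read_header` and the depth values are hypothetical stand-ins.

```python
import numpy as np


def read_header(bam_path):
    # Hypothetical helper: returns the ordered contig names of one BAM file.
    return ["contig_1", "contig_2", "contig_3"]


bam_paths = ["a.bam", "b.bam"]
headers = [read_header(p) for p in bam_paths]
# All files must actually have the same headers, in the same order.
assert all(h == headers[0] for h in headers), "BAM headers differ"
contig_names = headers[0]

# One flat coverages buffer; column j holds file j's depths.
flat = np.zeros(len(contig_names) * len(bam_paths), dtype=np.float32)
matrix = flat.reshape(len(contig_names), len(bam_paths))  # a view, no copy
matrix[:, 0] = [2.0, 1.5, 0.0]  # made-up depths from a.bam
print(matrix.shape)  # (3, 2)
```

Because the 2D array is a view of the flat buffer, there is only one copy of the coverages in memory, unlike the IndexMap-plus-vectors approach.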

apcamargo commented 1 year ago

Yes, that's why. I think that for most use cases this wouldn't happen, though.

I'd be ok with changing this behaviour, but I don't have the time to work on this right now. So it may take a while :(

jakobnissen commented 1 year ago

That's fine, if necessary I can fork pycoverm and maybe make a PR. I'll keep this issue open for now :)

apcamargo commented 1 year ago

I just uploaded the wheels to PyPI. I can leave this issue open in case you feel we can have a better interface for dealing with the headers.

jakobnissen commented 1 year ago

Cool. Okay, let's close it. I can make another PR if needed, once I can think of a nice interface.