cerebis / qc3C

Reference-free quality assessment for Hi-C sequencing data
GNU Affero General Public License v3.0
12 stars 1 forks source link

Kmer analysis is slow due to Jellyfish SWIG overhead. #27

Open cerebis opened 5 years ago

cerebis commented 5 years ago

Jellyfish provides Python, along with Perl and Ruby bindings via SWIG. SWIG unfortunately is not efficient with fine grained call-stack.

In qc3C the many calls to query the jellyfish kmer database in analyze.collect_coverage() add up enormously, making this method the most time consuming call beneath the primary analyze() method, accounting for 51% of time.

Replacing SWIG with a simple pybind11 binding reduced this to 30%.

It would likely be better to export the sliding window depth analysis entirely to C++, returning just the inner and outer mean depths.

After doing so, the question would be how to best integrate this binding.

A manual build that produces an importable Python module, where the define is necessary to get SSE/Altivec type definititions from config.h. This include is generated by a call to ./configure.

# using a tarball release of jellyfish
tar xzf jellyfish-2.2.10.tar.gz
cd jellyfish-2.2.10
autoreconf -i
./configure --prefix=/usr/local
g++ -O3 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` jelly.cpp `ls lib/*` -o jelly`python3-config --extension-suffix` -Iinclude -I. -DHAVE_CONFIG_H=true

Example binding, implementing only methods we require. Here I have adapted the helper class QueryMerFile used by the Jellyfish SWIG binding.

#include <pybind11/pybind11.h>
#include <fstream>
#include <stdexcept>
#include "jellyfish/mer_dna.hpp"
#include "jellyfish/file_header.hpp"
#include "jellyfish/mer_dna_bloom_counter.hpp"
#include "jellyfish/hash_counter.hpp"
#include "jellyfish/jellyfish.hpp"

namespace py = pybind11;

class QueryMerFile {

    std::unique_ptr<jellyfish::mer_dna_bloom_filter> bf;
    jellyfish::mapped_file                           binary_map;
    std::unique_ptr<binary_query>                    jf;

    public:

    QueryMerFile(const char* path) {

        std::ifstream in(path);
        if(!in.good())
            throw std::runtime_error(std::string("Can't open file '") + path + "'");

        jellyfish::file_header header(in);
        jellyfish::mer_dna::k(header.key_len() / 2);

        if(header.format() == "bloomcounter") {
            jellyfish::hash_pair<jellyfish::mer_dna> fns(header.matrix(1), header.matrix(2));
            bf.reset(new jellyfish::mer_dna_bloom_filter(header.size(), header.nb_hashes(), in, fns));
            if(!in.good())
            throw std::runtime_error("Bloom filter file is truncated");
        } 
        else if(header.format() == "binary/sorted") {
            binary_map.map(path);
            jf.reset(new binary_query(binary_map.base() + header.offset(), header.key_len(), header.counter_len(), header.matrix(),
                     header.size() - 1, binary_map.length() - header.offset()));
        } 
        else {
           throw std::runtime_error(std::string("Unsupported format '") + header.format() + "'");
        }
    }

    unsigned int check(const jellyfish::mer_dna& m) { return jf ? jf->check(m) : bf->check(m); }
};

PYBIND11_MODULE(jelly, m) {
    py::class_<jellyfish::mer_dna>(m, "MerDNA")
            .def(py::init<const char*>())
            .def("canonicalize", &jellyfish::mer_dna::canonicalize);

    py::class_<QueryMerFile>(m, "QueryMerFile")
            .def(py::init<const char *>())
            .def("__getitem__", &QueryMerFile::check);
}