C++ library design - Githubissues

standage commented 7 years ago

The vast majority of khmer's functionality depends on two data data structures (sketches?): Bloom filter and Count-min sketch. And REALLY these are pretty much the same data structure at the conceptual and interface level, the only difference being whether the internal implementation stores k-mer counts or binary k-mer presence/absence values.

The khmer code base now includes several variations on this core data structure: different hash functions, with or without graph traversal, etc., with plans for even more. Supporting all of these variations long term is going to require some investment in refactoring, I think. Adding new functionality (like new hash functions or support for new input formats) currently involves touching a lot of code, and has proven more difficult and time intensive than anticipated. Plus, with our medium- to long-term plan to transition to Cython for the C++/Python glue, I think there's occasion to consider what we could be doing more cleanly at the C++ level (example). (I of course acknowledge the tension between engineering the software and doing useful stuff with the software, a perennial source of headache as @ctb so eloquently remarked in his latest Moore Foundation report.)

In this issue I just wanted to collect all of the relevant components and dimensions we need to consider, as a reference for future work on the core library. Anything I'm missing?

storage
- 8 bits per entry (k-mer counts)
- 4 bits per entry (small k-mer counts)
- 1 bit per entry (k-mer presence/absence)
- potentially others (such as 16 bit counters)?
data input format
- Fasta/Fastq
- BAM (also supported by SeqAn)
strategies for handling non-ACGT characters in the input
- ignore entire sequences containing non-ACGT
- ignore only k-mers containing non-ACGT
- convert non-ACGT characters to A
- regardless of strategy:
  - collect and report statistics on the number of reads / k-mers ignored or modified
  - preserve original read sequence for any output produced, regardless of how read is used internally
hash functions
- reversible hash supporting graph traversal
- non-reversible hash supporting k>32
- strand-specific hash function?

ctb commented 7 years ago

-1 on investing in large-scale refactoring - my experience in this project and others suggests that this is a fantastic way to expend a vast amount of time and effort on something with dubious benefits. Essentially you get architecture astronaut behavior :). You also (together or separately) get a lot of unfinished PRs.

+0 on using the Cython transition as a way to re-engage with some of the C++ code design. Conveniently @camillescott is already doing a detailed first pass as part of #1502 :)

There are several other ideas to add below -- off the top of my head,

other implementations of the data types (non-Bloom-filter based), e.g. to be more cache local;
the banding stuff;

The looming issue for me is the question of transitioning to the oxli name. I'm not sure how to approach that but I still want to do it.

There's also issues of stability of funding.

And I'd like to offer up the comment that (IMO) the primary goal of khmer is to support our (and others') research efforts. I'm positive about refactoring and so on if it going to simplify, enable better correctness, and speed up R&D.

standage commented 7 years ago

I'm positive about refactoring and so on if it going to simplify, enable better correctness, and speed up R&D.

Of course. I still do occasionally take time to work on khmer for khmer's sake, but recently my contributions have all be in support of what I need to get done with kevlar. I'm not suggesting we reorganize (or redisorganize :-) for its own sake, but to simplify, to bring the model reflected in our code into better alignment with our conceptual model. To get rid of duplicated code, to decouple classes that don't need to be tightly coupled. I think this will do a lot to speed up R&D.

In principle, I agree with the sentiment that gradual refactoring is preferable to large-scale refactoring. That said, recent experience shows that some (even small) changes end up touching a lot of code, and require a lot of discipline not to try to Solve All The Problems.

ctb commented 7 years ago

recent experience shows that some (even small) changes end up touching a lot of code, and require a lot of discipline not to try to Solve All The Problems.

Right, and if you scale that up to big changes, then you're trying to touch ALL the code and REALLY solve All the Problems...

dib-lab / khmer

C++ library design #1592