BioJulia / Kmers.jl

In development: Kmer types and methods for julia
MIT License
21 stars 7 forks source link

Breaking changes for v1 #35

Open jakobnissen opened 8 months ago

jakobnissen commented 8 months ago

This PR implements a thorough overhaul of Kmers.jl. There are too many changes to list exhaustively, but here are some important changes

TODO

Closes #12 Closes #13 (with the answer "no") Closes #17 Closes #23 Closes #28 Closes #31 Closes #34

jakobnissen commented 8 months ago

Thank you for the kind words, @cjprybol!

I'm planning to post on Discourse one of the coming days about this upcoming release and request feedback. There is still a little more work to do - for example, this PR needs changes to BioSequences to land first. Nonetheless - yes! I would like a review. To test it, dev the kmers_compat branch of BioSequences (there is an open PR), then dev Kmers. The best review you can give is a usability one. What could be nicer, API wise? Any missing but important functionality? I'm pretty comfortable with the lower level implementation.

We don't back Kmers with SVector, because a) we want to avoid StaticArrays as a large dependency and b) we need to do most of the low-level bit twiddling ourselves anyway, since symbols are stored in a packed format in Kmers (e.g. only 2 bits for a nucleotide instead of whole byte).

A fw/rv iterator might be a good idea! I've currently implemented a canonical kmer iterator that only outputs the lexographically smaller of the two. But I think a fw/rv iterator that yields both may be more generically usable.

kescobo commented 8 months ago

I've currently implemented a canonical kmer iterator that only outputs the lexographically smaller of the two.

I smell MinHash!

In the discourse post, you note that a bunch of other more complicated convenience features were removed or not implemented. I infer that in most cases, the intent is for those things to be implemented in other packages, but are there any features that you think will belong here in a point release after the API is stable?

jakobnissen commented 8 months ago

@kescobo In the discourse post, I only mention skipmers, minimizers and k-min-mers. In the message of this PR, I mention other misc features such as random kmers.

I believe skipmers were an invention of Sebarina Ward's research group, which is why they were originally implemented in BioSequences.jl (and Kmers.jl). As far as I know, they have not had much usage in the wider bioinfo world. One notable exception is the bwa-mem algorithm of popular aligner bwa by Heng Li. However, minimizers have been shown to be superior to kmers and skipmers for most applications, and indeed Heng Li have created minimap2 using minimizers, which is mostly a successor to bwa (and is mostly just better though it's complicated). So, while it's a little awkward to remove something so explicitly added by the original author, I don't believe skipmers really belong in a generic package like Kmers.jl, though I could be persuaded otherwise.

Minimizers, and their even newer relative k-min-mers, are more useful, but the precise algorithm for extracting them differ from application to application, so I'm guessing users would want to create their own minimizer iterators. I'd rather provide functionality to more easily make custom iterators than to provide my own which is probably not optimal for most users.

Regarding the other functionality - shuffled and random kmers, I don't think they are that useful. I'll add them if someone needs them and makes a case for them. For the constructors that infer the length and alphabet, I don't think it's a good interface. Much better to be explicit.

I smell MinHash!

Yes! You can combine MinHash.jl with this iterator to have a pretty efficient minhash:

julia> using MinHash, BioSequences, Kmers, BenchmarkTools

julia> seq = randdnaseq(10_000_000);

julia> @btime sketch(CanonicalDNAMers{31}(seq), 100)
  35.612 ms (8 allocations: 6.88 KiB)

That's 3.5 nanoseconds per kmer. That could plausibly become even faster, as I'm planning to optimise hashing of kmers as much as humanly possible.

kescobo commented 8 months ago

Great! So I take this to mean that you consider Kmers.jl feature complete, barring someone making a strong affirmative case for something. Or put another way, there aren't any features you have explicitly planned to implement after the 1.0 release (at least not in this package).

FWIW, I think this is a good decision - since you're aggressively optimizing performance here, it makes sense to keep the surface as small as possible.

I'd rather provide functionality to more easily make custom iterators than to provide my own which is probably not optimal for most users.

This seems really great. Are there any other things that are (or should be planned to be) extensible like this? Is it possible for other packages to make their own kmer types, and if so, is the interface well-documented? (Noting that I haven't yet looked at the code or docs, if this is explained elsewhere, no need to repeat yourself).

So, while it's a little awkward to remove something so explicitly added by the original author, I don't believe skipmers really belong in a generic package like Kmers.jl

I don't think it's awkward. That code is still all there in the history, and can be revived if needed. I don't think she'd expect you to maintain her code forever. One consideration might be whether it is in-principle possible to reimplement in a package that extends this one, but even if not, I don't think you need to worry about it.

camilogarciabotero commented 8 months ago

I tried to make some changes but wrongfully created an alternative PR (already closed), here are some small typos I found in the docs

camilogarciabotero commented 8 months ago

I was experimenting a little bit with the each_dna_codon method and found an unexpected behavior. This is from the docs:

julia> collect(each_dna_codon("TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
 TGA
 CGA
 TCG

But what if the input seq is of a BioSequence type:

collect(each_dna_codon(dna"TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
 TGA
 TGA
 TGA

Shouldn't it be allowed?

camilogarciabotero commented 8 months ago

I was playing around more. This is probably an undesired behavior or something not yet implemented... But as the interfaces of BioSequences and Kmers share some methods I was wondering if the find* methods will work in some ways between BioSequence and Kmer types:

  1. Symbol finding with findall:
findall(DNA_T, dna"ATGCTG")
2-element Vector{Int64}:
 2
 5

will this work:

findall(mer"T"d,   dna"ATGCTG")
  1. BioRegex finding with findfirst and findlast
findfirst(biore"ATG"dna, dna"ATGCTG", 1)
1:3

will this work:

findfirst(mer"T"d,   dna"ATGCTG")

I didn't see this in the in this PR, but ignore this comment if this is already something discussed. Best

BeastyBlacksmith commented 7 months ago

Can't wait to have it 😍