BioJulia / Bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia
http://biojulia.dev
MIT License
261 stars 65 forks source link

implement a faster and simpler eachkmer #462

Closed bicycle1885 closed 7 years ago

bicycle1885 commented 7 years ago

On my machine, composition is now ~20-30% faster for short kmers (e.g. 4-8).

~/.j/v/B/benchmarks (faster-eachkmer|…) $ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.1 (2017-03-05 13:25 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-apple-darwin16.4.0

julia> using BenchmarkTools

julia> using Bio.Seq
INFO: Recompiling stale cache file /Users/kenta/.julia/lib/v0.5/Bio.ji for module Bio.
INFO: compiling FASTA
INFO: compiling FASTQ
INFO: compiling old FASTA
INFO: compiling BED
INFO: compiling GFF3
INFO: compiling SAM
INFO: compiling VCF
WARNING: Method definition require(Symbol) in module Base at loading.jl:345 overwritten in module Main at /Users/kenta/.julia/v0.5/Requires/src/require.jl:12.

julia> srand(1234); seq = randdnaseq(10_000);

julia> @benchmark composition(each(DNAKmer{4}, seq))
BenchmarkTools.Trial:
  memory estimate:  2.20 KiB
  allocs estimate:  5
  --------------
  minimum time:     55.265 μs (0.00% GC)
  median time:      56.282 μs (0.00% GC)
  mean time:        59.166 μs (0.40% GC)
  maximum time:     2.444 ms (97.03% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark composition(each(DNAKmer{8}, seq))
BenchmarkTools.Trial:
  memory estimate:  512.16 KiB
  allocs estimate:  6
  --------------
  minimum time:     122.389 μs (0.00% GC)
  median time:      273.634 μs (0.00% GC)
  mean time:        277.880 μs (13.36% GC)
  maximum time:     3.291 ms (79.79% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark composition(each(DNAKmer{12}, seq))
BenchmarkTools.Trial:
  memory estimate:  128.00 MiB
  allocs estimate:  6
  --------------
  minimum time:     74.796 ms (15.56% GC)
  median time:      76.573 ms (15.36% GC)
  mean time:        80.761 ms (19.41% GC)
  maximum time:     156.987 ms (58.30% GC)
  --------------
  samples:          62
  evals/sample:     1

julia> @benchmark composition(each(DNAKmer{8}, seq, 3))
BenchmarkTools.Trial:
  memory estimate:  512.16 KiB
  allocs estimate:  6
  --------------
  minimum time:     82.236 μs (0.00% GC)
  median time:      256.368 μs (0.00% GC)
  mean time:        269.561 μs (14.29% GC)
  maximum time:     3.423 ms (81.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
~/.j/v/B/benchmarks (master|…) $ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.1 (2017-03-05 13:25 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-apple-darwin16.4.0

julia> using BenchmarkTools

julia> using Bio.Seq
WARNING: Method definition require(Symbol) in module Base at loading.jl:345 overwritten in module Main at /Users/kenta/.julia/v0.5/Requires/src/require.jl:12.

julia> srand(1234); seq = randdnaseq(10_000);

julia> @benchmark composition(each(DNAKmer{4}, seq))
BenchmarkTools.Trial:
  memory estimate:  2.20 KiB
  allocs estimate:  5
  --------------
  minimum time:     67.114 μs (0.00% GC)
  median time:      68.988 μs (0.00% GC)
  mean time:        72.054 μs (0.33% GC)
  maximum time:     2.428 ms (96.46% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark composition(each(DNAKmer{8}, seq))
BenchmarkTools.Trial:
  memory estimate:  512.16 KiB
  allocs estimate:  6
  --------------
  minimum time:     159.216 μs (0.00% GC)
  median time:      349.833 μs (0.00% GC)
  mean time:        343.539 μs (12.50% GC)
  maximum time:     3.673 ms (80.46% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark composition(each(DNAKmer{12}, seq))
BenchmarkTools.Trial:
  memory estimate:  128.00 MiB
  allocs estimate:  6
  --------------
  minimum time:     74.641 ms (15.39% GC)
  median time:      76.195 ms (15.36% GC)
  mean time:        80.647 ms (19.44% GC)
  maximum time:     158.713 ms (57.95% GC)
  --------------
  samples:          62
  evals/sample:     1

julia> @benchmark composition(each(DNAKmer{8}, seq, 3))
BenchmarkTools.Trial:
  memory estimate:  512.16 KiB
  allocs estimate:  6
  --------------
  minimum time:     113.664 μs (0.00% GC)
  median time:      311.359 μs (0.00% GC)
  mean time:        300.800 μs (14.07% GC)
  maximum time:     3.596 ms (82.28% GC)
  --------------
  samples:          10000
  evals/sample:     1
bicycle1885 commented 7 years ago

The performance of kmer iterator improves for longer k-mers, too. However, for long k-mers, the cost of array access becomes dominant in composition as shown above. The following benchmarks do not access arrays so it is apparent that the performance has improved.

this branch:

julia> @benchmark count(_->true,each(DNAKmer{8}, seq))
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  4
  --------------
  minimum time:     27.570 μs (0.00% GC)
  median time:      28.470 μs (0.00% GC)
  mean time:        30.255 μs (0.00% GC)
  maximum time:     118.795 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark count(_->true,each(DNAKmer{16}, seq))
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  4
  --------------
  minimum time:     27.635 μs (0.00% GC)
  median time:      30.678 μs (0.00% GC)
  mean time:        32.448 μs (0.00% GC)
  maximum time:     120.867 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark count(_->true,each(DNAKmer{32}, seq))
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  4
  --------------
  minimum time:     27.689 μs (0.00% GC)
  median time:      27.756 μs (0.00% GC)
  mean time:        30.375 μs (0.00% GC)
  maximum time:     265.790 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

master:

julia> @benchmark count(_->true,each(DNAKmer{8}, seq))
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  4
  --------------
  minimum time:     37.417 μs (0.00% GC)
  median time:      37.497 μs (0.00% GC)
  mean time:        40.360 μs (0.00% GC)
  maximum time:     171.461 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark count(_->true,each(DNAKmer{16}, seq))
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  4
  --------------
  minimum time:     35.936 μs (0.00% GC)
  median time:      36.722 μs (0.00% GC)
  mean time:        38.952 μs (0.00% GC)
  maximum time:     151.353 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark count(_->true,each(DNAKmer{32}, seq))
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  4
  --------------
  minimum time:     35.939 μs (0.00% GC)
  median time:      37.472 μs (0.00% GC)
  mean time:        38.220 μs (0.00% GC)
  maximum time:     158.646 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
codecov-io commented 7 years ago

Codecov Report

Merging #462 into master will decrease coverage by 0.14%. The diff coverage is 95.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #462      +/-   ##
==========================================
- Coverage   65.17%   65.02%   -0.15%     
==========================================
  Files         159      159              
  Lines       11328    11335       +7     
==========================================
- Hits         7383     7371      -12     
- Misses       3945     3964      +19
Impacted Files Coverage Δ
src/seq/composition.jl 63.26% <100%> (ø) :arrow_up:
src/seq/kmer.jl 86.51% <66.66%> (+0.3%) :arrow_up:
src/seq/eachkmer.jl 94.23% <97.56%> (+0.61%) :arrow_up:
src/seq/nmask.jl 26.38% <0%> (-25.01%) :arrow_down:
src/seq/refseq.jl 88.57% <0%> (-1.43%) :arrow_down:
src/align/alignment.jl 85.14% <0%> (-1%) :arrow_down:
src/intervals/interval.jl 92% <0%> (+1.33%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 70ea67e...46a1b38. Read the comment docs.

bicycle1885 commented 7 years ago

I'll merge this soon if no objections.