BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.

https://biojulia.dev

MIT License

61 stars 20 forks source link

Add constant time fasta index seeking [NEED REVIEW] #31

Closed jakobnissen closed 3 years ago

jakobnissen commented 4 years ago

Addresses issue #29. This PR also includes #30 .

Changes:

FASTA.Index now uses a Dict instead of a vector of names. This is slower to construct and more memory inefficient, but obviously has the benefit of constant-time lookup in the index. I'd argue that if you create an index, you do so specifically in order to have fast lookup, and you probably don't care about the few extra kilobytes of RAM used, or miliseconds when constructing the index.
Added a new method: extract(r::Reader, A::Alphabet, s::AbstractString, u::UnitRange), which does the same as sequence(A, r[s], u), but does not load the entire sequence into memory first.

codecov[bot] commented 4 years ago

Codecov Report

Merging #31 into master will increase coverage by 0.71%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #31      +/-   ##
==========================================
+ Coverage   86.88%   87.60%   +0.71%     
==========================================
  Files          13       13              
  Lines         572      605      +33     
==========================================
+ Hits          497      530      +33     
  Misses         75       75

Flag	Coverage Δ
#unittests	`87.60% <100.00%> (+0.71%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/fasta/index.jl	`100.00% <100.00%> (ø)`
src/fasta/reader.jl	`90.47% <100.00%> (+4.76%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d5b23c1...345d275. Read the comment docs.

TransGirlCodes commented 4 years ago

Is this to be merged? I already merged the other related PR.

CiaranOMara commented 4 years ago

The new indexing logic proposed here looks good to me. Programatically this PR is good to merge.

Though maintenance-wise there was, however, one thing that wasn't crucially obvious to me. I got caught out not realising that the spec allows differing line termination lengths/characters for different sequences in the same file. My temptation was to optimise out what I had perceived to be duplicate calculations of the number of bytes used for the line termination character. This optimisation, of course, would have been wrong. I'd like to see a comment that warns against such a temptation and a test that ensures indexing works with differing line termination characters (I didn't see such a test when I skimmed runtests.jl).

jakobnissen commented 4 years ago

Added tests and comments as requested now. If CI gives me its blessings, this can be merged.