CiaranOMara commented 4 years ago

Types of changes

This PR removes the parsing machinery from GenomicFeatures. The removal of the parsing machinery allows the GenomicFeatures package to focus on genomic interval/feature algorithms.

:clipboard: Additional detail

TheGenomicFeatures.Interval type was renamed to un-shadow the IntervalTrees.Interval type.
The Interval collection can now handle AbstractGenomicInterval types.

:ballot_box_with_check: Checklist

[x] :art: The changes implemented is consistent with the julia style guide.
[x] :blue_book: I have updated and added relevant docstrings, in a manner consistent with the documentation styleguide.
[x] :blue_book: I have added or updated relevant user and developer manuals/documentation in docs/src/.
[x] :ok: There are unit tests that cover the code changes I have made.
[x] :ok: The unit tests cover my code changes AND they pass.
[x] :pencil: I have added an entry to the [UNRELEASED] section of the manually curated CHANGELOG.md file for this repository.
[x] :ok: All changes should be compatible with the latest stable version of Julia.
[x] :thought_balloon: I have commented liberally for any complex pieces of internal code.

TransGirlCodes commented 4 years ago

Ok the first thing to do I suppose is to rebase it onto the current state of master and get rid of file conflicts.

codecov[bot] commented 4 years ago

Codecov Report

Merging #20 into master will increase coverage by 3.67%. The diff coverage is 67.69%.

@@            Coverage Diff             @@
##           master      #20      +/-   ##
==========================================
+ Coverage    73.7%   77.37%   +3.67%     
==========================================
  Files          32        8      -24     
  Lines        1715      389    -1326     
==========================================
- Hits         1264      301     -963     
+ Misses        451       88     -363

Impacted Files	Coverage Δ
src/GenomicFeatures.jl	`100% <ø> (ø)`
src/queue.jl	`52.63% <0%> (+0.9%)`	:arrow_up:
src/coverage.jl	`92.98% <100%> (-0.2%)`	:arrow_down:
src/position.jl	`42.1% <42.1%> (ø)`
src/overlap.jl	`82.6% <60.86%> (+0.64%)`	:arrow_up:
src/intervalcollection.jl	`80.61% <62.06%> (+3.6%)`	:arrow_up:
src/interval.jl	`70.31% <78.94%> (-3.22%)`	:arrow_down:
src/strand.jl	`88.37% <80%> (+4.65%)`	:arrow_up:
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 9e256b2...63eb6dd. Read the comment docs.

TransGirlCodes commented 4 years ago

Awesome CI is good for OSX and Linux, for Windows we need to edit the appveyor config. It looks like it doesen't like the ' character being used to delimit strings.

CiaranOMara commented 4 years ago

Would you object to having appveyor removed and having Windows added to the Travis test matrix?

TransGirlCodes commented 4 years ago

If travis can do that then yes absolutely, I thought it could only do OS X and Linux!

On Thu, Nov 14, 2019 at 22:51, CiaranOMara notifications@github.com wrote:

Would you object to having appveyor removed and having Windows added to the Travis test matrix?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

CiaranOMara commented 4 years ago

The GenomicPosition works with the GenomicIntervalCollection. In fact, the IntervalCollection can hold mixed types and still work. There are however some weaknesses around the Queue (L443-L455) that are sensitive to parameter order, but they can be addressed later.

Given that this would be a major release, do you think the deprecation warnings are necessary? I'd prefer not to include them as I think they add complexity and clutter.

The other thing I am uncertain about is whether the historic notes regarding iteration should be stripped out - I didn't look at iteration in any detail. Would it be useful to keep them?

TransGirlCodes commented 4 years ago

Do the mixed types in the collection create type uncertainty? I'm wondering if it might clobber performance? I think deprecation warnings or even deprecated functions - given this is v2.0, can probably just be stripped out.

CiaranOMara commented 4 years ago

Below is an attempt at using BenchmarkTools.jl to compare the performance of the eachoverlap function in GenomicFeatures v1 and this PR. The first set of benchmark results are for GenomicFeatures v1 - the script is available at BenchmarkGenomicFeatures.jl.

shell> sh run.sh

...

("Array{Interval{Int64},1}", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  1.28 GiB
  allocs estimate:  43084035
  --------------
  minimum time:     20.106 s (20.20% GC)
  median time:      20.106 s (20.20% GC)
  mean time:        20.106 s (20.20% GC)
  maximum time:     20.106 s (20.20% GC)
  --------------
  samples:          1
  evals/sample:     1
("IntervalCollection{Int64}", "IntervalCollection{Int64}") BenchmarkTools.Trial: 
  memory estimate:  1.02 GiB
  allocs estimate:  34379882
  --------------
  minimum time:     11.426 s (5.89% GC)
  median time:      11.426 s (5.89% GC)
  mean time:        11.426 s (5.89% GC)
  maximum time:     11.426 s (5.89% GC)
  --------------
  samples:          1
  evals/sample:     1

The second set of benchmarking results is for this PR.

shell> julia --color="yes" --project=benchmark/ benchmark/benchmarks.jl

...

("Array{GenomicInterval{Int64},1}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  7.28 GiB
  allocs estimate:  170812368
  --------------
  minimum time:     34.424 s (29.64% GC)
  median time:      34.424 s (29.64% GC)
  mean time:        34.424 s (29.64% GC)
  maximum time:     34.424 s (29.64% GC)
  --------------
  samples:          1
  evals/sample:     1
("GenomicIntervalCollection{GenomicInterval{Int64}}", "GenomicIntervalCollection{GenomicInterval{Int64}}") BenchmarkTools.Trial: 
  memory estimate:  1.02 GiB
  allocs estimate:  34379882
  --------------
  minimum time:     13.039 s (22.26% GC)
  median time:      13.039 s (22.26% GC)
  mean time:        13.039 s (22.26% GC)
  maximum time:     13.039 s (22.26% GC)
  --------------
  samples:          1
  evals/sample:     1

CiaranOMara commented 4 years ago

Commit https://github.com/BioJulia/GenomicFeatures.jl/pull/20/commits/fb8e364a08bb9219fdb64758909989254dedc4f1 addresses the issue of memory allocation when Array{GenomicInterval{Int64},1} is present. Memory allocation is down from around 7.29 GiB and is now inline with v1's allocation of about 1.29 GiB. The commit also removes the Queue's sensitivities to mixed types (L454-R455).

CiaranOMara commented 4 years ago

Commits 1394d0cf and 64336d55 bring the execution time of the function eachoverlap for the data combinations eachoverlap(<:GenomicIntervalCollection{GenomicInterval}, <:AbstractVector{GenomicInterval}), and eachoverlap(<:AbstractVector{GenomicInterval}, <:AbstractVector{GenomicInterval}) to about 7s; this branch is now outperforming V1, which completes in around 17s.

CiaranOMara commented 4 years ago

This branch now generally outperforms master.

Bulk insertion

Any["collection", "bulk_insertion", ("GenomicIntervalCollection", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "bulk_insertion", ("IntervalCollection", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   +5.80% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("GenomicIntervalCollection", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  2.98 MiB
  allocs estimate:  5079
  --------------
  minimum time:     3.006 ms (0.00% GC)
  median time:      3.998 ms (0.00% GC)
  mean time:        4.615 ms (7.41% GC)
  maximum time:     22.225 ms (71.61% GC)
  --------------
  samples:          1083
  evals/sample:     1

("IntervalCollection", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  2.98 MiB
  allocs estimate:  5079
  --------------
  minimum time:     3.192 ms (0.00% GC)
  median time:      3.779 ms (0.00% GC)
  mean time:        4.435 ms (7.16% GC)
  maximum time:     20.592 ms (67.17% GC)
  --------------
  samples:          1126
  evals/sample:     1

Any["collection", "bulk_insertion", ("GenomicIntervalCollection{eltype(A)}", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "bulk_insertion", ("IntervalCollection{Int}", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   -5.44% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("GenomicIntervalCollection{eltype(A)}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  2.98 MiB
  allocs estimate:  5079
  --------------
  minimum time:     3.005 ms (0.00% GC)
  median time:      3.693 ms (0.00% GC)
  mean time:        4.332 ms (7.38% GC)
  maximum time:     21.610 ms (66.71% GC)
  --------------
  samples:          1153
  evals/sample:     1

("IntervalCollection{Int}", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  2.98 MiB
  allocs estimate:  5079
  --------------
  minimum time:     3.198 ms (0.00% GC)
  median time:      3.905 ms (0.00% GC)
  mean time:        4.552 ms (6.84% GC)
  maximum time:     20.642 ms (68.02% GC)
  --------------
  samples:          1097
  evals/sample:     1

Eachoverlap

Any["collection", "eachoverlap", ("Array{GenomicInterval{Int64},1}", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "eachoverlap", ("Array{Interval{Int64},1}", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   -74.85% => improvement (5.00% tolerance)
  memory: -10.61% => improvement (1.00% tolerance)

("Array{GenomicInterval{Int64},1}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  1.16 GiB
  allocs estimate:  34539435
  --------------
  minimum time:     4.021 s (14.78% GC)
  median time:      4.157 s (14.62% GC)
  mean time:        4.157 s (14.62% GC)
  maximum time:     4.294 s (14.46% GC)
  --------------
  samples:          2
  evals/sample:     1

("Array{Interval{Int64},1}", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  1.29 GiB
  allocs estimate:  43283859
  --------------
  minimum time:     16.533 s (4.24% GC)
  median time:      16.533 s (4.24% GC)
  mean time:        16.533 s (4.24% GC)
  maximum time:     16.533 s (4.24% GC)
  --------------
  samples:          1
  evals/sample:     1

Any["collection", "eachoverlap", ("Array{GenomicInterval{Int64},1}", "GenomicIntervalCollection{GenomicInterval{Int64}}")] vs Any["collection", "eachoverlap", ("Array{Interval{Int64},1}", "IntervalCollection{Int64}")]
BenchmarkTools.TrialJudgement: 
  time:   +11.80% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("Array{GenomicInterval{Int64},1}", "GenomicIntervalCollection{GenomicInterval{Int64}}") BenchmarkTools.Trial: 
  memory estimate:  918.94 MiB
  allocs estimate:  25884900
  --------------
  minimum time:     1.298 s (65.46% GC)
  median time:      1.532 s (54.49% GC)
  mean time:        1.481 s (57.64% GC)
  maximum time:     1.560 s (52.56% GC)
  --------------
  samples:          4
  evals/sample:     1

("Array{Interval{Int64},1}", "IntervalCollection{Int64}") BenchmarkTools.Trial: 
  memory estimate:  918.94 MiB
  allocs estimate:  25884900
  --------------
  minimum time:     1.278 s (59.41% GC)
  median time:      1.370 s (58.28% GC)
  mean time:        1.377 s (58.90% GC)
  maximum time:     1.487 s (54.53% GC)
  --------------
  samples:          4
  evals/sample:     1

Any["collection", "eachoverlap", ("GenomicIntervalCollection{GenomicInterval{Int64}}", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "eachoverlap", ("IntervalCollection{Int64}", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   -72.04% => improvement (5.00% tolerance)
  memory: -10.60% => improvement (1.00% tolerance)

("GenomicIntervalCollection{GenomicInterval{Int64}}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  1.16 GiB
  allocs estimate:  34739438
  --------------
  minimum time:     4.405 s (12.23% GC)
  median time:      4.528 s (13.30% GC)
  mean time:        4.528 s (13.30% GC)
  maximum time:     4.651 s (14.31% GC)
  --------------
  samples:          2
  evals/sample:     1

("IntervalCollection{Int64}", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  1.30 GiB
  allocs estimate:  43498814
  --------------
  minimum time:     16.195 s (5.76% GC)
  median time:      16.195 s (5.76% GC)
  mean time:        16.195 s (5.76% GC)
  maximum time:     16.195 s (5.76% GC)
  --------------
  samples:          1
  evals/sample:     1

Any["collection", "eachoverlap", ("GenomicIntervalCollection{GenomicInterval{Int64}}", "GenomicIntervalCollection{GenomicInterval{Int64}}")] vs Any["collection", "eachoverlap", ("IntervalCollection{Int64}", "IntervalCollection{Int64}")]
BenchmarkTools.TrialJudgement: 
  time:   +6.69% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("GenomicIntervalCollection{GenomicInterval{Int64}}", "GenomicIntervalCollection{GenomicInterval{Int64}}") BenchmarkTools.Trial: 
  memory estimate:  1.02 GiB
  allocs estimate:  34398359
  --------------
  minimum time:     7.617 s (10.04% GC)
  median time:      7.617 s (10.04% GC)
  mean time:        7.617 s (10.04% GC)
  maximum time:     7.617 s (10.04% GC)
  --------------
  samples:          1
  evals/sample:     1

("IntervalCollection{Int64}", "IntervalCollection{Int64}") BenchmarkTools.Trial: 
  memory estimate:  1.02 GiB
  allocs estimate:  34398346
  --------------
  minimum time:     7.139 s (9.43% GC)
  median time:      7.139 s (9.43% GC)
  mean time:        7.139 s (9.43% GC)
  maximum time:     7.139 s (9.43% GC)
  --------------
  samples:          1
  evals/sample:     1

Pushing to Collections

Any["collection", "push", ("push", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "push", ("push", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   -0.34% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("push", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  14.50 MiB
  allocs estimate:  631688
  --------------
  minimum time:     47.670 ms (0.00% GC)
  median time:      76.773 ms (0.00% GC)
  mean time:        82.572 ms (9.41% GC)
  maximum time:     206.864 ms (61.27% GC)
  --------------
  samples:          61
  evals/sample:     1

("push", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  14.50 MiB
  allocs estimate:  631749
  --------------
  minimum time:     50.209 ms (0.00% GC)
  median time:      77.036 ms (0.00% GC)
  mean time:        81.841 ms (8.44% GC)
  maximum time:     285.655 ms (43.20% GC)
  --------------
  samples:          62
  evals/sample:     1

Interval functions

Any["intervals", ("leftposition", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("leftposition", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   -12.81% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("leftposition", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     267.144 μs (0.00% GC)
  median time:      580.181 μs (0.00% GC)
  mean time:        564.506 μs (6.68% GC)
  maximum time:     3.660 ms (82.68% GC)
  --------------
  samples:          8792
  evals/sample:     1

("leftposition", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     306.015 μs (0.00% GC)
  median time:      665.454 μs (0.00% GC)
  mean time:        641.802 μs (5.44% GC)
  maximum time:     4.609 ms (78.07% GC)
  --------------
  samples:          7733
  evals/sample:     1

Any["intervals", ("metadata", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("metadata", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   -12.84% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("metadata", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     257.692 μs (0.00% GC)
  median time:      581.910 μs (0.00% GC)
  mean time:        565.344 μs (6.60% GC)
  maximum time:     3.772 ms (76.06% GC)
  --------------
  samples:          8792
  evals/sample:     1

("metadata", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     349.042 μs (0.00% GC)
  median time:      667.611 μs (0.00% GC)
  mean time:        645.036 μs (5.35% GC)
  maximum time:     3.229 ms (71.90% GC)
  --------------
  samples:          7706
  evals/sample:     1

Any["intervals", ("rightposition", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("rightposition", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   -13.18% => improvement (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("rightposition", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     265.858 μs (0.00% GC)
  median time:      579.423 μs (0.00% GC)
  mean time:        564.358 μs (6.53% GC)
  maximum time:     3.556 ms (77.23% GC)
  --------------
  samples:          8807
  evals/sample:     1

("rightposition", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     347.733 μs (0.00% GC)
  median time:      667.408 μs (0.00% GC)
  mean time:        642.656 μs (5.38% GC)
  maximum time:     3.572 ms (69.09% GC)
  --------------
  samples:          7734
  evals/sample:     1

Any["intervals", ("sort", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("sort", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement: 
  time:   +0.37% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

("sort", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  1.14 MiB
  allocs estimate:  4
  --------------
  minimum time:     6.757 ms (0.00% GC)
  median time:      8.651 ms (0.00% GC)
  mean time:        8.807 ms (0.48% GC)
  maximum time:     14.018 ms (25.15% GC)
  --------------
  samples:          564
  evals/sample:     1

("sort", "Array{Interval{Int64},1}") BenchmarkTools.Trial: 
  memory estimate:  1.14 MiB
  allocs estimate:  4
  --------------
  minimum time:     7.168 ms (0.00% GC)
  median time:      8.619 ms (0.00% GC)
  mean time:        8.722 ms (0.46% GC)
  maximum time:     12.910 ms (29.72% GC)
  --------------
  samples:          570
  evals/sample:     1

CiaranOMara commented 4 years ago

@BenJWard, are there other concerns to address or breaking changes that should be included?

CiaranOMara commented 4 years ago

There are some trivial improvements here that v1 will benefit from. I'll open some PRs for those. This will minimise the diff of this PR and focus this PR on breaking changes and new features.

BioJulia / GenomicFeatures.jl

Release/v2.0.0 #20

Types of changes

:clipboard: Additional detail

:ballot_box_with_check: Checklist

Codecov Report

Bulk insertion

Eachoverlap

Pushing to Collections

Interval functions