Closed CiaranOMara closed 4 years ago
Ok the first thing to do I suppose is to rebase it onto the current state of master and get rid of file conflicts.
Merging #20 into master will increase coverage by
3.67%
. The diff coverage is67.69%
.
@@ Coverage Diff @@
## master #20 +/- ##
==========================================
+ Coverage 73.7% 77.37% +3.67%
==========================================
Files 32 8 -24
Lines 1715 389 -1326
==========================================
- Hits 1264 301 -963
+ Misses 451 88 -363
Impacted Files | Coverage Δ | |
---|---|---|
src/GenomicFeatures.jl | 100% <ø> (ø) |
|
src/queue.jl | 52.63% <0%> (+0.9%) |
:arrow_up: |
src/coverage.jl | 92.98% <100%> (-0.2%) |
:arrow_down: |
src/position.jl | 42.1% <42.1%> (ø) |
|
src/overlap.jl | 82.6% <60.86%> (+0.64%) |
:arrow_up: |
src/intervalcollection.jl | 80.61% <62.06%> (+3.6%) |
:arrow_up: |
src/interval.jl | 70.31% <78.94%> (-3.22%) |
:arrow_down: |
src/strand.jl | 88.37% <80%> (+4.65%) |
:arrow_up: |
... and 4 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 9e256b2...63eb6dd. Read the comment docs.
Awesome CI is good for OSX and Linux, for Windows we need to edit the appveyor config. It looks like it doesen't like the ' character being used to delimit strings.
Would you object to having appveyor removed and having Windows added to the Travis test matrix?
If travis can do that then yes absolutely, I thought it could only do OS X and Linux!
On Thu, Nov 14, 2019 at 22:51, CiaranOMara notifications@github.com wrote:
Would you object to having appveyor removed and having Windows added to the Travis test matrix?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.
The GenomicPosition works with the GenomicIntervalCollection. In fact, the IntervalCollection can hold mixed types and still work. There are however some weaknesses around the Queue (L443-L455) that are sensitive to parameter order, but they can be addressed later.
Given that this would be a major release, do you think the deprecation warnings are necessary? I'd prefer not to include them as I think they add complexity and clutter.
The other thing I am uncertain about is whether the historic notes regarding iteration should be stripped out - I didn't look at iteration in any detail. Would it be useful to keep them?
Do the mixed types in the collection create type uncertainty? I'm wondering if it might clobber performance? I think deprecation warnings or even deprecated functions - given this is v2.0, can probably just be stripped out.
Below is an attempt at using BenchmarkTools.jl to compare the performance of the eachoverlap
function in GenomicFeatures v1 and this PR. The first set of benchmark results are for GenomicFeatures v1 - the script is available at BenchmarkGenomicFeatures.jl.
shell> sh run.sh
...
("Array{Interval{Int64},1}", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 1.28 GiB
allocs estimate: 43084035
--------------
minimum time: 20.106 s (20.20% GC)
median time: 20.106 s (20.20% GC)
mean time: 20.106 s (20.20% GC)
maximum time: 20.106 s (20.20% GC)
--------------
samples: 1
evals/sample: 1
("IntervalCollection{Int64}", "IntervalCollection{Int64}") BenchmarkTools.Trial:
memory estimate: 1.02 GiB
allocs estimate: 34379882
--------------
minimum time: 11.426 s (5.89% GC)
median time: 11.426 s (5.89% GC)
mean time: 11.426 s (5.89% GC)
maximum time: 11.426 s (5.89% GC)
--------------
samples: 1
evals/sample: 1
The second set of benchmarking results is for this PR.
shell> julia --color="yes" --project=benchmark/ benchmark/benchmarks.jl
...
("Array{GenomicInterval{Int64},1}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 7.28 GiB
allocs estimate: 170812368
--------------
minimum time: 34.424 s (29.64% GC)
median time: 34.424 s (29.64% GC)
mean time: 34.424 s (29.64% GC)
maximum time: 34.424 s (29.64% GC)
--------------
samples: 1
evals/sample: 1
("GenomicIntervalCollection{GenomicInterval{Int64}}", "GenomicIntervalCollection{GenomicInterval{Int64}}") BenchmarkTools.Trial:
memory estimate: 1.02 GiB
allocs estimate: 34379882
--------------
minimum time: 13.039 s (22.26% GC)
median time: 13.039 s (22.26% GC)
mean time: 13.039 s (22.26% GC)
maximum time: 13.039 s (22.26% GC)
--------------
samples: 1
evals/sample: 1
Commit https://github.com/BioJulia/GenomicFeatures.jl/pull/20/commits/fb8e364a08bb9219fdb64758909989254dedc4f1 addresses the issue of memory allocation when Array{GenomicInterval{Int64},1} is present. Memory allocation is down from around 7.29 GiB and is now inline with v1's allocation of about 1.29 GiB. The commit also removes the Queue's sensitivities to mixed types (L454-R455).
Commits 1394d0cf and 64336d55 bring the execution time of the function eachoverlap
for the data combinations eachoverlap(<:GenomicIntervalCollection{GenomicInterval}, <:AbstractVector{GenomicInterval})
, and eachoverlap(<:AbstractVector{GenomicInterval}, <:AbstractVector{GenomicInterval})
to about 7s; this branch is now outperforming V1, which completes in around 17s.
This branch now generally outperforms master
.
Any["collection", "bulk_insertion", ("GenomicIntervalCollection", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "bulk_insertion", ("IntervalCollection", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: +5.80% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("GenomicIntervalCollection", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 2.98 MiB
allocs estimate: 5079
--------------
minimum time: 3.006 ms (0.00% GC)
median time: 3.998 ms (0.00% GC)
mean time: 4.615 ms (7.41% GC)
maximum time: 22.225 ms (71.61% GC)
--------------
samples: 1083
evals/sample: 1
("IntervalCollection", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 2.98 MiB
allocs estimate: 5079
--------------
minimum time: 3.192 ms (0.00% GC)
median time: 3.779 ms (0.00% GC)
mean time: 4.435 ms (7.16% GC)
maximum time: 20.592 ms (67.17% GC)
--------------
samples: 1126
evals/sample: 1
Any["collection", "bulk_insertion", ("GenomicIntervalCollection{eltype(A)}", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "bulk_insertion", ("IntervalCollection{Int}", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: -5.44% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("GenomicIntervalCollection{eltype(A)}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 2.98 MiB
allocs estimate: 5079
--------------
minimum time: 3.005 ms (0.00% GC)
median time: 3.693 ms (0.00% GC)
mean time: 4.332 ms (7.38% GC)
maximum time: 21.610 ms (66.71% GC)
--------------
samples: 1153
evals/sample: 1
("IntervalCollection{Int}", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 2.98 MiB
allocs estimate: 5079
--------------
minimum time: 3.198 ms (0.00% GC)
median time: 3.905 ms (0.00% GC)
mean time: 4.552 ms (6.84% GC)
maximum time: 20.642 ms (68.02% GC)
--------------
samples: 1097
evals/sample: 1
Any["collection", "eachoverlap", ("Array{GenomicInterval{Int64},1}", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "eachoverlap", ("Array{Interval{Int64},1}", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: -74.85% => improvement (5.00% tolerance)
memory: -10.61% => improvement (1.00% tolerance)
("Array{GenomicInterval{Int64},1}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 1.16 GiB
allocs estimate: 34539435
--------------
minimum time: 4.021 s (14.78% GC)
median time: 4.157 s (14.62% GC)
mean time: 4.157 s (14.62% GC)
maximum time: 4.294 s (14.46% GC)
--------------
samples: 2
evals/sample: 1
("Array{Interval{Int64},1}", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 1.29 GiB
allocs estimate: 43283859
--------------
minimum time: 16.533 s (4.24% GC)
median time: 16.533 s (4.24% GC)
mean time: 16.533 s (4.24% GC)
maximum time: 16.533 s (4.24% GC)
--------------
samples: 1
evals/sample: 1
Any["collection", "eachoverlap", ("Array{GenomicInterval{Int64},1}", "GenomicIntervalCollection{GenomicInterval{Int64}}")] vs Any["collection", "eachoverlap", ("Array{Interval{Int64},1}", "IntervalCollection{Int64}")]
BenchmarkTools.TrialJudgement:
time: +11.80% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("Array{GenomicInterval{Int64},1}", "GenomicIntervalCollection{GenomicInterval{Int64}}") BenchmarkTools.Trial:
memory estimate: 918.94 MiB
allocs estimate: 25884900
--------------
minimum time: 1.298 s (65.46% GC)
median time: 1.532 s (54.49% GC)
mean time: 1.481 s (57.64% GC)
maximum time: 1.560 s (52.56% GC)
--------------
samples: 4
evals/sample: 1
("Array{Interval{Int64},1}", "IntervalCollection{Int64}") BenchmarkTools.Trial:
memory estimate: 918.94 MiB
allocs estimate: 25884900
--------------
minimum time: 1.278 s (59.41% GC)
median time: 1.370 s (58.28% GC)
mean time: 1.377 s (58.90% GC)
maximum time: 1.487 s (54.53% GC)
--------------
samples: 4
evals/sample: 1
Any["collection", "eachoverlap", ("GenomicIntervalCollection{GenomicInterval{Int64}}", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "eachoverlap", ("IntervalCollection{Int64}", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: -72.04% => improvement (5.00% tolerance)
memory: -10.60% => improvement (1.00% tolerance)
("GenomicIntervalCollection{GenomicInterval{Int64}}", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 1.16 GiB
allocs estimate: 34739438
--------------
minimum time: 4.405 s (12.23% GC)
median time: 4.528 s (13.30% GC)
mean time: 4.528 s (13.30% GC)
maximum time: 4.651 s (14.31% GC)
--------------
samples: 2
evals/sample: 1
("IntervalCollection{Int64}", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 1.30 GiB
allocs estimate: 43498814
--------------
minimum time: 16.195 s (5.76% GC)
median time: 16.195 s (5.76% GC)
mean time: 16.195 s (5.76% GC)
maximum time: 16.195 s (5.76% GC)
--------------
samples: 1
evals/sample: 1
Any["collection", "eachoverlap", ("GenomicIntervalCollection{GenomicInterval{Int64}}", "GenomicIntervalCollection{GenomicInterval{Int64}}")] vs Any["collection", "eachoverlap", ("IntervalCollection{Int64}", "IntervalCollection{Int64}")]
BenchmarkTools.TrialJudgement:
time: +6.69% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("GenomicIntervalCollection{GenomicInterval{Int64}}", "GenomicIntervalCollection{GenomicInterval{Int64}}") BenchmarkTools.Trial:
memory estimate: 1.02 GiB
allocs estimate: 34398359
--------------
minimum time: 7.617 s (10.04% GC)
median time: 7.617 s (10.04% GC)
mean time: 7.617 s (10.04% GC)
maximum time: 7.617 s (10.04% GC)
--------------
samples: 1
evals/sample: 1
("IntervalCollection{Int64}", "IntervalCollection{Int64}") BenchmarkTools.Trial:
memory estimate: 1.02 GiB
allocs estimate: 34398346
--------------
minimum time: 7.139 s (9.43% GC)
median time: 7.139 s (9.43% GC)
mean time: 7.139 s (9.43% GC)
maximum time: 7.139 s (9.43% GC)
--------------
samples: 1
evals/sample: 1
Any["collection", "push", ("push", "Array{GenomicInterval{Int64},1}")] vs Any["collection", "push", ("push", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: -0.34% => invariant (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("push", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 14.50 MiB
allocs estimate: 631688
--------------
minimum time: 47.670 ms (0.00% GC)
median time: 76.773 ms (0.00% GC)
mean time: 82.572 ms (9.41% GC)
maximum time: 206.864 ms (61.27% GC)
--------------
samples: 61
evals/sample: 1
("push", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 14.50 MiB
allocs estimate: 631749
--------------
minimum time: 50.209 ms (0.00% GC)
median time: 77.036 ms (0.00% GC)
mean time: 81.841 ms (8.44% GC)
maximum time: 285.655 ms (43.20% GC)
--------------
samples: 62
evals/sample: 1
Any["intervals", ("leftposition", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("leftposition", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: -12.81% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("leftposition", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 781.33 KiB
allocs estimate: 2
--------------
minimum time: 267.144 μs (0.00% GC)
median time: 580.181 μs (0.00% GC)
mean time: 564.506 μs (6.68% GC)
maximum time: 3.660 ms (82.68% GC)
--------------
samples: 8792
evals/sample: 1
("leftposition", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 781.33 KiB
allocs estimate: 2
--------------
minimum time: 306.015 μs (0.00% GC)
median time: 665.454 μs (0.00% GC)
mean time: 641.802 μs (5.44% GC)
maximum time: 4.609 ms (78.07% GC)
--------------
samples: 7733
evals/sample: 1
Any["intervals", ("metadata", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("metadata", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: -12.84% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("metadata", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 781.33 KiB
allocs estimate: 2
--------------
minimum time: 257.692 μs (0.00% GC)
median time: 581.910 μs (0.00% GC)
mean time: 565.344 μs (6.60% GC)
maximum time: 3.772 ms (76.06% GC)
--------------
samples: 8792
evals/sample: 1
("metadata", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 781.33 KiB
allocs estimate: 2
--------------
minimum time: 349.042 μs (0.00% GC)
median time: 667.611 μs (0.00% GC)
mean time: 645.036 μs (5.35% GC)
maximum time: 3.229 ms (71.90% GC)
--------------
samples: 7706
evals/sample: 1
Any["intervals", ("rightposition", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("rightposition", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: -13.18% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("rightposition", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 781.33 KiB
allocs estimate: 2
--------------
minimum time: 265.858 μs (0.00% GC)
median time: 579.423 μs (0.00% GC)
mean time: 564.358 μs (6.53% GC)
maximum time: 3.556 ms (77.23% GC)
--------------
samples: 8807
evals/sample: 1
("rightposition", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 781.33 KiB
allocs estimate: 2
--------------
minimum time: 347.733 μs (0.00% GC)
median time: 667.408 μs (0.00% GC)
mean time: 642.656 μs (5.38% GC)
maximum time: 3.572 ms (69.09% GC)
--------------
samples: 7734
evals/sample: 1
Any["intervals", ("sort", "Array{GenomicInterval{Int64},1}")] vs Any["intervals", ("sort", "Array{Interval{Int64},1}")]
BenchmarkTools.TrialJudgement:
time: +0.37% => invariant (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
("sort", "Array{GenomicInterval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 1.14 MiB
allocs estimate: 4
--------------
minimum time: 6.757 ms (0.00% GC)
median time: 8.651 ms (0.00% GC)
mean time: 8.807 ms (0.48% GC)
maximum time: 14.018 ms (25.15% GC)
--------------
samples: 564
evals/sample: 1
("sort", "Array{Interval{Int64},1}") BenchmarkTools.Trial:
memory estimate: 1.14 MiB
allocs estimate: 4
--------------
minimum time: 7.168 ms (0.00% GC)
median time: 8.619 ms (0.00% GC)
mean time: 8.722 ms (0.46% GC)
maximum time: 12.910 ms (29.72% GC)
--------------
samples: 570
evals/sample: 1
@BenJWard, are there other concerns to address or breaking changes that should be included?
There are some trivial improvements here that v1 will benefit from. I'll open some PRs for those. This will minimise the diff of this PR and focus this PR on breaking changes and new features.
Types of changes
This PR removes the parsing machinery from
GenomicFeatures
. The removal of the parsing machinery allows theGenomicFeatures
package to focus on genomic interval/feature algorithms.:clipboard: Additional detail
The
GenomicFeatures.Interval
type was renamed to un-shadow theIntervalTrees.Interval
type.The Interval collection can now handle
AbstractGenomicInterval
types.:ballot_box_with_check: Checklist
docs/src/
.[UNRELEASED]
section of the manually curatedCHANGELOG.md
file for this repository.