Closed — xiaodaigh closed this issue 5 years ago
There were benchmarks we did some time ago on actual biological data files. The numbers were presented during a juliacon talk, I might have them somewhere but it wasn't my work or my talk so I might not have kept them. Bear in mind this package currently performs a legacy role in BioJulia, and you should probably be using TranscodingStreams.jl instead, which does what BufferedStreams does, plus a lot more.
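For reference, a minimal sketch of the TranscodingStreams.jl equivalent (assuming the package is installed; `NoopStream` is its pass-through buffered stream type, and the function name here is illustrative):

```julia
using TranscodingStreams  # provides NoopStream, a buffered pass-through stream

# Buffered writing via TranscodingStreams instead of BufferedStreams.
function write_buffered(path, x)
    io = NoopStream(open(path, "w"))
    for xi in x
        write(io, xi)
    end
    close(io)  # flushes the buffer and closes the underlying file
end
```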
I think you may need to profile to find out where exactly the bottlenecks are. If I benchmark using integers instead of strings:
using BenchmarkTools
using BufferedStreams
julia> y = rand(UInt16, 1_000_000)
1000000-element Array{UInt16,1}:
 0x53c2
 0xd42d
 0xbc43
 0x7c79
 ⋮
 0xbaee
 0xc697
julia> function fn(x)
           io = BufferedOutputStream(open("/tmp/bin.bin", "w"))
           for xi in x
               write(io, xi)
           end
           close(io)
       end
fn (generic function with 1 method)
julia> @benchmark fn($y)
BenchmarkTools.Trial:
memory estimate: 15.39 MiB
allocs estimate: 1000029
--------------
minimum time: 12.579 ms (0.00% GC)
median time: 15.038 ms (11.38% GC)
mean time: 14.805 ms (8.38% GC)
maximum time: 27.942 ms (7.10% GC)
--------------
samples: 338
evals/sample: 1
julia> function gn(x)
           io = open("/tmp/bin.bin", "w")
           for xi in x
               write(io, xi)
           end
           close(io)
       end
gn (generic function with 1 method)
julia> @benchmark gn($y)
BenchmarkTools.Trial:
memory estimate: 15.26 MiB
allocs estimate: 1000010
--------------
minimum time: 17.986 ms (0.00% GC)
median time: 20.193 ms (8.41% GC)
mean time: 19.998 ms (6.13% GC)
maximum time: 29.700 ms (6.71% GC)
--------------
samples: 250
evals/sample: 1
The `write` methods of buffered output streams are written in terms of a single `UInt8` input or an array of `UInt8`, so writing other element types one at a time falls back to a slower generic path.
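One way to sidestep that per-element path (a hedged sketch, not something the package does for you) is to hand the stream all of the raw bytes in a single call; `reinterpret` presents the `UInt16` vector as bytes without copying:

```julia
# Write the whole vector's raw bytes in one call rather than one
# write per element; reinterpret makes a zero-copy byte view.
function write_bulk(path, x::Vector{UInt16})
    io = open(path, "w")
    write(io, reinterpret(UInt8, x))  # single bulk write of 2 * length(x) bytes
    close(io)
end
```

This trades per-element dispatch and buffering overhead for one large write, which is usually the fastest option when the data is already contiguous in memory.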
julia> z = rand(UInt8, 1_000_000)
1000000-element Array{UInt8,1}:
 0x08
 0xf1
 0x76
 0xc6
 ⋮
 0x25
 0xd2
julia> @benchmark fn($z)
BenchmarkTools.Trial:
memory estimate: 129.20 KiB
allocs estimate: 21
--------------
minimum time: 1.648 ms (0.00% GC)
median time: 2.152 ms (0.00% GC)
mean time: 2.267 ms (0.75% GC)
maximum time: 13.276 ms (0.00% GC)
--------------
samples: 2197
evals/sample: 1
julia> @benchmark gn($z)
BenchmarkTools.Trial:
memory estimate: 672 bytes
allocs estimate: 10
--------------
minimum time: 5.561 ms (0.00% GC)
median time: 6.294 ms (0.00% GC)
mean time: 6.415 ms (0.00% GC)
maximum time: 14.218 ms (0.00% GC)
--------------
samples: 778
evals/sample: 1
So it may be that you're doing a lot of work coercing types, or something else that has little to do with the streams. If the BufferedStream calls a write method that has to wait for the disk less often than your base IO stream does, then it's doing its job. Any other performance oddities need to be chased down through profiling.
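To chase that down, the `Profile` stdlib gives a line-level view of where time goes; a minimal sketch (the function here mirrors `gn` above, with a temp path instead of `/tmp/bin.bin`):

```julia
using Profile

# Per-element writer, as in the benchmarks above.
function gn(path, x)
    io = open(path, "w")
    for xi in x
        write(io, xi)
    end
    close(io)
end

y = rand(UInt16, 1_000_000)
path = tempname()
gn(path, y)          # warm-up run so compilation doesn't dominate the profile
Profile.clear()
@profile gn(path, y)
Profile.print(format = :flat, sortedby = :count)  # flat view, hottest lines last
```

If most samples land in type conversion or allocation rather than the actual I/O call, the stream isn't the bottleneck.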
Do you have a set of benchmarks to show that it's actually faster?
This doesn't indicate that it's faster on Julia 1.2.