BioJulia / Bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia
http://biojulia.dev
MIT License

StringField broken on Julia 0.6 #469

Closed phaverty closed 6 years ago

phaverty commented 7 years ago

The StringField used as the chromosome name does not work in 0.6 as String no longer has a .data field. Now that strings are fast, how about just using that? Since chromosome names will not be edited, how about just using Symbol? (I think chromosome names are just used as hash keys and for printing, right?)
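For context, here is a minimal sketch of reading a string's bytes without the removed .data field. It assumes Julia 1.x (codeunits is not available on 0.6), the variable names are illustrative, and it is not code taken from Bio.jl.

```julia
# Assumes Julia 1.x; names are illustrative, not taken from Bio.jl.
name = "chr1"

# Read-only view of the string's UTF-8 bytes (no copy is made).
bytes_view = codeunits(name)

# Independent, mutable copy of the same bytes.
bytes_copy = Vector{UInt8}(name)

# Strings hash by content, so they can be used directly as Dict keys.
hits = Dict{String,Int}()
hits[name] = get(hits, name, 0) + 1
```

Either form gives access to the raw bytes, and the plain String still works as a hash key and prints as is.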

bicycle1885 commented 7 years ago

I'd like to deprecate StringField in the next version of Bio.jl because the new design of the file readers does not need StringField for performance. In fact, the new modular packages (e.g. https://github.com/BioJulia/GenomicFeatures.jl, https://github.com/BioJulia/BioAlignments.jl) do not use StringField at all.

So, my answers to your suggestions:

Now that strings are fast, how about just using that?

I will do that 👍.

how about just using Symbol?

I don't think that is a good idea, because Symbol lacks many of the operations that String supports. People may want to trim prefixes from chromosome names (e.g. UCSC => NCBI style). Also, my quick benchmark suggests that converting a Vector{UInt8} to a Symbol takes longer than converting it to a String.

julia> @benchmark Symbol($(b"chr1"))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     69.170 ns (0.00% GC)
  median time:      69.646 ns (0.00% GC)
  mean time:        75.515 ns (0.00% GC)
  maximum time:     270.669 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     974

julia> @benchmark String($(b"chr1"))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.529 ns (0.00% GC)
  median time:      9.608 ns (0.00% GC)
  mean time:        10.300 ns (0.00% GC)
  maximum time:     62.710 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999
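
The prefix-trimming point above can be sketched as follows. This assumes Julia 1.x; to_ncbi is a hypothetical helper written for illustration, not a Bio.jl function, and it ignores special cases such as the mitochondrial chromosome.

```julia
# Assumes Julia 1.x; `to_ncbi` is a hypothetical helper, not part of Bio.jl.
# Drop a leading "chr" (UCSC style) to get the bare name (NCBI style).
function to_ncbi(name::AbstractString)
    # "chr" is three ASCII bytes, so index 4 is always a valid character start.
    return startswith(name, "chr") ? String(name[4:end]) : String(name)
end

ucsc = "chr1"
ncbi = to_ncbi(ucsc)

# Symbol has no startswith or indexing methods, so the same edit
# has to round-trip through String and back.
sym = :chr1
ncbi_sym = Symbol(to_ncbi(String(sym)))
```

With String the edit is a single call; with Symbol every such operation needs a conversion on the way in and on the way out.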

kescobo commented 7 years ago

I'd like to deprecate StringField in the next version of Bio.jl because the new design of file readers does not require StringField for performance.

@bicycle1885 This makes good sense to me!

bicycle1885 commented 7 years ago

The benchmark above is not fair. A more realistic one is shown below, but String is still faster.

julia> @benchmark String($(b"chr1")[1:3])
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     45.155 ns (0.00% GC)
  median time:      47.632 ns (0.00% GC)
  mean time:        61.487 ns (15.14% GC)
  maximum time:     2.281 μs (92.83% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark Symbol($(b"chr1")[1:3])
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     109.491 ns (0.00% GC)
  median time:      110.975 ns (0.00% GC)
  mean time:        121.470 ns (4.27% GC)
  maximum time:     1.847 μs (87.14% GC)
  --------------
  samples:          10000
  evals/sample:     921
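
The slicing in the benchmarks above resembles how a field would be pulled out of a larger byte buffer. Here is a minimal sketch of that pattern, assuming Julia 1.x and a hypothetical tab-separated record; it is not the parser Bio.jl uses.

```julia
# Assumes Julia 1.x; the record layout and names are hypothetical.
buf = Vector{UInt8}("chr1\t100\t200")

# buf[1:4] allocates a fresh Vector{UInt8}, so `buf` stays intact
# and can be reused for the next record.
name = String(buf[1:4])

# The extracted name is an ordinary String and can be compared or hashed.
is_chr1 = name == "chr1"
```

The copy made by the slice accounts for the allocations that the first, zero-allocation benchmark did not include.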