BioJulia / XAM.jl

Parse and process SAM and BAM formatted files
MIT License
27 stars 13 forks source link

Type instability when reading a BAM file #18

Open phaverty opened 6 years ago

phaverty commented 6 years ago

I don't know if this is a significant bottleneck, but I noticed some type instability when looping over BAM file records:

function read_bam(bam_file)

iterate over BAM records

chr = String[]
left_pos = Int64[]
right_pos = Int64[]
for record in reader
    push!(chr, BAM.refname(record))
    push!(left_pos, leftposition(record))
    push!(right_pos, rightposition(record))
end
DataFrame(chr = chr, left = left_pos, right = right_pos)

end

@code_warntype read_bam(some_file)

!julia> @code_warntype read_bam(bam_file) Variables:

self#

bam_file record::ANY

temp#::ANY

chr::Array{String,1} left_pos::Array{Int64,1} right_pos::Array{Int64,1} itemT

Body: begin chr::Array{String,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{String,1}, svec(Any, Int64), Array{String,1}, 0, 0, 0)) # line 4:
left_pos::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) # line 5:
right_pos::Array{Int64,1} = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, 0, 0)) # line 6:
SSAValue(0) = Main.reader

temp#::ANY = (Base.start)(SSAValue(0))::ANY

   9:
   unless !((Base.done)(SSAValue(0), #temp#::ANY)::ANY)::ANY goto 39                                                                                                                                                                     
   SSAValue(1) = (Base.next)(SSAValue(0), #temp#::ANY)::ANY                                                                                                                                                                              
   record::ANY = (Core.getfield)(SSAValue(1), 1)::ANY
   #temp#::ANY = (Core.getfield)(SSAValue(1), 2)::ANY # line 7:                                                                                                                                                                          
   SSAValue(5) = ($(QuoteNode(BioAlignments.BAM.refname)))(record::ANY)::String
   $(Expr(:inbounds, false))
   # meta: location array.jl push! 652                                                                                                                                                                                                   
   SSAValue(6) = (Base.bitcast)(UInt64, (Base.check_top_bit)(1)::Int64)
   $(Expr(:foreigncall, :(:jl_array_grow_end), Void, svec(Any, UInt64), :(chr), 0, SSAValue(6), 0)) # line 653:                                                                                                                          
   # meta: location abstractarray.jl endof 134                                                                                                                                                                                           
   # meta: location abstractarray.jl linearindices 99                                                                                                                                                                                    
   # meta: location abstractarray.jl indices1 71                                                                                                                                                                                         
   # meta: location abstractarray.jl indices 64                                                                                                                                                                                          
   SSAValue(9) = (Base.arraysize)(chr::Array{String,1}, 1)::Int64
   # meta: pop location                                                                                                                                                                                                                  
   # meta: pop location                                                                                                                                                                                                                  
   # meta: pop location                                                                                                                                                                                                                  
   # meta: pop location                                                                                                                                                                                                                  
   (Base.arrayset)(chr::Array{String,1}, SSAValue(5), (Base.select_value)((Base.slt_int)(SSAValue(9), 0)::Bool, 0, SSAValue(9))::Int64)::Array{String,1}
   # meta: pop location                                                                                                                                                                                                                  
   $(Expr(:inbounds, :pop)) # line 8:                                                                                                                                                                                                    
   (Main.push!)(left_pos::Array{Int64,1}, (Main.leftposition)(record::ANY)::ANY)::Array{Int64,1} # line 9:                                                                                                                               
   (Main.push!)(right_pos::Array{Int64,1}, (Main.rightposition)(record::ANY)::ANY)::Array{Int64,1}
   37:
   goto 9
   39:  # line 11:                                                                                                                                                                                                                       
   return $(Expr(:invoke, MethodInstance for (::Core.#kw#Type)(::Array{Any,1}, ::Type{DataFrames.DataFrame}), :($(QuoteNode(Type))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any)$

end::DataFrames.DataFrame

I think the ANY type for record and temp indicate some type instability

TransGirlCodes commented 6 years ago

Thats very interesting that record should be of type ANY. At first glance, I think that should not be happening. Thanks for flagging this!

phaverty commented 6 years ago

Perhaps your iterator has the same trouble as those mentioned here: https://github.com/JuliaLang/julia/pull/26852

?

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Wed, Apr 18, 2018 at 2:37 PM, Ben J. Ward notifications@github.com wrote:

Thats very interesting that record should be of type ANY. At first glance, I think that should not be happening. Thanks for flagging this!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/BioJulia/BioAlignments.jl/issues/10#issuecomment-382537541, or mute the thread https://github.com/notifications/unsubscribe-auth/AH02K9Rt27dTdXcvOW_uHdyZVZMMI4i0ks5tp7IVgaJpZM4TaWLV .

phaverty commented 6 years ago

When following the iteration suggestion in the "Performance Tips" section, no type instability is observed. e.g. for:

reader = open(BAM.Reader, "data.bam") record = BAM.Record() while !eof(reader) read!(reader, record)

do something

end

Variables:

self#

genome_name::String reader::BioAlignments.BAM.Reader{IOStream} info::GenomicVectors.GenomeInfo{Int64} chr::Array{String,1} left_pos::Array{Int64,1} right_pos::Array{Int64,1} record::BioAlignments.BAM.Record itemT@_9
itemT@_10 itemT@_11

Body:

Maybe just update the docs to make this the standard way of looping over a BAM?