BioJulia / BigBed.jl

MIT License
4 stars 2 forks source link

Order of chromosomes in BigBed.Writer #2

Open jonathanBieler opened 4 years ago

jonathanBieler commented 4 years ago

I'm trying to write data to a BigBed file on each human chromosome, I defined my writer as such :

writer = BigBed.Writer(file, [(chr, length(genome[chr])) for chr in chrs])

With chrsbeing ["1", "2", ...]. Then I'm looping on chrs and do some write operations, but I'm getting a :

ArgumentError: disordered intervals

Because in the writer the chromosomes are getting reordered as : ["1", "10", "11"].

I managed to get around it by getting the chromosomes in the right order with:

ochrs = collect(values(writer.chromnames))[sortperm(collect(keys(writer.chromnames)))]

Would it be possible to have the writer keep the ordering it's given ? (maybe using an OrderedDict) or is there a good reason why it gets reordered ?

Alternatively a chromlist method to get the chromosome in the right order would help.

CiaranOMara commented 4 years ago

@jonathanBieler, does the following cover your idea of a chromlist method?

function Base.isless(::Int, ::Char)
    return true
end

function Base.isless(::Char, ::Int)
    return false
end

function seqname_isless(str1::String, str2::String) :: Bool

    function parse_seqname(str::String) :: Vector{Union{Char, Int}}
        arr = Vector{Char}(str)

        arr = convert(Vector{Union{Char, Int}}, arr)

        m = match(r"(\d+)", str)

        if m !== nothing
            for (capture, offset) in zip(reverse(m.captures), reverse(m.offsets))
                splice!(arr, UnitRange(offset, offset + length(capture) -1), parse(Int, capture))
            end
        end

        return arr

    end

    return parse_seqname(str1) < parse_seqname(str2)

end
julia> seqname_isless("chr1", "chrM")
true

julia> seqname_isless("chrM", "chr1")
false

julia> seqnames = ["chr11", "chr10", "chr01", "chr100", "chr1" , "chrM", "chr010"];

julia> sort(seqnames)
7-element Array{String,1}:
 "chr01" 
 "chr010"
 "chr1"  
 "chr10" 
 "chr100"
 "chr11" 
 "chrM"  

julia> sort(seqnames,lt=seqname_isless)
7-element Array{String,1}:
 "chr01" 
 "chr1"  
 "chr10" 
 "chr010"
 "chr11" 
 "chr100"
 "chrM"  

julia> seqnames = string.(1:11);

julia> sort(seqnames)
11-element Array{String,1}:
 "1" 
 "10"
 "11"
 "2" 
 "3" 
 "4" 
 "5" 
 "6" 
 "7" 
 "8" 
 "9" 

julia> sort(seqnames,lt=seqname_isless)
11-element Array{String,1}:
 "1" 
 "2" 
 "3" 
 "4" 
 "5" 
 "6" 
 "7" 
 "8" 
 "9" 
 "10"
 "11"
jonathanBieler commented 4 years ago

My issue isn't that the chromosome list is in a specific order or another, just that you need to know (or be able to set) the order it's stored internally in the Writer to be able to write without getting a disordered intervals error. So it would need to be something like I wrote above :

chromlist(writer) = 
    collect(values(writer.chromnames)[sortperm(collect(keys(writer.chromnames)))]
jonathanBieler commented 4 years ago

Here's an MWE :

    output = open("data.bb", "w")
    writer = BigBed.Writer(output, [("1", 12345), ("2", 9100), ("10", 123)])

    write(writer, ("1", 101, 150, "gene 1"))
    write(writer, ("2", 211, 250, "gene 2"))
    write(writer, ("10", 211, 250, "gene 3"))
    close(writer)
jonathanBieler commented 4 years ago

Actually it would be better if I could specify the order in the Writer ; I have a large text file I want to read line by line and write into a BigBed, but I don't want to reorder it to match the writer internal order.

Just commenting the sort here solves the problem (might create other though, I'm not sure) :

https://github.com/BioJulia/GenomicFeatures.jl/blob/8fc34ff680f5e742e25d2bd3d4722cb33fbe3cd5/src/bbi/btree.jl#L116

But conceptually I would leave the choice of ordering to the user (since it might be imposed by external constrains, like a file you got from another tool).

CiaranOMara commented 4 years ago

@jonathanBieler, I agree that this is an issue/annoyance and have made a note of it in https://github.com/BioJulia/BigBed.jl/projects/1.