diegozea / MIToS.jl

A Julia package to analyze protein sequences, structures, and evolutionary information
https://diegozea.github.io/MIToS.jl/stable/
Other
75 stars 18 forks source link

PFAM is not always round-trippable? #65

Closed timholy closed 1 year ago

timholy commented 1 year ago

If I load a Pfam file, perform some manipulations, and then write it back out to disk, I can't load what I write:

julia> using MIToS, MIToS.Pfam, MIToS.MSA

julia> pfamfile = "PF18883.alignment.full.gz"
"PF18883.alignment.full.gz"

julia> pfamfile = downloadpfam("PF18883"; filename=pfamfile)
"PF18883.alignment.full.gz"

julia> msa = read(pfamfile, Stockholm, generatemapping=true, useidcoordinates=true);

julia> filtersequences!(msa, coverage(msa) .>= 0.9);

julia> open("PF18883.alignment.fullcoverage", "w") do io
           print(io, msa, Stockholm)
       end

julia> msa2 =  read("PF18883.alignment.fullcoverage", Stockholm, generatemapping=true, useidcoordinates=true)
ERROR: The last residue number 685 in sequence 1 isn't the end coordinate 686
Stacktrace:
 [1] _to_msa_mapping(sequences::Vector{String}, ids::Vector{String})
   @ MIToS.MSA ~/.julia/dev/MIToS/src/MSA/GeneralParserMethods.jl:56
 [2] _generate_annotated_msa(annot::Annotations, IDS::Vector{String}, SEQS::Vector{String}, keepinserts::Bool, generatemapping::Bool, useidcoordinates::Bool, deletefullgaps::Bool)
   @ MIToS.MSA ~/.julia/dev/MIToS/src/MSA/GeneralParserMethods.jl:142
 [3] #parse#25
   @ ~/.julia/dev/MIToS/src/MSA/Stockholm.jl:93 [inlined]
 [4] #parse#29
   @ ~/.julia/dev/MIToS/src/MSA/Stockholm.jl:133 [inlined]
 [5] _read(::String, ::String, ::Type{Stockholm}; kargs::Base.Pairs{Symbol, Bool, Tuple{Symbol, Symbol}, NamedTuple{(:generatemapping, :useidcoordinates), Tuple{Bool, Bool}}})
   @ MIToS.Utils ~/.julia/dev/MIToS/src/Utils/Read.jl:105
 [6] read(::String, ::Type{Stockholm}; kargs::Base.Pairs{Symbol, Bool, Tuple{Symbol, Symbol}, NamedTuple{(:generatemapping, :useidcoordinates), Tuple{Bool, Bool}}})
   @ MIToS.Utils ~/.julia/dev/MIToS/src/Utils/Read.jl:135
 [7] top-level scope
   @ REPL[7]:1
timholy commented 1 year ago

Note this requires #64

diegozea commented 1 year ago

Hi Tim!

That is intended; when using generatemapping=true and useidcoordinates=true we are using the sequence coordinates in the sequence name of the Pfam MSA for generating the residue mapping. But, it fails because some columns have been removed. Therefore, the Pfam names indicate more residues than those in the sequences (as we do not change sequence names). So, the last read should not include the generatemapping and useidcoordinates keyword arguments, as MIToS already saved the mapping annotations in the alignment (when we use the Stockholm format):

image

I have added a warning to inform the user about the pre-existing modifications and made the error a little more explicit:

image

Best regards

timholy commented 1 year ago

Thanks for the explanation, and the warning!